Ollama-Compatible API#

SGLang provides Ollama API compatibility, allowing you to use the Ollama CLI and Python library with SGLang as the inference backend.

Prerequisites#

# Install the Ollama Python library (for Python client usage)
pip install ollama

Note: You don't need the Ollama server installed; SGLang acts as the backend. You only need the ollama CLI or Python library as the client.

Endpoints#

| Endpoint | Method | Description |
| --- | --- | --- |
| `/` | GET, HEAD | Health check for the Ollama CLI |
| `/api/tags` | GET | List available models |
| `/api/chat` | POST | Chat completions (streaming & non-streaming) |
| `/api/generate` | POST | Text generation (streaming & non-streaming) |
| `/api/show` | POST | Model information |
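
As a quick sanity check, these endpoints can be exercised directly over HTTP. Below is a minimal sketch using only the Python standard library; it assumes a server already running on localhost:30001 and serving Qwen/Qwen2.5-1.5B-Instruct (see Quick Start below), with response field names as defined by the Ollama API spec:

import json
import urllib.request

BASE = "http://localhost:30001"  # assumption: port from the Quick Start below

# GET /api/tags - list available models
with urllib.request.urlopen(f"{BASE}/api/tags") as resp:
    print(json.load(resp))

# POST /api/generate - non-streaming text generation
payload = json.dumps({
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Say hello in one sentence.",
    "stream": False,
}).encode()
req = urllib.request.Request(
    f"{BASE}/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])  # "response" per the Ollama API schema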

Quick Start#

1. Launch SGLang Server#

python -m sglang.launch_server \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --port 30001 \
    --host 0.0.0.0

Note: The model name used with ollama run must match exactly what you passed to --model.
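
You can confirm the server is up before pointing any client at it; the Ollama CLI performs the same kind of health probe against / (see the Endpoints table above). A stdlib-only sketch, assuming the host and port from the launch command:

import urllib.request

# GET / - health check endpoint
with urllib.request.urlopen("http://localhost:30001/") as resp:
    print(resp.status)  # 200 once the server is ready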

2. Use Ollama CLI#

# List available models
OLLAMA_HOST=http://localhost:30001 ollama list

# Interactive chat
OLLAMA_HOST=http://localhost:30001 ollama run "Qwen/Qwen2.5-1.5B-Instruct"

If connecting to a remote server behind a firewall:

# SSH tunnel
ssh -L 30001:localhost:30001 user@gpu-server -N &

# Then use Ollama CLI as above
OLLAMA_HOST=http://localhost:30001 ollama list

3. Use Ollama Python Library#

import ollama

client = ollama.Client(host='http://localhost:30001')

# Non-streaming
response = client.chat(
    model='Qwen/Qwen2.5-1.5B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response['message']['content'])

# Streaming
stream = client.chat(
    model='Qwen/Qwen2.5-1.5B-Instruct',
    messages=[{'role': 'user', 'content': 'Tell me a story'}],
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
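
The same client covers the remaining endpoints: list() maps to /api/tags, show() to /api/show, and generate() to /api/generate. A short sketch continuing from the snippet above (the prompt text is just an example):

# List models (GET /api/tags)
print(client.list())

# Model information (POST /api/show)
print(client.show('Qwen/Qwen2.5-1.5B-Instruct'))

# Plain text generation (POST /api/generate), non-streaming
result = client.generate(
    model='Qwen/Qwen2.5-1.5B-Instruct',
    prompt='Write a haiku about GPUs.',
)
print(result['response'])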

Smart Router#

For intelligent routing between local Ollama (fast) and remote SGLang (powerful) using an LLM judge, see the Smart Router documentation.
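
As a rough illustration of the idea only (this is a hypothetical sketch, not the actual Smart Router implementation; the hosts, judge model, and prompts below are all placeholders):

import ollama

local = ollama.Client(host='http://localhost:11434')    # placeholder: local Ollama
remote = ollama.Client(host='http://gpu-server:30001')  # placeholder: remote SGLang

def route(prompt: str) -> str:
    # A small local model acts as the LLM judge.
    verdict = local.generate(
        model='llama3.2:1b',  # placeholder judge model
        prompt=f"Answer with exactly one word, 'simple' or 'complex': {prompt}",
    )['response'].strip().lower()
    if 'complex' in verdict:
        client, model = remote, 'Qwen/Qwen2.5-1.5B-Instruct'
    else:
        client, model = local, 'llama3.2:1b'
    return client.generate(model=model, prompt=prompt)['response']

print(route('Summarize the plot of Hamlet in one sentence.'))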

Summary#

| Component | Purpose |
| --- | --- |
| Ollama API | Familiar CLI/API that developers already know |
| SGLang Backend | High-performance inference engine |
| Smart Router | Intelligent routing: fast local for simple tasks, powerful remote for complex tasks |