Anthropic-Compatible API - SGLang Documentation

SGLang ships an Anthropic-compatible /v1/messages endpoint so any client built for the Anthropic Messages API — including the Anthropic SDKs and agentic CLIs such as Claude Code — can talk to a self-hosted SGLang server without changes. A complete reference for the API is available in the Anthropic API Reference. The endpoint is registered automatically on every SGLang server; no extra flag is required to enable it. It reuses the same model, chat template, and reasoning / tool-call parsers as the OpenAI-compatible endpoint, and supports both non-streaming and streaming responses, tool use, and a count_tokens route. This tutorial covers:

POST /v1/messages (non-streaming and streaming)
POST /v1/messages/count_tokens
Pointing Claude Code at the server, including the CLAUDE_CODE_ATTRIBUTION_HEADER setting that is required for good prefix-cache reuse.

Launch A Server

Launch the server in your terminal and wait for it to initialize. The Anthropic /v1/messages endpoint is registered automatically — no extra flag is required beyond the usual server launch. The example below is a single-node GLM-5.2-FP8 config; see the GLM-5.2 cookbook for verified commands across hardware and quantizations.

Command

sglang serve \
    --model-path zai-org/GLM-5.2-FP8 \
    --tp 8 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 6 \
    --reasoning-parser glm45 \
    --tool-call-parser glm47 \
    --host 0.0.0.0 \
    --port 30000

The endpoint is model-agnostic. The /v1/messages route is on by default for any model; GLM-5.2 is used here because its reasoning + tool-use output is where Claude Code integration shines, but any model works.
Model name and [1m]. SGLang does not validate the request model field, so Claude Code can send any name. The [1m] suffix is a client-side hint: Claude Code only enables its 1M-context beta when the model name ends in [1m] — without it, context is capped. Set the same glm-5.2[1m] in the ANTHROPIC_DEFAULT_*_MODEL env vars below.
--reasoning-parser / --tool-call-parser are optional. Add them when the model emits reasoning content (GLM-5.2, Qwen3, DeepSeek-R1, …) or when you want tool calls parsed into structured tool_use blocks. Without a tool-call parser, tool schemas are still accepted but the model’s tool calls come back as raw text, and Claude Code cannot execute them.
Context length defaults to the model’s own (1M for GLM-5.2); pass --context-length only to cap it.
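Large checkpoints can take a while to load, so a client script may want to wait before sending its first request. The following is a minimal sketch, assuming the server answers GET /health with HTTP 200 once it is ready; that route is an assumption not stated on this page, so substitute whichever readiness route your SGLang version provides. The helper names http_probe and wait_until_ready are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_probe,
    attempts: int = 60,
    delay_s: float = 5.0,
) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

A call such as wait_until_ready("http://127.0.0.1:30000/health") would block until the assumed route responds or the attempts are exhausted. The probe is injectable so the polling logic can be exercised without a running server.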

Send A Message

Non-Streaming

Use the Anthropic Python SDK pointed at the server. Unlike the OpenAI SDK, the Anthropic SDK appends /v1/messages itself, so base_url is the server root without a /v1 suffix.

Example

from anthropic import Anthropic

client = Anthropic(
    base_url="http://127.0.0.1:30000",
    api_key="EMPTY",  # SGLang does not require a real key by default
)

message = client.messages.create(
    model="zai-org/GLM-5.2-FP8",
    max_tokens=512,
    messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
)
# A reasoning model may emit a `thinking` block before the `text` block —
# pick the text block rather than assuming content[0].
print(next(b.text for b in message.content if b.type == "text"))

Example Output:

Output

Here are 3 countries and their capitals:

**France** - Paris
**Japan** - Tokyo
**Brazil** - Brasília
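For clients that do not use the SDK, the same request can be sketched with the standard library alone. This is an illustration under assumptions: the x-api-key and anthropic-version headers mirror what the Anthropic API expects, and whether SGLang inspects them is not stated here. build_messages_request and first_text are hypothetical helpers.

```python
import json
import urllib.request


def build_messages_request(base_url: str, model: str, prompt: str,
                           max_tokens: int = 512) -> urllib.request.Request:
    """Build (but do not send) a POST to <base_url>/v1/messages."""
    # The server root carries no /v1 suffix; the route is appended here,
    # which is the step the Anthropic SDK performs on its own.
    url = base_url.rstrip("/") + "/v1/messages"
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {
        "content-type": "application/json",
        "x-api-key": "EMPTY",               # placeholder, as in the SDK example
        "anthropic-version": "2023-06-01",  # header the Anthropic API expects
    }
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


def first_text(response: dict) -> str:
    """Return the first `text` block, skipping any `thinking` blocks."""
    return next(b["text"] for b in response["content"] if b["type"] == "text")
```

To send it, pass the request to urllib.request.urlopen, decode the body with json.load, and hand the resulting dict to first_text.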

Streaming

Use the SDK's client.messages.stream() helper, or pass stream=True to client.messages.create(), to receive Server-Sent Events as they are produced.

Example

with client.messages.stream(
    model="zai-org/GLM-5.2-FP8",
    max_tokens=512,
    messages=[{"role": "user", "content": "Say this is a test"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Example Output:

Output

This is a test.
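Underneath the SDK helper, the response is a sequence of Server-Sent Events. The sketch below shows how the text could be reassembled from raw lines, assuming the event shapes follow the Anthropic streaming format (content_block_delta events carrying text_delta payloads). The sample lines are hand-written for illustration, not captured server output, and iter_text_deltas is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from raw SSE lines of a /v1/messages stream."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip `event:` lines and blank separators
        payload = json.loads(line[len("data:"):])
        if payload.get("type") != "content_block_delta":
            continue  # message_start, content_block_start, message_stop, ...
        delta = payload.get("delta", {})
        if delta.get("type") == "text_delta":
            yield delta["text"]


# Illustrative input, written by hand to match the Anthropic event shapes.
sample = [
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "This is "}}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "a test."}}',
    "",
    "event: message_stop",
    'data: {"type": "message_stop"}',
]
print("".join(iter_text_deltas(sample)))  # prints: This is a test.
```

Filtering on the delta type means reasoning output, which would arrive under a different delta type, is not mixed into the visible text.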

System Prompt

The top-level system field is accepted as a string or as a list of text blocks, matching the Anthropic API shape:

Example

message = client.messages.create(
    model="zai-org/GLM-5.2-FP8",
    max_tokens=512,
    system="You are a helpful assistant that answers concisely.",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(next(b.text for b in message.content if b.type == "text"))

Example Output:

Output

The capital of France is Paris.
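The list form can be sketched as follows. The block shape, a dict with "type": "text" and a "text" field, comes from the Anthropic API. flatten_system is a hypothetical helper showing one way the two forms could be treated as equivalent; it is not a description of SGLang's internal conversion, and the newline join is an assumption.

```python
from typing import Union

SystemPrompt = Union[str, list]


def flatten_system(system: SystemPrompt) -> str:
    """Collapse either accepted `system` shape into a single string."""
    if isinstance(system, str):
        return system
    # Assumption: text blocks are joined with newlines.
    return "\n".join(b["text"] for b in system if b.get("type") == "text")


system_blocks = [
    {"type": "text", "text": "You are a helpful assistant."},
    {"type": "text", "text": "Answer concisely."},
]

# Keyword arguments that could be passed to client.messages.create(**request_kwargs).
request_kwargs = {
    "model": "zai-org/GLM-5.2-FP8",
    "max_tokens": 512,
    "system": system_blocks,
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}
```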

Tool Use

Tool definitions follow the Anthropic tools schema. When the server is launched with a --tool-call-parser, the model’s tool calls are returned as tool_use content blocks:

Example

message = client.messages.create(
    model="zai-org/GLM-5.2-FP8",
    max_tokens=512,
    tools=[
        {
            "name": "get_weather",
            "description": "Get the weather for a city",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
)
print(message.stop_reason)
print([b for b in message.content if b.type == "tool_use"])

Example Output:

Output

tool_use
[ToolUseBlock(type='tool_use', id='toolu_01XXXX', name='get_weather', input={'city': 'Paris'})]
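A tool_use stop is only half of the exchange: the client runs the tool and sends the result back in the next request. The sketch below builds that follow-up using the Anthropic tool_result block shape. It assumes SGLang accepts this shape in multi-turn history, works on plain dicts rather than SDK objects, and uses a stand-in get_weather implementation; tool_result_turns is a hypothetical helper.

```python
import json


def get_weather(city: str) -> str:
    """Stand-in tool; a real one would query a weather service."""
    return json.dumps({"city": city, "forecast": "sunny"})


def tool_result_turns(assistant_content: list, registry: dict) -> list:
    """Build the two messages that follow a `tool_use` stop.

    Returns the assistant turn echoed back, plus a user turn holding one
    `tool_result` block per `tool_use` block.
    """
    results = []
    for block in assistant_content:
        if block["type"] != "tool_use":
            continue  # leave text and thinking blocks untouched
        output = registry[block["name"]](**block["input"])
        results.append(
            {"type": "tool_result", "tool_use_id": block["id"], "content": output}
        )
    return [
        {"role": "assistant", "content": assistant_content},
        {"role": "user", "content": results},
    ]


# Content blocks as plain dicts, shaped like the example output above.
assistant_content = [
    {"type": "tool_use", "id": "toolu_01XXXX", "name": "get_weather",
     "input": {"city": "Paris"}},
]
follow_up = tool_result_turns(assistant_content, {"get_weather": get_weather})
```

Appending follow_up to the original messages list and calling client.messages.create again with the same tools would let the model turn the tool output into a final answer.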

Counting Tokens

POST /v1/messages/count_tokens returns the tokenized length of a request without generating a response. It reuses the same request conversion as /v1/messages, so system prompts, tools, and multi-turn history are all accounted for.

Example

resp = client.messages.count_tokens(
    model="zai-org/GLM-5.2-FP8",
    messages=[{"role": "user", "content": "Hello, world"}],
)
print(resp.input_tokens)
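One use for the count is a pre-flight check before a long request. A minimal sketch, assuming the context window is shared between the prompt and the requested completion; the helper names are hypothetical and the 1048576 figure is the GLM-5.2 context length quoted on this page.

```python
def fits_context(input_tokens: int, max_tokens: int, context_length: int) -> bool:
    """True if the prompt plus the requested completion fits in the window."""
    return input_tokens + max_tokens <= context_length


def largest_max_tokens(input_tokens: int, context_length: int) -> int:
    """Largest completion budget that still fits; 0 if the prompt alone overflows."""
    return max(0, context_length - input_tokens)


CONTEXT_LENGTH = 1_048_576  # GLM-5.2 native context, per this page

# Example with a made-up prompt size; in practice pass resp.input_tokens.
print(fits_context(1_000_000, 512, CONTEXT_LENGTH))
print(largest_max_tokens(1_000_000, CONTEXT_LENGTH))
```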


Using Claude Code

Claude Code can be pointed at an SGLang server by setting a few env vars in the shell that starts it. With the server already running on :30000, export the full set and launch claude:

Command

export ANTHROPIC_BASE_URL="http://127.0.0.1:30000"
export ANTHROPIC_AUTH_TOKEN="dummy"                 # required by Claude Code; any non-empty string works
export API_TIMEOUT_MS="3000000"                     # long timeout — reasoning + 1M-context turns are slow
export CLAUDE_CODE_AUTO_COMPACT_WINDOW="1000000"     # let auto-compact use the full 1M window
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1   # drop autoupdater/telemetry/error-reporting noise
export CLAUDE_CODE_ATTRIBUTION_HEADER=0             # required for prefix-cache reuse — see below
export ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-5.2[1m]"    # [1m] suffix enables Claude Code's 1M-context beta
export ANTHROPIC_DEFAULT_SONNET_MODEL="glm-5.2[1m]"  # [1m] suffix enables Claude Code's 1M-context beta
export ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]"     # [1m] suffix enables Claude Code's 1M-context beta
claude

Each var matters:

ANTHROPIC_BASE_URL — points Claude Code at your SGLang server instead of the Anthropic API.
ANTHROPIC_AUTH_TOKEN — Claude Code requires a non-empty auth token; SGLang accepts any value when launched without --api-key.
API_TIMEOUT_MS — raise it; reasoning models with long outputs and 1M-context turns routinely exceed the default timeout.
ANTHROPIC_DEFAULT_{HAIKU,SONNET,OPUS}_MODEL — the model name Claude Code sends for each tier. SGLang does not validate this field, so any name works. Use glm-5.2[1m]: the [1m] suffix is a client-side hint that enables Claude Code’s 1M-context beta (without it, context is capped).
CLAUDE_CODE_AUTO_COMPACT_WINDOW — set to 1000000 so auto-compaction uses the full 1M window instead of the default, keeping long sessions alive.

Instead of exporting these in every shell, persist them in ~/.claude/settings.json under the env key — they apply to all Claude Code sessions:

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:30000",
    "ANTHROPIC_AUTH_TOKEN": "dummy",
    "API_TIMEOUT_MS": "3000000",
    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "1000000",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "glm-5.2[1m]",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-5.2[1m]",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "glm-5.2[1m]"
  }
}
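If the settings file already holds other keys, it is safer to merge the env block than to overwrite the file. The sketch below builds the variables exported earlier and merges them into an existing settings dict. claude_code_env and merge_settings are hypothetical helpers, and nothing here writes to disk; reading and writing ~/.claude/settings.json is left to the caller.

```python
import json

TIERS = ("HAIKU", "SONNET", "OPUS")


def claude_code_env(base_url: str, model: str, auth_token: str = "dummy") -> dict:
    """Return the env vars used on this page, with values as strings."""
    env = {
        "ANTHROPIC_BASE_URL": base_url,
        "ANTHROPIC_AUTH_TOKEN": auth_token,
        "API_TIMEOUT_MS": "3000000",
        "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "1000000",
        "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
        "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    }
    for tier in TIERS:
        env[f"ANTHROPIC_DEFAULT_{tier}_MODEL"] = model
    return env


def merge_settings(existing: dict, env: dict) -> dict:
    """Return a copy of `existing` with `env` merged under the "env" key."""
    merged = dict(existing)
    merged["env"] = {**existing.get("env", {}), **env}
    return merged


settings = merge_settings({}, claude_code_env("http://127.0.0.1:30000", "glm-5.2[1m]"))
print(json.dumps(settings, indent=2))
```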

Required: `CLAUDE_CODE_ATTRIBUTION_HEADER=0` for prefix-cache reuse

Set this whenever Claude Code routes through SGLang (or any non-Anthropic gateway). Without it, multi-turn conversations re-prefill the whole history every turn.

Claude Code prepends a per-request attribution block to the start of the system prompt, of the form x-anthropic-billing-header: cc_version=<ver>.<per-request-hash>; cc_entrypoint=...; cch=<hash>;. The per-request hash is the first token to differ between turns, so the radix prefix cache can only reuse the short prefix before that hash and re-prefills the system prompt plus the entire conversation history on every turn. Setting CLAUDE_CODE_ATTRIBUTION_HEADER=0 removes the whole attribution line from the system prompt. This is a documented Claude Code env var whose explicit purpose is to “improve prompt-cache hit rates when routing through an LLM gateway” (see the Claude Code env-vars reference).
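The effect can be seen in a toy model. The sketch below treats each prompt as a list of word-level "tokens" and measures how much of one turn's prompt is a prefix of the next. It illustrates prefix matching in general and is not SGLang's radix cache implementation; the token lists and hash values are made up.

```python
def shared_prefix_len(a: list, b: list) -> int:
    """Number of leading items that two token lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = ["You", "are", "Claude", "Code", "."]
HISTORY = ["user:", "hi", "assistant:", "hello"]


def prompt(per_request_hash, extra_turns):
    """Toy prompt: optional attribution header, system prompt, then history."""
    header = ["cc_version=", per_request_hash, ";"] if per_request_hash else []
    return header + SYSTEM + HISTORY + extra_turns


new_turn = ["user:", "more"]

# With the header, the hash changes between turns and matching stops there.
with_header = shared_prefix_len(prompt("a1b2", []), prompt("c3d4", new_turn))
# Without it, the whole previous prompt is a prefix of the next one.
without_header = shared_prefix_len(prompt(None, []), prompt(None, new_turn))

print(with_header, without_header)  # prints: 1 9
```

In the first case only the single token before the hash is reusable; in the second, all nine tokens of the earlier turn are, which is the behavior the setting is meant to restore.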

CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC does not remove the attribution block — it only covers autoupdater/telemetry/error reporting. The attribution header is a separate code path; use CLAUDE_CODE_ATTRIBUTION_HEADER=0 for it.

Troubleshooting

Connection refused / fetch failed: Ensure the server is up and the port in ANTHROPIC_BASE_URL matches --port (default 30000). If you set ANTHROPIC_BASE_URL to a remote host, confirm it’s reachable and not behind a proxy that blocks the connection.
Model not found / 404 from the server: SGLang does not validate the request model field and serves whatever model was loaded at startup, so a 404 usually means the request did not reach the /v1/messages route at all. Confirm ANTHROPIC_BASE_URL points at the server (not missing the port) and that the server finished loading.
Tool calls not working / returned as raw text: Launch the server with the correct --tool-call-parser for your model (e.g. glm47, qwen3). Without it the tools field is still accepted but the model’s tool calls come back as text instead of tool_use blocks, and Claude Code cannot execute them.
Slow / re-prefills the whole history every turn: You are missing CLAUDE_CODE_ATTRIBUTION_HEADER=0. Claude Code’s per-request attribution hash in the system prompt defeats radix prefix-cache reuse; see the section above.
Context capped below 1M: The model name must end in [1m] for Claude Code to enable its 1M-context beta. Verify ANTHROPIC_DEFAULT_*_MODEL uses the [1m] suffix, and that the loaded model’s native context is 1M (GLM-5.2 is 1048576; pass --context-length only to cap it, not to extend).
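Several of these problems come down to the client environment and can be checked before launching Claude Code. The sketch below encodes the checks described in this section as a hypothetical helper, check_claude_env; it inspects a dict of variables only and cannot tell whether the server itself is reachable or which parsers it was launched with.

```python
from urllib.parse import urlsplit

MODEL_VARS = tuple(
    f"ANTHROPIC_DEFAULT_{tier}_MODEL" for tier in ("HAIKU", "SONNET", "OPUS")
)


def check_claude_env(env: dict) -> list:
    """Return a list of human-readable problems found in `env`."""
    problems = []

    base_url = env.get("ANTHROPIC_BASE_URL", "")
    if not base_url:
        problems.append("ANTHROPIC_BASE_URL is not set")
    elif urlsplit(base_url).port is None:
        problems.append("ANTHROPIC_BASE_URL has no explicit port")

    if not env.get("ANTHROPIC_AUTH_TOKEN"):
        problems.append("ANTHROPIC_AUTH_TOKEN is empty")

    if env.get("CLAUDE_CODE_ATTRIBUTION_HEADER") != "0":
        problems.append("CLAUDE_CODE_ATTRIBUTION_HEADER is not 0")

    for name in MODEL_VARS:
        if not env.get(name, "").endswith("[1m]"):
            problems.append(f"{name} does not end in [1m]")

    return problems
```

Calling check_claude_env(dict(os.environ)) in the shell that will start Claude Code returns an empty list when all of these settings are in place.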

Parameters

The /v1/messages endpoint accepts the standard Anthropic Messages API parameters. Refer to the Anthropic Messages API reference for the full list. Reasoning models are supported through the same --reasoning-parser mechanism as the OpenAI-compatible endpoint; pass the model’s reasoning kwarg via the request (e.g. thinking for DeepSeek-V3-style models, enable_thinking for Qwen3-style models). See OpenAI APIs - Completions for the reasoning-parser / chat-template mapping.
