/v1/messages endpoint so any client built for the Anthropic
Messages API — including the Anthropic SDKs and agentic CLIs such as Claude Code — can talk to a
self-hosted SGLang server without changes. A complete reference for the API is available in the
Anthropic API Reference.
The endpoint is registered automatically on every SGLang server; no extra flag is required to enable it.
It reuses the same model, chat template, and reasoning / tool-call parsers as the OpenAI-compatible
endpoint, and supports both non-streaming and streaming responses, tool use, and a count_tokens route.
This tutorial covers:
POST /v1/messages(non-streaming and streaming)POST /v1/messages/count_tokens- Pointing Claude Code at the server, including the
CLAUDE_CODE_ATTRIBUTION_HEADERsetting that is required for good prefix-cache reuse.
Launch A Server
Launch the server in your terminal and wait for it to initialize. The Anthropic/v1/messages endpoint
is registered automatically — no extra flag is required beyond the usual server launch. The example below
is a single-node GLM-5.2-FP8 config; see the
GLM-5.2 cookbook for verified commands
across hardware and quantizations.
Command
- The endpoint is model-agnostic. The
/v1/messagesroute is on by default for any model; GLM-5.2 is used here because its reasoning + tool-use output is where Claude Code integration shines, but any model works. - Model name and
[1m]. SGLang does not validate the requestmodelfield, so Claude Code can send any name. The[1m]suffix is a client-side hint: Claude Code only enables its 1M-context beta when the model name ends in[1m]— without it, context is capped. Set the sameglm-5.2[1m]in theANTHROPIC_DEFAULT_*_MODELenv vars below. --reasoning-parser/--tool-call-parserare optional. Add them when the model emits reasoning content (GLM-5.2, Qwen3, DeepSeek-R1, …) or when you want tool calls parsed into structuredtool_useblocks. Without a tool-call parser, tool schemas are still accepted but the model’s tool calls come back as raw text, and Claude Code cannot execute them.- Context length defaults to the model’s own (1M for GLM-5.2); pass
--context-lengthonly to cap it.
Send A Message
Non-Streaming
Use the Anthropic Python SDK pointed at the server. Unlike the OpenAI SDK, the Anthropic SDK appends/v1/messages itself, so base_url is the server root without a /v1 suffix.
Example
Output
Streaming
Setstream=True to receive Server-Sent Events as they are produced.
Example
Output
System Prompt
The top-levelsystem field is accepted as a string or as a list of text blocks, matching the Anthropic
API shape:
Example
Output
Tool Use
Tool definitions follow the Anthropictools schema. When the server is launched with a
--tool-call-parser, the model’s tool calls are returned as tool_use content blocks:
Example
Output
Counting Tokens
POST /v1/messages/count_tokens returns the tokenized length of a request without generating a
response. It reuses the same request conversion as /v1/messages, so system prompts, tools, and
multi-turn history are all accounted for.
Example
Output
Using Claude Code
Claude Code can be pointed at an SGLang server by setting a few env vars in the shell that starts it. With the server already running on:30000, export the full set and launch claude:
Command
ANTHROPIC_BASE_URL— points Claude Code at your SGLang server instead of the Anthropic API.ANTHROPIC_AUTH_TOKEN— Claude Code requires a non-empty auth token; SGLang accepts any value when launched without--api-key.API_TIMEOUT_MS— raise it; reasoning models with long outputs and 1M-context turns routinely exceed the default timeout.ANTHROPIC_DEFAULT_{HAIKU,SONNET,OPUS}_MODEL— the model name Claude Code sends for each tier. SGLang does not validate this field, so any name works. Useglm-5.2[1m]: the[1m]suffix is a client-side hint that enables Claude Code’s 1M-context beta (without it, context is capped).CLAUDE_CODE_AUTO_COMPACT_WINDOW— set to1000000so auto-compaction uses the full 1M window instead of the default, keeping long sessions alive.
Required: CLAUDE_CODE_ATTRIBUTION_HEADER=0 for prefix-cache reuse
Set this whenever Claude Code routes through SGLang (or any non-Anthropic gateway). Without it,
multi-turn conversations re-prefill the whole history every turn.
x-anthropic-billing-header: cc_version=<ver>.<per-request-hash>; cc_entrypoint=...; cch=<hash>;. The
per-request hash is the first token to differ between turns, so the radix prefix cache can only reuse
the short prefix before that hash and re-prefills the system prompt plus the entire conversation history
on every turn.
Setting CLAUDE_CODE_ATTRIBUTION_HEADER=0 removes the whole attribution line from the system prompt.
This is a documented Claude Code env var whose explicit purpose is to “improve prompt-cache hit rates when
routing through an LLM gateway” (see the Claude Code env-vars reference).
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC does not remove the attribution block — it only covers
autoupdater/telemetry/error reporting. The attribution header is a separate code path; use
CLAUDE_CODE_ATTRIBUTION_HEADER=0 for it.Troubleshooting
Connection refused /fetch failed — Ensure the server is up and the port in ANTHROPIC_BASE_URL
matches --port (default 30000). If you set ANTHROPIC_BASE_URL to a remote host, confirm it’s reachable
and not behind a proxy that blocks the connection.
Model not found / 404 from the server — SGLang does not validate the request model field and
serves whatever model was loaded at startup, so a 404 usually means the request did not reach the
/v1/messages route at all. Confirm ANTHROPIC_BASE_URL points at the server (not missing the port) and
that the server finished loading.
Tool calls not working / returned as raw text — Launch the server with the correct
--tool-call-parser for your model (e.g. glm47, qwen3). Without it the tools field is still accepted
but the model’s tool calls come back as text instead of tool_use blocks, and Claude Code cannot execute
them.
Slow / re-prefills the whole history every turn — You are missing
CLAUDE_CODE_ATTRIBUTION_HEADER=0. Claude Code’s per-request attribution hash in the system prompt
defeats radix prefix-cache reuse; see the section above.
Context capped below 1M — The model name must end in [1m] for Claude Code to enable its 1M-context
beta. Verify ANTHROPIC_DEFAULT_*_MODEL uses the [1m] suffix, and that the loaded model’s native context
is 1M (GLM-5.2 is 1048576; pass --context-length only to cap it, not to extend).
Parameters
The/v1/messages endpoint accepts the standard Anthropic Messages API parameters. Refer to the
Anthropic Messages API reference for the full list.
Reasoning models are supported through the same --reasoning-parser mechanism as the OpenAI-compatible
endpoint; pass the model’s reasoning kwarg via the request (e.g. thinking for DeepSeek-V3-style models,
enable_thinking for Qwen3-style models). See OpenAI APIs - Completions for
the reasoning-parser / chat-template mapping.