Deployment

For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.
Command
pip install --upgrade pip
pip install uv
uv pip install sglang
Then, in that environment, run the Python command generated by the command panel below.
Pick your hardware + recipe to generate the launch command. The three serving strategies cover the common operating points:
  • Low-Latency — fastest reply for a single user. Pick for chat.
  • Balanced — good speed with several users at once. Use for typical multi-user serving.
  • High-Throughput — most tokens per second across many users. Best for batch jobs.

Playground

The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing.

1. Model Introduction

GLM-5.2 is Z.ai’s flagship Mixture-of-Experts model built on DeepSeek Sparse Attention (DSA): a lightning indexer selects a sparse set of key tokens per query (top-2048), so attention cost stays near-constant as context grows. It ships in two precisions — FP8 (zai-org/GLM-5.2-FP8) and full BF16 (zai-org/GLM-5.2) — both with 78 transformer layers, 256 routed experts (8 active per token), a 1M-token context window, and a single MTP (Multi-Token Prediction) layer for built-in EAGLE-style speculative decoding. FP8 is the recommended deployment; BF16 (~1.5 TB) needs an 8×B300 node or a multi-node setup.
Model          Architecture                                      Context (tokens)
GLM-5.2-FP8    MoE · DSA · 256 experts (top-8) · MTP · FP8       1,048,576
GLM-5.2        MoE · DSA · 256 experts (top-8) · MTP · BF16      1,048,576
Recommended generation: temperature=1.0, top_p=0.95 (the checkpoint’s generation_config.json defaults; informational — do not hardcode in client code). Resources: GLM-5.2-FP8 · GLM-5.2 (BF16).
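
The top-k key selection described above can be illustrated with a toy, pure-Python sketch. It is not the GLM-5.2 or SGLang kernel: the function name sparse_attention and the way index_scores is supplied are assumptions made for illustration. The point it shows is that the attention step itself only touches top_k keys, however long the context is, while the ranking step (the indexer's job) still has to score every key cheaply.

```python
import math

def sparse_attention(q, keys, values, index_scores, top_k):
    """Toy single-query attention over only the top_k keys ranked by index_scores.

    index_scores stands in for the indexer's per-key relevance scores
    (hypothetical input; the real indexer is a separate lightweight module).
    """
    k = min(top_k, len(keys))
    # Ranking scans every key once; this is the cheap, indexer-like step.
    keep = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)[:k]
    # Scaled dot-product logits for the kept keys only.
    scale = math.sqrt(len(q))
    logits = [sum(a * b for a, b in zip(keys[i], q)) / scale for i in keep]
    # Numerically stable softmax over the kept keys.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the kept values.
    dim = len(values[0])
    out = [sum(w * values[i][j] for w, i in zip(weights, keep)) for j in range(dim)]
    return out, keep

# Tiny example: three keys, keep only the single best-scoring one.
q = [1.0, 0.0]
keys = [[0.0, 0.0], [5.0, 0.0], [1.0, 0.0]]
values = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
index_scores = [0.1, 0.9, 0.5]
out, keep = sparse_attention(q, keys, values, index_scores, top_k=1)
print(keep, out)
```

With top_k fixed (2048 in GLM-5.2), the number of logits and weighted values computed per query does not grow with the context length, which is where the near-constant attention cost comes from.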

2. Configuration Tips

  • DeepSeek Sparse Attention (DSA). GLM-5.2 uses the glm_moe_dsa architecture; SGLang auto-selects the DSA attention backends (flashmla_sparse prefill, fa3 decode, sgl-kernel indexer topk). No attention-backend flag is needed on the supported hardware.
  • MTP / speculative decoding. The checkpoint ships one nextn layer. Enable EAGLE MTP for lower latency: --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 for low-latency, or the same flags with 1 step, topk 1, and 2 draft tokens (1-1-2) for balanced. The config’s index_share_for_mtp_iteration reuses the DSA indexer’s topk across draft steps (effective only at --speculative-eagle-topk 1).
  • Context Parallelism (CP) for long prefill. DSA prefill CP splits the long-prefill attention across --attn-cp-size ranks. On Hopper (H200) this gives a large prefill-latency win at long context — e.g. round-robin CP (--tp 8 --attn-cp-size 8 --enable-dsa-prefill-context-parallel --dsa-prefill-cp-mode round-robin-split) cut 64K-token prefill TTFT roughly 2.5–2.8× vs. plain TP8 in our testing. Trade-offs: CP partitions the KV pool (lower max context at the same --mem-fraction-static) and adds some decode-side overhead, so it pays off only for long sequences. CP is currently verified on Hopper only — the Blackwell (sm100) DSA-CP FP8 rope kernel is not yet adapted, so leave CP off on B200/GB300.
  • Memory. The FP8 weights are large: every expert must be resident in GPU memory, so the footprint follows the MoE total parameter count, not the active parameters per token. Start around --mem-fraction-static 0.8 on H200 (TP8) and tune up; raise it for the 4-GPU GB300 single-node layout (TP4).
  • DP-Attention + DeepEP. The balanced and high-throughput strategies spread attention across data-parallel ranks and route MoE through DeepEP.
  • BF16 weights need more GPUs (unverified). The full-precision build (zai-org/GLM-5.2, ~1.5 TB) does not fit a single 8×H200 / 8×B200 / 4×GB300 node. It fits single-node on 8×B300 (TP8, ~2.1 TB HBM); on the smaller GPUs it needs a multi-node layout (e.g. 2×8×H200 or 2×8×B200 at TP16, 2×4×GB300 at TP8). The BF16 recipes in the panel are proposed/inferred, not yet benchmarked (verified: false) — FP8 is the recommended deployment. Use the same DSA / MTP / chunked-prefill guidance as FP8.
  • Chunked-prefill size is regime-dependent. At long input (8K+ tokens) the default --chunked-prefill-size 2048 is too small and leaves the balanced point prefill-bound (queueing dominates TTFT). Raising it to --chunked-prefill-size 32768 on the balanced recipe gave roughly 34–78% higher output throughput and 39–59% lower TTFT on 8×H200 and 8×B200 (8K-in / 1K-out) in our testing. It is neutral for high-throughput, which is decode-bound, so keep the default there. Set --max-running-requests from KV capacity rather than tuning it freely: roughly 60–90 concurrent 8K-in / 1K-out FP8 requests fit on a single 8-GPU node, so pin balanced near --max-running-requests 80 and let high-throughput run wider.
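
To make the tips above concrete, here is a minimal sketch of how the flags mentioned in this section might combine into a launch command for FP8 on 8×H200. The helper build_launch_command is hypothetical, and the exact flag combination is an assumption: the command panel remains the source of the verified commands. DP-Attention, DeepEP, and CP flags are left out because their full verified settings are not spelled out here. The parser flags use the glm45 and glm47 parser names from the Advanced Usage section.

```python
import shlex

def build_launch_command(model_path, tp, strategy):
    """Assemble an illustrative sglang.launch_server command (not a verified recipe)."""
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp),
        "--mem-fraction-static", "0.8",      # starting point suggested for H200 TP8
        "--reasoning-parser", "glm45",
        "--tool-call-parser", "glm47",
        "--speculative-algorithm", "EAGLE",
    ]
    if strategy == "low-latency":
        # 3-1-4: three draft steps, topk 1, four draft tokens.
        cmd += ["--speculative-num-steps", "3",
                "--speculative-eagle-topk", "1",
                "--speculative-num-draft-tokens", "4"]
    elif strategy == "balanced":
        # 1-1-2 speculation plus the larger prefill chunk and a request cap near 80.
        cmd += ["--speculative-num-steps", "1",
                "--speculative-eagle-topk", "1",
                "--speculative-num-draft-tokens", "2",
                "--chunked-prefill-size", "32768",
                "--max-running-requests", "80"]
    else:
        raise ValueError(f"unsupported strategy in this sketch: {strategy}")
    return cmd

balanced = build_launch_command("zai-org/GLM-5.2-FP8", 8, "balanced")
low_latency = build_launch_command("zai-org/GLM-5.2-FP8", 8, "low-latency")
print(shlex.join(balanced))
```

Treat the printed command as a starting point to compare against the panel output, not as a replacement for it.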

3. Advanced Usage

3.1 Reasoning

GLM-5.2 is a hybrid-reasoning model. Enable the glm45 reasoning parser (toggle Reasoning Parser in the Parsers card of the Playground above) to separate thinking from the final answer: thinking lands in message.reasoning_content, the answer in message.content. Thinking is on by default; turn it off by passing chat_template_kwargs={"thinking": False} in the request's extra_body.
Example
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="zai-org/GLM-5.2-FP8",
    messages=[{"role": "user", "content": "What is 15% of 240?"}],
    extra_body={"chat_template_kwargs": {"thinking": True}},
)
msg = resp.choices[0].message
print("Reasoning:", getattr(msg, "reasoning_content", None))
print("Answer:", msg.content)
Output
Reasoning: 1.  **Identify the core question:** The user wants to find 15% of 240.
2.  **Convert the percentage to a decimal:** 15% = 0.15
3.  **Multiply by the total:** 0.15 * 240 = 36
    (Quick mental math: 10% of 240 = 24; 5% = 12; 24 + 12 = 36.)

Answer: 15% of 240 is **36**.

Here is how you can calculate it:
0.15 × 240 = 36
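
A small sketch of toggling thinking per request, assuming the same endpoint and model name as the example above. The helpers chat_kwargs and split_message are hypothetical names introduced for illustration; only the chat_template_kwargs "thinking" key and the reasoning_content field come from the text above. The block makes no network call, so it can be checked without a running server.

```python
from types import SimpleNamespace

def chat_kwargs(prompt, thinking, model="zai-org/GLM-5.2-FP8"):
    """Build keyword arguments for client.chat.completions.create(**kwargs)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Passed through to the chat template; False disables the thinking phase.
        "extra_body": {"chat_template_kwargs": {"thinking": thinking}},
    }

def split_message(msg):
    """Return (reasoning, answer); reasoning is None when the field is absent."""
    return getattr(msg, "reasoning_content", None), msg.content

fast = chat_kwargs("What is 15% of 240?", thinking=False)

# Stand-in for resp.choices[0].message from a thinking-off response (illustrative only).
fake_msg = SimpleNamespace(content="15% of 240 is 36.")
reasoning, answer = split_message(fake_msg)
print(reasoning, answer)
```

With a server running, the same dictionary can be passed as client.chat.completions.create(**fast), and split_message can be applied to resp.choices[0].message.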

3.2 Tool Calling

Enable the glm47 tool-call parser (toggle Tool Call Parser in the Parsers card of the Playground above) to surface structured tool calls via message.tool_calls. GLM-5.2 emits the newer <tool_call>…<arg_key>…<arg_value>… format, so it needs the glm47 parser; the older glm45 tool-call parser does not parse it and would leave the call as raw text in content. In thinking mode the turn also fills reasoning_content, so print both fields.
Example
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="zai-org/GLM-5.2-FP8",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
msg = resp.choices[0].message
print("Reasoning:", getattr(msg, "reasoning_content", None))
print("Tool calls:", msg.tool_calls)
Output
Reasoning: The user wants to know the weather in Paris. I'll call the get_weather function with "Paris" as the city.

Tool calls: [
  {
    "id": "call_13fcd52146934b7781d06d4a",
    "type": "function",
    "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}
  }
]
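
Once the model returns a tool call, the client is expected to run the tool and send the result back. The sketch below shows that step on the output above, using the standard OpenAI-style role="tool" message with a matching tool_call_id. The get_weather body and the TOOL_REGISTRY name are hypothetical stand-ins; a real implementation would call an actual weather service.

```python
import json

def get_weather(city):
    # Hypothetical stand-in: returns canned data instead of calling a real service.
    return {"city": city, "temperature_c": 18, "condition": "cloudy"}

# Maps tool names declared in `tools` to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_calls(tool_calls):
    """Execute each tool call and return the role="tool" messages to send back."""
    messages = []
    for call in tool_calls:
        fn = TOOL_REGISTRY[call["function"]["name"]]
        # Arguments arrive as a JSON string, not a dict.
        args = json.loads(call["function"]["arguments"])
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return messages

# Same shape as the output above. With the openai client, the objects in
# msg.tool_calls can be converted with [c.model_dump() for c in msg.tool_calls].
tool_calls = [{
    "id": "call_13fcd52146934b7781d06d4a",
    "type": "function",
    "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
}]
tool_messages = run_tool_calls(tool_calls)
print(tool_messages)
```

To finish the turn, append the assistant message that contained the tool calls, then these tool messages, to the conversation and call client.chat.completions.create again with the same tools list.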

3.3 HiCache (Hierarchical KV Caching)

For long-context, prefix-heavy workloads, enable hierarchical KV caching to spill cold KV blocks to host memory (toggle the Hierarchical KV Cache card in the Playground above). This is especially useful given GLM-5.2’s 1M-token window, where reusable prefixes can outgrow the GPU KV pool. Pair --hicache-ratio with a write policy that matches your reuse pattern.
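
As a rough sketch, the HiCache settings could be expressed as extra launch flags like this. Only --hicache-ratio is named in the text above; --enable-hierarchical-cache, --hicache-write-policy, and the listed policy values are our recollection of SGLang's server arguments and should be checked against python -m sglang.launch_server --help for your installed version. The helper name hicache_flags and the default values are assumptions for illustration.

```python
def hicache_flags(ratio=2.0, write_policy="write_through"):
    """Return illustrative HiCache flags to append to a launch command.

    ratio is understood here as the host KV pool size relative to the
    device KV pool (an assumption to confirm in the SGLang docs).
    """
    allowed = {"write_back", "write_through", "write_through_selective"}
    if write_policy not in allowed:
        raise ValueError(f"unknown write policy: {write_policy}")
    return [
        "--enable-hierarchical-cache",
        "--hicache-ratio", str(ratio),
        "--hicache-write-policy", write_policy,
    ]

flags = hicache_flags(ratio=3.0, write_policy="write_through")
print(" ".join(flags))
```

Under this reading, a larger ratio trades host RAM for a bigger pool of reusable prefixes.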