Deployment
Install SGLang
Install SGLang
For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.Then run the Python output of the command panel below in that environment.
- Python (pip / uv)
- Docker
Command
- Low-Latency — fastest reply for a single user. Pick for chat.
- Balanced — good speed with several users at once. Use for typical multi-user serving.
- High-Throughput — most tokens per second across many users. Best for batch jobs.
Playground
The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing.1. Model Introduction
GLM-5.2 is Z.ai’s flagship Mixture-of-Experts model built on DeepSeek Sparse Attention (DSA): a lightning indexer selects a sparse set of key tokens per query (top-2048), so attention cost stays near-constant as context grows. It ships in two precisions — FP8 (zai-org/GLM-5.2-FP8) and full BF16 (zai-org/GLM-5.2) — both with 78 transformer layers, 256 routed experts (8 active per token), a 1M-token context window, and a single MTP (Multi-Token Prediction) layer for built-in EAGLE-style speculative decoding. FP8 is the recommended deployment; BF16 (~1.5 TB) needs an 8×B300 node or a multi-node setup.
| Model | Architecture | Context |
|---|---|---|
| GLM-5.2-FP8 | MoE · DSA · 256 experts (top-8) · MTP · FP8 | 1,048,576 |
| GLM-5.2 | MoE · DSA · 256 experts (top-8) · MTP · BF16 | 1,048,576 |
temperature=1.0, top_p=0.95 (the checkpoint’s generation_config.json defaults; informational — do not hardcode in client code).
Resources: GLM-5.2-FP8 · GLM-5.2 (BF16).
2. Configuration Tips
- DeepSeek Sparse Attention (DSA). GLM-5.2 uses the
glm_moe_dsaarchitecture; SGLang auto-selects the DSA attention backends (flashmla_sparseprefill,fa3decode,sgl-kernelindexer topk). No attention-backend flag is needed on the supported hardware. - MTP / speculative decoding. The checkpoint ships one nextn layer. Enable EAGLE MTP for lower latency (
--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4for low-latency;1-1-2for balanced). The config’sindex_share_for_mtp_iterationreuses the DSA indexer’s topk across draft steps (effective only at--speculative-eagle-topk 1). - Context Parallelism (CP) for long prefill. DSA prefill CP splits the long-prefill attention across
--attn-cp-sizeranks. On Hopper (H200) this gives a large prefill-latency win at long context — e.g. round-robin CP (--tp 8 --attn-cp-size 8 --enable-dsa-prefill-context-parallel --dsa-prefill-cp-mode round-robin-split) cut 64K-token prefill TTFT roughly 2.5–2.8× vs. plain TP8 in our testing. Trade-offs: CP partitions the KV pool (lower max context at the same--mem-fraction-static) and adds some decode-side overhead, so it pays off only for long sequences. CP is currently verified on Hopper only — the Blackwell (sm100) DSA-CP FP8 rope kernel is not yet adapted, so leave CP off on B200/GB300. - Memory. The FP8 weights are large (MoE total, not active params). Start around
--mem-fraction-static 0.8on H200 (TP8) and tune up; raise it for the 4-GPU GB300 single-node layout (TP4). - DP-Attention + DeepEP for the balanced/high-throughput strategies spreads attention across data-parallel ranks and routes MoE through DeepEP.
- BF16 weights need more GPUs (unverified). The full-precision build (
zai-org/GLM-5.2, ~1.5 TB) does not fit a single 8×H200 / 8×B200 / 4×GB300 node. It fits single-node on 8×B300 (TP8, ~2.1 TB HBM); on the smaller GPUs it needs a multi-node layout (e.g. 2×8×H200 or 2×8×B200 at TP16, 2×4×GB300 at TP8). The BF16 recipes in the panel are proposed/inferred, not yet benchmarked (verified: false) — FP8 is the recommended deployment. Use the same DSA / MTP / chunked-prefill guidance as FP8. - Chunked-prefill size is regime-dependent. At long input (8K+) the default
--chunked-prefill-size 2048is too small and leaves the balanced point prefill-bound (queueing dominates TTFT). Raising it to--chunked-prefill-size 32768on the balanced recipe gave roughly +34–78% output throughput and −39–59% TTFT on 8×H200 and 8×B200 (8K-in / 1K-out) in our testing. It is neutral for high-throughput (decode-bound there) — keep the default.--max-running-requeststracks KV capacity, not a tuning free-for-all: ~60–90 concurrent 8K+1K FP8 requests fit on a single 8-GPU node, so pin balanced near--max-running-requests 80and let high-throughput run wider.
3. Advanced Usage
3.1 Reasoning
GLM-5.2 is a hybrid-reasoning model. Enable theglm45 reasoning parser (toggle Reasoning Parser in the Parsers card of the Playground above) to separate thinking from the final answer — thinking lands in message.reasoning_content, the answer in message.content. Thinking is on by default; turn it off with chat_template_kwargs: {"thinking": False}.
Reasoning Example (Python)
Reasoning Example (Python)
Example
Example Output
Example Output
Output
3.2 Tool Calling
Enable theglm47 tool-call parser (toggle Tool Call Parser in the Parsers card of the Playground above) to surface structured tool calls via message.tool_calls. GLM-5.2 emits the newer <tool_call>…<arg_key>…<arg_value>… format, so it needs the glm47 parser — the older glm45 parser does not parse it (the call would be left as raw text in content). On thinking mode the turn also fills reasoning_content, so print both fields.
Tool Calling Example (Python)
Tool Calling Example (Python)
Example
Example Output
Example Output
Output
3.3 HiCache (Hierarchical KV Caching)
For long-context, prefix-heavy workloads, enable hierarchical KV caching to spill cold KV blocks to host memory (toggle the Hierarchical KV Cache card in the Playground above). Useful given GLM-5.2’s 1M-token window; pair--hicache-ratio with a write policy that matches your reuse pattern.