## 1. Model Introduction
GLM-5 is the most powerful language model in the GLM series developed by Zhipu AI, targeting complex systems engineering and long-horizon agentic tasks. Scaling from GLM-4.5's 355B parameters (32B active) to 744B parameters (40B active), GLM-5 integrates DeepSeek Sparse Attention (DSA) to substantially reduce deployment cost while preserving long-context capacity. With advances in both pre-training (28.5T tokens) and post-training via slime (a novel asynchronous RL infrastructure), GLM-5 delivers significant improvements over GLM-4.7 and achieves best-in-class performance among open-source models on reasoning, coding, and agentic tasks.

Key Features:

- Systems Engineering & Agentic Tasks: Purpose-built for complex systems engineering and long-horizon agentic tasks
- State-of-the-Art Performance: Best-in-class among open-source models on reasoning (HLE, AIME, GPQA), coding (SWE-bench, Terminal-Bench), and agentic tasks (BrowseComp, Vending Bench 2)
- DeepSeek Sparse Attention (DSA): Reduces deployment cost while preserving long-context capacity
- Multiple Precision Options: BF16 and FP8 variants for different performance/memory trade-offs
- Speculative Decoding: EAGLE-based speculative decoding support for lower latency
Model variants:

- BF16 (full precision): `zai-org/GLM-5`
- FP8 (8-bit quantized): `zai-org/GLM-5-FP8`
## 2. SGLang Installation
Please refer to the official SGLang installation guide for installation instructions.

## 3. Model Deployment
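In recent SGLang releases, installation is typically a single pip command. The exact package extras and version pins may differ from the official guide, so treat this as a sketch:

```shell
# Install SGLang with all serving dependencies (the "[all]" extra is an
# assumption; check the official installation guide for the current form).
pip install --upgrade pip
pip install "sglang[all]"
```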
This section provides deployment configurations optimized for different hardware platforms and use cases.

### 3.1 Basic Configuration
SGLang supports serving GLM-5 on NVIDIA H100, H200, B200, and AMD MI300X/MI325X/MI355X GPUs.

### 3.2 Configuration Tips
- Speculative decoding (MTP) can significantly reduce latency for interactive use cases.
- DP Attention: Enables data parallel attention for higher throughput under high concurrency. Note that DP attention trades off low-concurrency latency for high-concurrency throughput — disable it if your workload is latency-sensitive with few concurrent requests.
- The `--mem-fraction-static` flag is recommended for optimal memory utilization; adjust it based on your hardware and workload.
- The BF16 model always requires 2x the GPUs of FP8 on NVIDIA hardware.
| Hardware | FP8 | BF16 |
|---|---|---|
| H100 | tp=16 | tp=32 |
| H200 | tp=8 | tp=16 |
| B200 | tp=8 | tp=16 |
| MI300X/MI325X | — | tp=8 |
| MI355X | — | tp=8 |
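As a concrete illustration of the table above, a minimal H200 FP8 launch might look like the following. This is a sketch, not the official command: the memory fraction, host, and port are assumptions to adjust for your deployment, and speculative-decoding and parser flags from later sections are omitted for brevity.

```shell
# Hedged sketch: serve GLM-5-FP8 on 8x H200 (tp=8, per the table above).
# --mem-fraction-static is a starting point; tune it per hardware/workload.
python -m sglang.launch_server \
  --model-path zai-org/GLM-5-FP8 \
  --tp 8 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 \
  --port 30000
```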
- B200 (FP8): Use `--ep 1 --attention-backend nsa --nsa-decode-backend trtllm --nsa-prefill-backend trtllm --moe-runner-backend flashinfer_trtllm --enable-flashinfer-allreduce-fusion` for optimized NSA and MoE backends on Blackwell. Also add `--quantization fp8` for FP8 weight quantization.
- AMD GPUs: Use `--nsa-prefill-backend tilelang --nsa-decode-backend tilelang` for the NSA attention backend. Add `--chunked-prefill-size 131072` and `--watchdog-timeout 1200` (20 minutes, to allow for weight loading). EAGLE speculative decoding is not currently supported on AMD for GLM-5.
- For other configuration tips, please refer to the DeepSeek V3.2 documentation. GLM-5 and DeepSeek V3.2 share the same model structure, so most optimization techniques (MTP, DSA kernels, Context Parallel, …) apply to both.
- Use `--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'` for GLM-5-FP8 if you want to enable the IndexCache method. This feature is supported through this PR and introduces only a small accuracy loss; it is not recommended for rigorous accuracy evaluations.
## 4. Model Invocation
Deploy GLM-5 with the following command (FP8 on H200, all features enabled):

Command
### 4.1 MI300X/MI325X/MI355X (ROCm) Server Command
The following ROCm command is an additional option for AMD GPUs and does not replace the NVIDIA instructions above.

Command
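Combining the AMD-specific flags from section 3.2, a ROCm launch sketch (BF16, tp=8) might look like the following; treat every flag value as a starting point to verify against your SGLang version:

```shell
# Hedged sketch: GLM-5 BF16 on 8x AMD GPUs with TileLang NSA backends.
# The longer watchdog timeout (1200 s) allows for slow weight loading.
python -m sglang.launch_server \
  --model-path zai-org/GLM-5 \
  --tp 8 \
  --nsa-prefill-backend tilelang \
  --nsa-decode-backend tilelang \
  --chunked-prefill-size 131072 \
  --watchdog-timeout 1200
```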
### 4.2 Basic Usage
For basic API usage and request examples, please refer to:

### 4.3 Advanced Usage
#### 4.3.1 Reasoning Parser
GLM-5 supports Thinking mode by default. Enable the reasoning parser during deployment to separate the thinking and content sections. The thinking process is returned via `reasoning_content` in the streaming response.
To disable thinking and use Instruct mode, pass `chat_template_kwargs` at request time:
- Thinking mode (default): The model performs step-by-step reasoning before answering. No extra parameters needed.
- Instruct mode (`{"enable_thinking": false}`): The model responds directly without a thinking process.
Thinking mode example (the reasoning is returned in `reasoning_content`):
Example
Output
Instruct mode example, passing `{"enable_thinking": false}` via `chat_template_kwargs`:
Example
Output
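A minimal sketch of the two request shapes for the OpenAI-compatible `/v1/chat/completions` endpoint. The base URL, port, and model name are assumptions for your deployment; for clarity this only builds the payloads rather than calling a live server:

```python
import json

BASE_URL = "http://localhost:30000/v1"  # assumption: local SGLang server

# Thinking mode (default): no extra parameters; the parsed reasoning
# arrives in the response's `reasoning_content` field.
thinking_request = {
    "model": "zai-org/GLM-5-FP8",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}],
}

# Instruct mode: disable thinking via chat_template_kwargs.
instruct_request = {
    **thinking_request,
    "chat_template_kwargs": {"enable_thinking": False},
}

print(json.dumps(instruct_request, indent=2))
# To send either payload, POST it to f"{BASE_URL}/chat/completions",
# e.g. requests.post(f"{BASE_URL}/chat/completions", json=instruct_request)
```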
#### 4.3.2 Tool Calling
GLM-5 supports tool calling. Enable the tool call parser during deployment. Thinking mode is on by default; to disable it for tool calling requests, pass `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`.
Python Example (with Thinking Process):
Example
Output
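A hedged sketch of a tool-calling request payload. The `get_weather` tool schema and the model name are illustrative, not part of the GLM-5 API; when using the openai Python client, the `chat_template_kwargs` entry goes in `extra_body` as noted in the comment:

```python
import json

# Hypothetical tool definition, for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "model": "zai-org/GLM-5-FP8",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": tools,
    # Disable thinking for this tool call. With the openai client, pass this
    # as extra_body={"chat_template_kwargs": {"enable_thinking": False}}.
    "chat_template_kwargs": {"enable_thinking": False},
}

print(json.dumps(request, indent=2))
```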
## 5. Benchmark
### 5.1 Speed Benchmark
Test Environment:

- Hardware: H200 (8x)
- Model: GLM-5-FP8
- Tensor Parallelism: 8
- SGLang Version: commit 947927bdb
#### 5.1.1 Latency Benchmark
Command
Output
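One way to reproduce a latency measurement is SGLang's built-in single-batch benchmark. The module name and flag set below are assumptions to verify against the SGLang version pinned above (commit 947927bdb); the input/output lengths are illustrative:

```shell
# Hedged sketch: single-batch latency with SGLang's bench_one_batch tool.
python -m sglang.bench_one_batch \
  --model-path zai-org/GLM-5-FP8 \
  --tp 8 \
  --batch-size 1 \
  --input-len 1024 \
  --output-len 512
```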
#### 5.1.2 Throughput Benchmark
Command
Output
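For throughput, SGLang ships a serving benchmark that drives a running server with synthetic requests. The flags below (dataset, prompt count, lengths) are illustrative assumptions; it expects the server from the deployment sections to be up on the default port:

```shell
# Hedged sketch: serving throughput against a running SGLang server.
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 512 \
  --random-input-len 1024 \
  --random-output-len 512
```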
### 5.2 Accuracy Benchmark
#### 5.2.1 GSM8K Benchmark
- Benchmark Command
Command
- Test Result
Output
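A GSM8K accuracy check can be run against the live server with SGLang's few-shot GSM8K test script. The module path and flags are assumptions to verify against your SGLang checkout; the question count is illustrative:

```shell
# Hedged sketch: few-shot GSM8K accuracy against a running server.
python -m sglang.test.few_shot_gsm8k --num-questions 200
```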
#### 5.2.2 MMLU Benchmark
- Benchmark Command
Command
- Test Result
Output
### 5.3 AMD GPU Benchmarks
#### 5.3.1 GSM8K Benchmark (MI325/MI35x)
- MI325/MI35x Test (GLM-5 BF16, `tp=8`, TileLang NSA backends)
Command
Output
