1. Model Introduction
Available Models:
- BF16 (full precision): zai-org/GLM-5.1
- FP8 (8-bit quantized): zai-org/GLM-5.1-FP8
2. SGLang Installation
Please refer to the official SGLang installation guide for installation instructions.
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, and desired capabilities. SGLang supports serving GLM-5.1 on NVIDIA H100, H200, B200, and GB300 GPUs, and on AMD MI300X/MI325X/MI355X GPUs.
3.2 Configuration Tips
- Speculative decoding (MTP) can significantly reduce latency for interactive use cases.
- DP Attention: Enables data parallel attention for higher throughput under high concurrency. Note that DP attention trades off low-concurrency latency for high-concurrency throughput — disable it if your workload is latency-sensitive with few concurrent requests.
- The `--mem-fraction-static` flag is recommended for optimal memory utilization; adjust it based on your hardware and workload.
- The BF16 model always requires 2x the GPUs compared to FP8 on NVIDIA hardware.
| Hardware | FP8 | BF16 |
|---|---|---|
| H100 | tp=16 | tp=32 |
| H200 | tp=8 | tp=16 |
| B200 | tp=8 | tp=16 |
| GB300 | tp=4 | — |
| MI300X/MI325X | tp=8 | tp=8 |
| MI355X | tp=8 | tp=8 |
- AMD GPUs: Both BF16 and FP8 checkpoints are supported on MI300X/MI325X/MI355X at tp=8. Use `--nsa-prefill-backend tilelang --nsa-decode-backend tilelang` for the NSA attention backend. Add `--chunked-prefill-size 131072` and `--watchdog-timeout 1200` (20 minutes, to allow time for weight loading). FP8 uses approximately half the memory of BF16 (~89 GB/GPU vs ~175 GB/GPU). EAGLE speculative decoding is not currently supported on AMD for GLM-5.1.
- GB300: Only the FP8 checkpoint is recommended on GB300, with tp=4. For high-throughput DP attention on GB300, use `--dp 4`.
- For other configuration tips, please refer to the DeepSeek V3.2 documentation. GLM-5.1 and DeepSeek V3.2 share the same model structure, so the optimization techniques (MTP, the DSA kernel, context parallelism, etc.) apply to both models.
- Use `--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'` for GLM-5.1-FP8 if you want to enable the IndexCache method. This feature is supported through this PR and introduces only a small accuracy loss; however, it is not recommended if you are running rigorous accuracy evaluations.
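The tensor-parallel table above can be sketched as a small helper. This is a hypothetical illustration, not an official launcher: the `launch_args` function is invented here, and while `--model-path` and `--tp` follow common SGLang CLI conventions, verify the flag spellings against your installed SGLang version.

```python
# Hypothetical sketch: mapping the (hardware, precision) table above to
# tensor-parallel sizes and a minimal argument list.

TP_TABLE = {  # values taken from the table in Section 3.2
    ("H100", "FP8"): 16, ("H100", "BF16"): 32,
    ("H200", "FP8"): 8,  ("H200", "BF16"): 16,
    ("B200", "FP8"): 8,  ("B200", "BF16"): 16,
    ("GB300", "FP8"): 4,  # BF16 is not recommended on GB300
    ("MI300X", "FP8"): 8, ("MI300X", "BF16"): 8,
    ("MI325X", "FP8"): 8, ("MI325X", "BF16"): 8,
    ("MI355X", "FP8"): 8, ("MI355X", "BF16"): 8,
}

def launch_args(hardware: str, precision: str) -> list[str]:
    """Build a minimal SGLang-style argument list for a given platform."""
    tp = TP_TABLE[(hardware, precision)]  # raises KeyError for unsupported combos
    model = "zai-org/GLM-5.1" if precision == "BF16" else "zai-org/GLM-5.1-FP8"
    return ["--model-path", model, "--tp", str(tp)]
```

Note that `("GB300", "BF16")` is deliberately absent, so an unsupported combination fails loudly rather than silently picking a default.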
4. Model Invocation
Deploy GLM-5.1 with the following command (FP8 on H200, all features enabled):
Command
4.1 MI300X/MI325X/MI355X (ROCm) Server Command
The following ROCm commands are additional options for AMD GPUs and do not replace the NVIDIA instructions above.
FP8 (Recommended)
Command
BF16
Command
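The AMD-specific flags from Section 3.2 can be collected into one place for reuse. The flag values below come directly from the text above; grouping them in a Python list is purely illustrative.

```python
# Illustrative sketch: the ROCm-specific flags recommended in Section 3.2,
# gathered into a single argument list.
AMD_EXTRA_ARGS = [
    "--tp", "8",                          # MI300X/MI325X/MI355X all use tp=8
    "--nsa-prefill-backend", "tilelang",  # TileLang NSA attention backend
    "--nsa-decode-backend", "tilelang",
    "--chunked-prefill-size", "131072",
    "--watchdog-timeout", "1200",         # 20 minutes, to cover weight loading
]
```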
4.2 Basic Usage
For basic API usage and request examples, please refer to:
4.3 Advanced Usage
4.3.1 Reasoning Parser
GLM-5.1 supports Thinking mode by default. Enable the reasoning parser during deployment to separate the thinking and content sections. The thinking process is returned via `reasoning_content` in the streaming response.
To disable thinking and use Instruct mode, pass chat_template_kwargs at request time:
- Thinking mode (default): The model performs step-by-step reasoning before answering. No extra parameters needed.
- Instruct mode (`{"enable_thinking": false}`): The model responds directly without a thinking process.
Thinking mode (default), returning `reasoning_content`:
Example
Output
Instruct mode, passing `{"enable_thinking": false}` via `chat_template_kwargs`:
Example
Output
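The two modes above differ only in the request payload. The following is a minimal sketch assuming the OpenAI-compatible chat-completions format that SGLang serves; the helper function and the user message are invented for illustration, and the commented-out endpoint URL is a placeholder.

```python
# Hypothetical sketch: building request bodies for Thinking vs Instruct mode.

def build_request(messages: list[dict], thinking: bool = True) -> dict:
    """Assemble a chat-completions payload; SGLang reads
    chat_template_kwargs to toggle the thinking process."""
    return {
        "model": "zai-org/GLM-5.1",
        "messages": messages,
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

question = [{"role": "user", "content": "What is 17 * 24?"}]
thinking_req = build_request(question)                  # default Thinking mode
instruct_req = build_request(question, thinking=False)  # Instruct mode

# To send (endpoint is a placeholder for your local server):
#   import requests
#   resp = requests.post("http://localhost:30000/v1/chat/completions",
#                        json=thinking_req)
# In Thinking mode the returned message carries both fields:
#   message["reasoning_content"]  -> the step-by-step reasoning
#   message["content"]            -> the final answer
```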
4.3.2 Tool Calling
GLM-5.1 supports tool calling capabilities. Enable the tool call parser during deployment. Thinking mode is on by default; to disable it for tool calling requests, pass `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`.
Python Example (with Thinking Process):
Example
Output
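A tool-calling request combines a standard tool schema with the `chat_template_kwargs` toggle described above. In this sketch, the `get_weather` tool, its parameters, and the user message are illustrative placeholders; only the payload shape follows the OpenAI-compatible tools format.

```python
# Hypothetical sketch of a tool-calling request body with thinking disabled.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # placeholder tool, not from the source
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "zai-org/GLM-5.1",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": tools,
    # Disable the thinking process for this request, as described above:
    "chat_template_kwargs": {"enable_thinking": False},
}
```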
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: H200 (8x)
- Model: GLM-5.1-FP8
- Tensor Parallelism: 8
- SGLang Version: commit 947927bdb
5.1.1 Latency Benchmark
Command
Output
5.1.2 Throughput Benchmark
Command
Output
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Benchmark Command
Command
- Test Result
Output
5.2.2 MMLU Benchmark
- Benchmark Command
Command
- Test Result
Output
5.3 AMD GPU Benchmarks
5.3.1 GSM8K Benchmark (MI325/MI35x)
- MI325/MI35x Test (GLM-5.1 BF16, tp=8, TileLang NSA backends)
Command
Output
