Introduction
GLM-5.1 is a Mixture-of-Experts (MoE) large language model developed by Z.ai, featuring 744B total parameters with 40B active parameters. It uses 256 routed experts (top-8) plus one shared expert, with Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention (DSA), and a built-in multi-token prediction (MTP) head for speculative decoding. The model features built-in bilingual (Chinese-English) capabilities with a unified pre-training framework, excelling at reasoning, math, code, and tool calling tasks. GLM-5.1 supports both Thinking mode (step-by-step reasoning) and Instruct mode (direct response), with a native context window of approximately 200K tokens. This document demonstrates the deployment of GLM-5.1 on Ascend NPUs using SGLang, including single-node and multi-node deployment, feature configuration, and performance optimization. This document is validated and written based on SGLang v0.5.13. The current model (GLM-5.1) is fully supported in this version. To use the latest features (e.g., speculative decoding, multi-node deployment), it is recommended to use v0.5.13 or a later version.Supported features
| Feature | Example usage |
|---|---|
| Tensor Parallelism | --tp-size 16 |
| Data Parallelism | --dp-size 16 |
| Expert Parallelism | --ep-size 16 \--moe-a2a-backend deepep \--deepep-mode auto |
| Context Parallelism | --enable-nsa-prefill-context-parallel \--nsa-prefill-cp-mode in-seq-split \--attn-cp-size 4 |
| PD Disaggregation | --disaggregation-mode prefill \--disaggregation-transfer-backend ascend |
| Quantization | --quantization modelslim |
| Chunked Prefill | auto based on device memory, or set explicit value; disable with --chunked-prefill-size -1; e.g. --chunked-prefill-size 16384 |
| NPU Graph | enabled by default; disable with --disable-cuda-graph;control range via --cuda-graph-bs or --cuda-graph-max-bs; e.g. --cuda-graph-bs 1 2 3 4 5 6 |
| Speculative Decoding | --speculative-algorithm NEXTN \--speculative-num-steps 3 \--speculative-eagle-topk 1 \--speculative-num-draft-tokens 4 \--speculative-draft-model-quantization unquant |
| Overlap Schedule | export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 |
| DP LM Head | --enable-dp-lm-head |
The values in the Example usage column are for illustration only. Adjust them according to your hardware, deployment
mode, and workload. For parameter details, see
Feature descriptions; for
recommended configurations for each deployment scenario, see Best practices.
Prerequisites
Model weights
- GLM-5.1 (BF16)
- GLM-5.1-w4a8 (Quantized version)
- You can use msmodelslim to quantize the model naively.
Installation
The dependencies required for the NPU runtime environment have been integrated into a Docker image and uploaded to the online platform. You can directly pull it. Both stable releases and daily builds are available. The following command is based on the stable release tag. For details, see Docker image versions.- Atlas 800I A3
- Atlas 800I A2
Command
Online service deployment
Multi-node PD mixed deployment
Multi-node deployment distributes the model across multiple Atlas 800I A3 nodes using tensor parallelism while keeping prefill and decode on the same nodes (PD mixed mode), suitable for scenarios that need more device memory than a single node can provide. This scenario is already covered in the best practice. For the complete, optimized deployment commands and benchmark data, see GLM-5.1 Best Practice — Multi-node PD Mixed On A3.Multi-node PD disaggregation deployment
PD disaggregation splits the prefill and decode stages onto separate nodes, reducing interference and improving throughput for high-concurrency scenarios. This scenario is already covered in the best practice. For the complete, optimized deployment commands and benchmark data, see GLM-5.1 Best Practice — PD Disaggregation On A3.Functional verification
After the service is started, you can invoke the model by sending a prompt:The server is fired up and ready to roll! in the logs, it is ready to accept requests. For more
testing examples (Health Check, Generate, Chat Completions, and port usage guidance),
see Testing the Service.
