# GLM-5.2

## Introduction

GLM-5.2 is a large language model in the GLM (General Language Model) series, jointly developed by the KEG Laboratory
of Tsinghua University and Zhipu AI. GLM-5.2 adopts the DeepSeek-V3/V3.2 architecture, including DeepSeek Sparse
Attention (DSA) and multi-token prediction (MTP), and supports high-throughput inference with SGLang on Ascend NPUs.

This document describes how to deploy GLM-5.2 on Ascend NPUs with SGLang, covering single-node deployment,
multi-node deployment, prefill-decode disaggregation, feature configuration, and performance optimization.

## Supported features

| Feature              | Example usage                                                                                                                                                                                            |
| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Tensor Parallelism   | `--tp-size 16`                                                                                                                                                                                           |
| Data Parallelism     | `--dp-size 16`                                                                                                                                                                                           |
| Expert Parallelism   | `--ep-size 16 \`<br />`--moe-a2a-backend deepep \`<br />`--deepep-mode auto`                                                                                                                             |
| PD Disaggregation    | `--disaggregation-mode prefill \`<br />`--disaggregation-transfer-backend ascend`                                                                                                                        |
| Quantization         | `--quantization modelslim`                                                                                                                                                                               |
| Chunked Prefill      | auto based on device memory, or set explicit value;<br />disable with `--chunked-prefill-size -1`; e.g. `--chunked-prefill-size 16384`                                                                   |
| NPU Graph            | enabled by default; disable with `--disable-cuda-graph`;<br />control range via `--cuda-graph-bs` or `--cuda-graph-max-bs`; e.g. `--cuda-graph-bs 16`                                                    |
| Speculative Decoding | `--speculative-algorithm NEXTN \`<br />`--speculative-num-steps 3 \`<br />`--speculative-eagle-topk 1 \`<br />`--speculative-num-draft-tokens 4 \`<br />`--speculative-draft-model-quantization unquant` |
| Overlap Schedule     | `export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1`                                                                                                                                                             |
| DP LM Head           | `--enable-dp-lm-head`                                                                                                                                                                                    |

<Note>
  The values in the **Example usage** column are for illustration only. Adjust them according to your hardware, deployment
  mode, and workload. For parameter details, see
  [Feature descriptions](/docs/hardware-platforms/ascend-npus/ascend_npu_optimization#feature-descriptions); for
  recommended configurations for each deployment scenario, see [Best practices](#best-practices).
</Note>

For compatibility and conflicts between features,
see [Feature Compatibility](/docs/hardware-platforms/ascend-npus/ascend_npu_optimization#feature-compatibility).

## Prerequisites

### Model weights

<Warning>
  If you need to download model weights, check the model size before downloading to reserve enough space.
</Warning>

* [GLM-5.2](https://huggingface.co/collections/zai-org/glm-52) (BF16)
* [GLM-5.2-w8a8](https://www.modelscope.cn/models/Eco-Tech/GLM-5.2-w8a8/) (Quantized version without MTP)
* Alternatively, you can use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to quantize the model yourself.

Ensure the available device memory exceeds the model weight size before deployment. For optimal throughput and latency,
refer to the [best practice configurations](#best-practices), which may require additional nodes or cards.

For multi-node deployments, download the model weights to a directory shared by all nodes.
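
The disk space check mentioned in the warning above can be scripted. The following is a minimal sketch, not part of SGLang: the function name `has_free_space`, the example path, and the example size are all hypothetical, and the required size must be taken from the model page.

```shell
# Sketch: succeed only if DIR has at least REQUIRED_GB GiB of free space.
# Usage: has_free_space DIR REQUIRED_GB
has_free_space() (
    dir="$1"
    required_gb="$2"
    # With `df -Pk`, the 4th column of the 2nd line is the available space in KiB.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((required_gb * 1024 * 1024)) ]
)

# Example (hypothetical path and size, adjust both before use):
# has_free_space /path/to/models 800 || echo "Not enough free space" >&2
```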

## Installation

<Warning>
  Ensure sufficient disk space before pulling images. The Docker image requires at least **30 GB** of free space.
</Warning>

The dependencies required for the NPU runtime environment are bundled into a Docker image, which you can pull
directly from the image registry.

<Note>
  The GLM-5.2 images below use daily-build tags because day-0 support was released before the related code was merged
  into the main branch. The tags will be switched to stable release images once the support lands in a stable release.
</Note>

<Tabs>
  <Tab title="Atlas 800I A3">
    ```bash Command theme={null}
    docker pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:cann9.0.0-a3-glm5.2-20260615

    docker run -itd --shm-size=16g --name ${NAME} \
    --privileged=true --net=host \
    -v /var/queue_schedule:/var/queue_schedule \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --device=/dev/davinci0:/dev/davinci0  \
    --device=/dev/davinci1:/dev/davinci1  \
    --device=/dev/davinci2:/dev/davinci2  \
    --device=/dev/davinci3:/dev/davinci3  \
    --device=/dev/davinci4:/dev/davinci4  \
    --device=/dev/davinci5:/dev/davinci5  \
    --device=/dev/davinci6:/dev/davinci6  \
    --device=/dev/davinci7:/dev/davinci7  \
    --device=/dev/davinci8:/dev/davinci8  \
    --device=/dev/davinci9:/dev/davinci9  \
    --device=/dev/davinci10:/dev/davinci10  \
    --device=/dev/davinci11:/dev/davinci11  \
    --device=/dev/davinci12:/dev/davinci12  \
    --device=/dev/davinci13:/dev/davinci13  \
    --device=/dev/davinci14:/dev/davinci14  \
    --device=/dev/davinci15:/dev/davinci15  \
    --device=/dev/davinci_manager:/dev/davinci_manager \
    --device=/dev/hisi_hdc:/dev/hisi_hdc \
    --entrypoint=bash \
    swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:${TAG}
    ```
  </Tab>

  <Tab title="Atlas 800I A2">
    ```bash Command theme={null}
    docker pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:cann9.0.0-910b-glm5.2-20260615

    docker run -itd --shm-size=16g --name ${NAME} \
    --privileged=true --net=host \
    -v /var/queue_schedule:/var/queue_schedule \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --device=/dev/davinci0:/dev/davinci0  \
    --device=/dev/davinci1:/dev/davinci1  \
    --device=/dev/davinci2:/dev/davinci2  \
    --device=/dev/davinci3:/dev/davinci3  \
    --device=/dev/davinci4:/dev/davinci4  \
    --device=/dev/davinci5:/dev/davinci5  \
    --device=/dev/davinci6:/dev/davinci6  \
    --device=/dev/davinci7:/dev/davinci7  \
    --device=/dev/davinci8:/dev/davinci8  \
    --device=/dev/davinci9:/dev/davinci9  \
    --device=/dev/davinci10:/dev/davinci10  \
    --device=/dev/davinci11:/dev/davinci11  \
    --device=/dev/davinci12:/dev/davinci12  \
    --device=/dev/davinci13:/dev/davinci13  \
    --device=/dev/davinci14:/dev/davinci14  \
    --device=/dev/davinci15:/dev/davinci15  \
    --device=/dev/davinci_manager:/dev/davinci_manager \
    --device=/dev/hisi_hdc:/dev/hisi_hdc \
    --entrypoint=bash \
    swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:${TAG}
    ```
  </Tab>
</Tabs>

<Tip>
  * If the model weights have already been downloaded to a shared directory, use `-v` to mount the model path into the
    container, for example: `-v /path/to/models:/models`.
  * Replace `${NAME}` with your own container name, or remove `--name ${NAME}` to let Docker assign a default name.
  * Replace `${TAG}` with the image tag for your hardware platform, as shown in the corresponding `docker pull` command.
</Tip>
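
The sixteen `--device=/dev/davinciN` lines in the commands above follow a single pattern, so they can be generated instead of typed. The helper below is an illustrative sketch; `davinci_device_flags` is a hypothetical name, and it assumes the NPUs are numbered consecutively from 0, as in the commands above.

```shell
# Sketch: print one --device flag per NPU for device indices 0..COUNT-1.
# Usage: davinci_device_flags COUNT
davinci_device_flags() (
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf -- '--device=/dev/davinci%s:/dev/davinci%s\n' "$i" "$i"
        i=$((i + 1))
    done
)

# Example: splice the generated flags into the docker run command
# (left unquoted on purpose so that each flag becomes its own argument).
# docker run -itd --shm-size=16g $(davinci_device_flags 16) ...
```

Adjust the count if fewer NPUs should be exposed to the container.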

## Online service deployment

### Single-node deployment

The quantized model `GLM-5.2-w8a8` can be deployed on one Atlas 800I A3 node or one Atlas 800I A2 node.

<Tabs>
  <Tab title="Atlas 800I A3">
    Run the following script to start the online inference service.

    ```shell theme={null}
    # ============================================================
    # Before running, update the following variables:
    #   MODEL_PATH: path to the model weights directory
    # ============================================================

    # high performance cpu
    echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    sysctl -w vm.swappiness=0
    sysctl -w kernel.numa_balancing=0
    sysctl -w kernel.sched_migration_cost_ns=50000
    # bind cpu
    export SGLANG_SET_CPU_AFFINITY=1

    unset https_proxy
    unset http_proxy
    unset HTTPS_PROXY
    unset HTTP_PROXY
    unset ASCEND_LAUNCH_BLOCKING
    # cann
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    source /usr/local/Ascend/nnal/atb/set_env.sh

    export STREAMS_PER_DEVICE=32
    export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
    export SGLANG_ENABLE_SPEC_V2=1
    # MTP OVERLAP
    export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
    export SGLANG_NPU_USE_MULTI_STREAM=1

    export HCCL_BUFFSIZE=1000
    export HCCL_OP_EXPANSION_MODE=AIV
    export HCCL_SOCKET_IFNAME=lo
    export GLOO_SOCKET_IFNAME=lo
    # DEEPEP
    export DEEPEP_NORMAL_LONG_SEQ_ROUND=72
    export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024
    export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
    export DEEP_NORMAL_MODE_USE_INT8_QUANT=1

    MODEL_PATH=/path/to/model-weights

    python3 -m sglang.launch_server \
            --model-path $MODEL_PATH \
            --attention-backend ascend \
            --device npu \
            --tp-size 16 --nnodes 1 --node-rank 0 \
            --chunked-prefill-size 16384 --max-prefill-tokens 280000 \
            --trust-remote-code \
            --host 127.0.0.1 \
            --mem-fraction-static 0.7 \
            --port 8000 \
            --served-model-name glm-5 \
            --cuda-graph-bs 16 \
            --quantization modelslim \
            --speculative-draft-model-quantization unquant \
            --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4  \
            --moe-a2a-backend deepep --deepep-mode auto
    ```
  </Tab>

  <Tab title="Atlas 800I A2">
    Run the following script to start the online inference service.

    ```shell theme={null}
    # ============================================================
    # Before running, update the following variables:
    #   MODEL_PATH: path to the model weights directory
    # ============================================================

    export SGLANG_SET_CPU_AFFINITY=1

    unset https_proxy
    unset http_proxy
    unset HTTPS_PROXY
    unset HTTP_PROXY
    unset ASCEND_LAUNCH_BLOCKING
    # cann
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    source /usr/local/Ascend/nnal/atb/set_env.sh

    export STREAMS_PER_DEVICE=32
    export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

    export HCCL_BUFFSIZE=1000
    export HCCL_SOCKET_IFNAME=lo
    export GLOO_SOCKET_IFNAME=lo
    export TRANSFORMERS_VERBOSITY=error

    #DEEPEP
    export DEEPEP_NORMAL_LONG_SEQ_ROUND=72
    export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024
    export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
    export DEEP_NORMAL_MODE_USE_INT8_QUANT=1

    MODEL_PATH=/path/to/model-weights

    python3 -m sglang.launch_server \
            --model-path $MODEL_PATH \
            --attention-backend ascend \
            --device npu \
            --tp-size 8 \
            --nnodes 1 \
            --dp-size 1 \
            --enable-dp-attention \
            --chunked-prefill-size -1 \
            --max-prefill-tokens 65536 \
            --trust-remote-code \
            --mem-fraction-static 0.9 \
            --served-model-name glm-5 \
            --cuda-graph-bs 8 \
            --max-running-requests 102 \
            --quantization modelslim \
            --speculative-draft-model-quantization unquant \
            --moe-a2a-backend deepep --deepep-mode auto \
            --load-balance-method round_robin
    ```
  </Tab>
</Tabs>

### Multi-node deployment

The quantized model `GLM-5.2-w8a8` can be deployed across two Atlas 800I A3 nodes.

Set the IP addresses of both nodes and the network interface names in the script, then run the same script on both nodes.

```shell theme={null}
# ============================================================
# Before running, update the following variables:
#   IPS: IP addresses of each node in the cluster
#   IP_MASTER: rank 0 node IP address with port
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export STREAMS_PER_DEVICE=32
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
# MTP OVERLAP
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

export SGLANG_NPU_USE_MULTI_STREAM=1
export HCCL_BUFFSIZE=1000
export HCCL_OP_EXPANSION_MODE=AIV

# Run `ifconfig` on each node and find the interface whose inet address matches that node's IP.
# Replace `lo` below with the name of that interface.
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo

# DEEPEP
export DEEPEP_NORMAL_LONG_SEQ_ROUND=72
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1


IPS=('<your node1 ip>' '<your node2 ip>')
IP_MASTER="${IPS[0]}:5000"

MODEL_PATH=/path/to/model-weights

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
for i in "${!IPS[@]}";
do
    if [[ "$LOCAL_HOST1" == "${IPS[$i]}" || "$LOCAL_HOST2" == "${IPS[$i]}" ]];
    then
        echo "${IPS[$i]}"
        python3 -m sglang.launch_server \
        --model-path $MODEL_PATH \
        --attention-backend ascend \
        --device npu \
        --tp-size 32 --nnodes 2 --node-rank $i --dist-init-addr $IP_MASTER \
        --chunked-prefill-size 16384 --max-prefill-tokens 131072 \
        --trust-remote-code \
        --host 127.0.0.1 \
        --mem-fraction-static 0.8 \
        --port 8000 \
        --served-model-name glm-5 \
        --cuda-graph-max-bs 32 \
        --moe-a2a-backend deepep \
        --deepep-mode auto \
        --speculative-draft-model-quantization unquant \
        --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4  \
        --disable-radix-cache
        NODE_RANK=$i
        break
    fi
done
```
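
The interface lookup described in the script comments can also be done without reading `ifconfig` output by hand. The sketch below assumes the `ip` tool from iproute2 is available on the node. The function name `iface_for_ip` is hypothetical; it only parses the one-line-per-address format printed by `ip -o -4 addr show`.

```shell
# Sketch: read `ip -o -4 addr show` output on stdin and print the name of the
# interface that carries the given IPv4 address (prints nothing if none matches).
# Usage: ip -o -4 addr show | iface_for_ip NODE_IP
iface_for_ip() {
    # In this format, field 2 is the interface name and field 4 is "address/prefix".
    awk -v ip="$1" '{ split($4, addr, "/"); if (addr[1] == ip) { print $2; exit } }'
}

# Example (run on each node with that node's own IP):
# export HCCL_SOCKET_IFNAME=$(ip -o -4 addr show | iface_for_ip "<your node ip>")
# export GLOO_SOCKET_IFNAME=$HCCL_SOCKET_IFNAME
```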

### Prefill-decode disaggregation deployment

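PD disaggregation splits the prefill and decode stages onto separate nodes, reducing interference and improving
throughput for high-concurrency scenarios. The script below handles both roles: run it on the prefill node and on the
decode node, and it launches the matching service by comparing the local IP addresses with `P_IP` and `D_IP`.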

```shell theme={null}
# ============================================================
# Before running, update the following variables:
#   ASCEND_MF_STORE_URL: prefill master IP address with port
#   P_IP: prefill node IP address
#   D_IP: decode node IP address
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
# pd transfer, prefill master IP
export ASCEND_MF_STORE_URL="tcp://<your prefill ip>:24707"
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

P_IP=('<your prefill ip>')
D_IP=('<your decode ip>')

MODEL_PATH=/path/to/model-weights

export TRANSFORMERS_VERBOSITY=error

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export DEEPEP_NORMAL_LONG_SEQ_ROUND=72
        export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024
        export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
        export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
        export TASK_QUEUE_ENABLE=2
        export HCCL_SOCKET_IFNAME=lo
        export GLOO_SOCKET_IFNAME=lo

        # prefill node
        python -m sglang.launch_server --model-path ${MODEL_PATH}  --disaggregation-mode prefill --host ${P_IP[$i]} \
        --port 8000 --disaggregation-bootstrap-port 8998 --trust-remote-code --nnodes 1 --node-rank $i \
        --tp-size 16 --mem-fraction-static 0.8 --attention-backend ascend --device npu --quantization modelslim \
        --disaggregation-transfer-backend ascend --max-running-requests 64 \
        --served-model-name glm-5 --chunked-prefill-size 524288 --max-prefill-tokens 180000 --moe-a2a-backend deepep --deepep-mode normal \
        --disable-shared-experts-fusion --disable-cuda-graph --dtype bfloat16 \
        --dp-size 4 --enable-dp-attention \
        --load-balance-method round_robin \
        --enable-dp-lm-head --moe-dense-tp 1 \
        --speculative-draft-model-quantization unquant \
        --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2

        # cp
        #--enable-nsa-prefill-context-parallel \
        #--nsa-prefill-cp-mode in-seq-split \
        #--attn-cp-size 4 \
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"

        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export HCCL_BUFFSIZE=650

        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32
        export TASK_QUEUE_ENABLE=0

        export HCCL_SOCKET_IFNAME=lo
        export GLOO_SOCKET_IFNAME=lo

        export SGLANG_NPU_USE_MULTI_STREAM=1

        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
        --port 8003 --trust-remote-code --nnodes 1 --node-rank $i --tp-size 16 --dp-size 16 --ep-size 16 \
        --mem-fraction-static 0.8 --max-running-requests 128 --attention-backend ascend --device npu --quantization modelslim \
        --served-model-name glm-5 --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency \
        --cuda-graph-max-bs 4 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 180000 \
        --tokenizer-worker-num 4 --prefill-round-robin-balance --disable-shared-experts-fusion --dtype bfloat16  --load-balance-method round_robin \
        --speculative-draft-model-quantization unquant \
        --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
        NODE_RANK=$i
        break
    fi
done

# Reached after a launched server exits, or when no local IP matches P_IP or D_IP.
exit 1
```
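
The deployment scripts pick a node's role and rank by comparing the first two addresses reported by `hostname -I` with the configured IP lists. On hosts with more interfaces, the same matching can be written to check every local address. The helper below is a sketch under that assumption; `find_local_rank` is a hypothetical name and is not used by the scripts above.

```shell
# Sketch: print the index of the first candidate IP that is also a local address.
# $1 is a space-separated list of local IPs (for example, the output of `hostname -I`);
# the remaining arguments are the candidate node IPs in rank order.
# Returns non-zero and prints nothing when no candidate is local.
find_local_rank() (
    local_ips="$1"
    shift
    rank=0
    for candidate in "$@"; do
        for ip in $local_ips; do
            if [ "$ip" = "$candidate" ]; then
                echo "$rank"
                exit 0
            fi
        done
        rank=$((rank + 1))
    done
    exit 1
)

# Example with the P_IP array from the script above:
# rank=$(find_local_rank "$(hostname -I)" "${P_IP[@]}") && echo "prefill rank: $rank"
```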

Launch the router after the prefill and decode services are ready.

```shell theme={null}
# ============================================================
# Before running, update the following variables:
#   P_MASTER_IP: prefill master IP address
#   D_MASTER_IP: decode master IP address
#   ROUTER_HOST_IP: router node IP address
# ============================================================

P_MASTER_IP="<your prefill ip>"
D_MASTER_IP="<your decode ip>"
ROUTER_HOST_IP="<your router ip>"

python3 -m sglang_router.launch_router \
--pd-disaggregation \
--policy round_robin \
--prefill http://${P_MASTER_IP}:8000 8998 \
--decode http://${D_MASTER_IP}:8003 \
--host ${ROUTER_HOST_IP} \
--port 6688
```

## Functional verification

After the service is started, you can invoke the model by sending a prompt:

```shell theme={null}
# ============================================================
# Before running, update the following variables:
#   HOST: the server host address (e.g., localhost)
#   PORT: the server port number (e.g., 8000)
# ============================================================

curl http://${HOST}:${PORT}/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "What is the capital of France?",
        "sampling_params": {
            "max_new_tokens": 64,
            "temperature": 0
        }
    }'
```

Expected result: an HTTP 200 response with the generated text containing "Paris".
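
This expectation can be checked automatically. The sketch below uses a plain substring match, so it does not depend on the exact shape of the response JSON. The function name `generate_response_ok` and the temporary file path are hypothetical.

```shell
# Sketch: succeed only if the HTTP status is 200 and the body mentions "Paris".
# Usage: generate_response_ok STATUS BODY
generate_response_ok() {
    [ "$1" = "200" ] && printf '%s' "$2" | grep -q 'Paris'
}

# Example, reusing HOST and PORT from the request above:
# status=$(curl -s -o /tmp/generate_body.json -w '%{http_code}' \
#     "http://${HOST}:${PORT}/generate" -H "Content-Type: application/json" \
#     -d '{"text": "What is the capital of France?", "sampling_params": {"max_new_tokens": 64, "temperature": 0}}')
# generate_response_ok "$status" "$(cat /tmp/generate_body.json)" && echo "verification passed"
```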

Once the server prints `The server is fired up and ready to roll!` in the logs, it is ready to accept requests. For more
testing examples (Health Check, Generate, Chat Completions, and port usage guidance),
see [Testing the Service](/docs/hardware-platforms/ascend-npus/ascend_npu#testing-the-service).
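
When the server runs in the background, a script can wait for the readiness message before sending requests. The sketch below assumes the server output has been redirected to a log file; the function name `wait_for_ready_log` and the log path are hypothetical.

```shell
# Sketch: poll LOG_FILE once per second until it contains the readiness message,
# giving up after TIMEOUT_S seconds.
# Usage: wait_for_ready_log LOG_FILE TIMEOUT_S
wait_for_ready_log() (
    log_file="$1"
    timeout_s="$2"
    waited=0
    until grep -qF 'The server is fired up and ready to roll!' "$log_file" 2>/dev/null; do
        if [ "$waited" -ge "$timeout_s" ]; then
            exit 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    exit 0
)

# Example (hypothetical log path; large models can take a long time to load):
# python3 -m sglang.launch_server ... > sglang_server.log 2>&1 &
# wait_for_ready_log sglang_server.log 1800 || echo "server did not become ready" >&2
```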

## Accuracy evaluation

For accuracy evaluation methods and datasets, see [Accuracy Evaluation on Ascend NPU](/docs/hardware-platforms/ascend-npus/ascend_npu_accuracy_evaluation).

## Performance

For performance data and benchmark commands, see [Performance Testing on Ascend NPU](/docs/hardware-platforms/ascend-npus/ascend_npu_performance_testing).

## Performance tuning

For the full list of supported features, see [Supported features](#supported-features). For detailed optimization
guidance, see [Optimization on Ascend NPU](/docs/hardware-platforms/ascend-npus/ascend_npu_optimization).

## FAQ

For common environment, installation, and general parameter issues, please refer to the [Ascend NPU FAQ](/docs/hardware-platforms/ascend-npus/ascend_npu_faq).
This section only covers model-specific issues.
