GLM-5.1 - SGLang Documentation

This page covers the optimal configurations and benchmark results for GLM-5.1 on Ascend NPUs. For environment setup, model weight download, feature configuration, and deployment instructions, see the GLM-5.1 Model Tutorial. On A3, each card has two dies, so --tp-size is twice the card count (for example, --tp-size 32 for the 16-card PD Mixed deployment); see the Ascend NPU Reference for details.
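
The die arithmetic can be sketched as a small shell helper. This is only an illustration, assuming that every die hosts exactly one tensor-parallel rank; the names `DIES_PER_CARD` and `tp_size_for_cards` are hypothetical and not part of SGLang.

```shell
# Illustrative helper, not part of SGLang.
# Assumption: each A3 card exposes 2 dies and each die runs one rank.
DIES_PER_CARD=2

# tp_size_for_cards CARDS: print the --tp-size for a pure tensor-parallel instance.
tp_size_for_cards() {
    local cards="$1"
    echo $(( cards * DIES_PER_CARD ))
}

tp_size_for_cards 16   # prints 32
```

For the 16-card PD Mixed deployment on this page the helper yields 32, which matches the --tp-size 32 in its launch command.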

Low Latency

Model	Hardware	Cards	Deploy Mode	Dataset	TPOT	TTFT	Quantization	Configuration
GLM-5.1	Atlas 800I A3	32	PD Disaggregation	65k+1.5k (90% prefix cache hit rate)	25ms	-	W4A8 INT8	Optimal Configuration

High Throughput

Model	Hardware	Cards	Deploy Mode	Dataset	TPOT	TTFT	Quantization	Configuration
GLM-5.1	Atlas 800I A3	16	PD Mixed	3.5k+1.5k	50ms	-	W4A8 INT8	Optimal Configuration
GLM-5.1	Atlas 800I A3	32	PD Disaggregation	128k+1k	56.4ms	13.1s	W4A8 INT8	Optimal Configuration
GLM-5.1	Atlas 800I A3	32	PD Disaggregation	16k+1k	50ms	-	W4A8 INT8	Optimal Configuration
GLM-5.1	Atlas 800I A3	32	PD Disaggregation	64k+1k	55.2ms	7.58s	W4A8 INT8	Optimal Configuration
GLM-5.1	Atlas 800I A3	32	PD Disaggregation	64k+1k	50ms	-	W4A8 INT8	Optimal Configuration
GLM-5.1	Atlas 800I A3	48	PD Disaggregation	65k+1.5k (100% prefix cache hit rate)	33ms	-	W4A8 INT8	Optimal Configuration
GLM-5.1	Atlas 800I A3	48	PD Disaggregation	128k+1k (90% prefix cache hit rate)	50ms	-	W4A8 INT8	Optimal Configuration
GLM-5.1	Atlas 800I A3	48	PD Disaggregation	64k+1k (90% prefix cache hit rate)	50ms	-	W4A8 INT8	Optimal Configuration
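
TPOT is the time per output token, so its reciprocal gives the decode speed seen by a single request. The sketch below performs that conversion for a few TPOT values from the tables above. The function name `tokens_per_second` is illustrative, and the result describes one request stream rather than aggregate cluster throughput.

```shell
# tokens_per_second TPOT_MS: per-request decode speed (tokens/s)
# for a TPOT given in milliseconds. Illustrative helper only.
tokens_per_second() {
    LC_ALL=C awk -v tpot_ms="$1" 'BEGIN { printf "%.1f\n", 1000 / tpot_ms }'
}

tokens_per_second 25     # prints 40.0
tokens_per_second 50     # prints 20.0
tokens_per_second 56.4   # prints 17.7
```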

Optimal Configuration

GLM-5.1 W4A8 16P IN3K5 OUT1K5 50ms

Model: GLM-5.1; Hardware: Atlas 800I A3; Cards: 16; Deploy Mode: PD Mixed; Quantization: W4A8 INT8; Dataset: 3.5k+1.5k; TPOT: 50ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   NODE_IPS: IP addresses of each node in the cluster
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
NODE_IPS=('<your node1 ip>' '<your node2 ip>')

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=2500
export HCCL_SOCKET_IFNAME=<network-interface>
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

# First two local IP addresses; one of them must match an entry in NODE_IPS.
LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

for i in "${!NODE_IPS[@]}";
do
    if [[ "$LOCAL_HOST1" == "${NODE_IPS[$i]}" || "$LOCAL_HOST2" == "${NODE_IPS[$i]}" ]];
    then
        echo "${NODE_IPS[$i]}"
        python3 -m sglang.launch_server \
        --model-path $MODEL_PATH \
        --host ${NODE_IPS[$i]} --port 6688 \
        --nnodes 2 \
        --dist-init-addr ${NODE_IPS[0]}:5000 \
        --node-rank $i \
        --attention-backend ascend \
        --device npu \
        --tp-size 32 \
        --dp-size 16 \
        --enable-dp-attention \
        --chunked-prefill-size 65536 \
        --max-prefill-tokens 280000 \
        --trust-remote-code \
        --mem-fraction-static 0.65 \
        --served-model-name glm-5 \
        --cuda-graph-max-bs-decode 16 \
        --max-running-requests 256 \
        --quantization modelslim \
        --speculative-draft-model-quantization unquant \
        --moe-a2a-backend deepep \
        --deepep-mode auto \
        --load-balance-method round_robin \
        --speculative-algorithm NEXTN \
        --speculative-num-steps 3 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 4 \
        --reasoning-parser glm45 \
        --tool-call-parser glm47
        break
    fi
done
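
The node-matching loop in the script above can also be expressed as a reusable function. The sketch below is an alternative formulation, not the original author's code. The name `find_node_rank` is hypothetical. It checks every address reported by `hostname -I` rather than only the first two, and it returns a non-zero status when no address matches, whereas the loop above simply launches nothing in that case. The addresses in the example call are made up.

```shell
# find_node_rank LOCAL_IPS NODE_IP...
# LOCAL_IPS is a whitespace-separated list of this machine's addresses.
# Prints the zero-based index of the first NODE_IP found in LOCAL_IPS
# and returns 0; prints nothing and returns 1 if none matches.
find_node_rank() {
    local local_ips="$1"
    shift
    local rank=0
    local node_ip ip
    for node_ip in "$@"; do
        # Intentionally unquoted: split LOCAL_IPS on whitespace.
        for ip in $local_ips; do
            if [ "$ip" = "$node_ip" ]; then
                echo "$rank"
                return 0
            fi
        done
        rank=$(( rank + 1 ))
    done
    return 1
}

# On a real node (bash): find_node_rank "$(hostname -I)" "${NODE_IPS[@]}"
find_node_rank "10.0.0.12 192.168.1.12" 10.0.0.11 10.0.0.12   # prints 1
```

The printed index is the value that the launch command passes as --node-rank.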

Benchmark

The benchmark uses the random dataset of sglang.bench_serving (--dataset-name random).

Command

python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 128 \
    --random-input-len 3500 \
    --random-output-len 1500 \
    --num-prompts 128 \
    --random-range-ratio 1

GLM-5.1 W4A8 1P1D 32P IN128K OUT1K 56.4ms

Model: GLM-5.1; Hardware: Atlas 800I A3; Cards: 32; Deploy Mode: PD Disaggregation; Quantization: W4A8 INT8; Dataset: 128k+1k; TPOT: 56.4ms; TTFT: 13.1s

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   P_IP: prefill node IP address
#   D_IP: decode node IP address
#   ASCEND_MF_STORE_URL: prefill node IP with port
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================


echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=1200
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=1200
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

P_IP=('<your prefill ip1>' '<your prefill ip2>')
D_IP=('<your decode ip1>' '<your decode ip2>')

export ASCEND_MF_STORE_URL="tcp://<your prefill ip1>:24670"

MODEL_PATH=/path/to/model-weights

# First two local IP addresses; one of them must match an entry in P_IP or D_IP.
LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
        export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024
        export DEEPEP_NORMAL_LONG_SEQ_ROUND=72
        export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
        export ENABLE_PROFILING=0
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export TASK_QUEUE_ENABLE=2

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode prefill \
        --host ${P_IP[$i]} \
        --port 8000 \
        --dist-init-addr ${P_IP[0]}:5000 \
        --disaggregation-bootstrap-port 8998 \
        --node-rank $i \
        --tp-size 4 \
        --nnodes 2 \
        --mem-fraction-static 0.72 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --disaggregation-transfer-backend ascend \
        --max-running-requests 16 \
        --served-model-name glm-5 \
        --chunked-prefill-size 8192 \
        --max-prefill-tokens 180000 \
        --moe-a2a-backend deepep \
        --deepep-mode normal \
        --disable-shared-experts-fusion \
        --disable-cuda-graph \
        --dtype bfloat16 \
        --speculative-draft-model-quantization unquant \
        --enable-nsa-prefill-context-parallel \
        --nsa-prefill-cp-mode in-seq-split \
        --attn-cp-size 4 \
        --enable-dp-lm-head \
        --moe-dense-tp 1 \
        --pp-size 8 \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=16
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_SPEC_ENABLE_OVERLAP_REFLOW=1
        export TASK_QUEUE_ENABLE=0

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode decode \
        --host ${D_IP[$i]} \
        --port 8001 \
        --dist-init-addr ${D_IP[0]}:5000 \
        --node-rank $i \
        --tp-size 32 \
        --nnodes 2 \
        --dp-size 32 \
        --ep-size 32 \
        --enable-dp-attention \
        --mem-fraction-static 0.85 \
        --max-running-requests 32 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --served-model-name glm-5 \
        --moe-a2a-backend deepep \
        --deepep-mode low_latency \
        --cuda-graph-bs 1 2 3 \
        --disaggregation-transfer-backend ascend \
        --watchdog-timeout 9000 \
        --context-length 180000 \
        --tokenizer-worker-num 16 \
        --prefill-round-robin-balance \
        --disable-shared-experts-fusion \
        --dtype bfloat16 \
        --load-balance-method round_robin \
        --speculative-draft-model-quantization unquant \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done
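
A quick way to check a parallel layout against the card count is to count ranks. The sketch below assumes that one instance launches tp × pp ranks and that each rank occupies one die (two dies per A3 card, as noted at the top of this page). Both helper names are hypothetical.

```shell
# world_size TP [PP]: ranks launched by one instance.
# Assumption: ranks = tp-size * pp-size (pp-size defaults to 1).
world_size() {
    echo $(( $1 * ${2:-1} ))
}

# cards_for_ranks RANKS: cards needed, assuming one rank per die
# and 2 dies per A3 card.
cards_for_ranks() {
    echo $(( $1 / 2 ))
}

prefill_ranks=$(world_size 4 8)   # prefill: --tp-size 4 --pp-size 8
decode_ranks=$(world_size 32)     # decode:  --tp-size 32
total_cards=$(( $(cards_for_ranks "$prefill_ranks") + $(cards_for_ranks "$decode_ranks") ))

echo "prefill=${prefill_ranks} decode=${decode_ranks} cards=${total_cards}"
# prints prefill=32 decode=32 cards=32
```

Under these assumptions the prefill and decode instances each occupy 16 cards, which adds up to the 32 cards listed for this configuration.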

Command

# ============================================================
# Before running, replace the following placeholders:
#   <your prefill ip>: prefill node IP address
#   <your decode ip1>: first decode node IP address (decode may have distributed nodes)
# ============================================================

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://<your prefill ip>:8000 8998 \
    --decode http://<your decode ip1>:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --policy round_robin
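
The router addresses have to agree with the ports chosen in the launch script (8000 for prefill, 8998 for the bootstrap port, 8001 for decode). The sketch below prints the router command for a given pair of addresses so the placeholders do not need to be edited by hand. The function name `router_cmd` is hypothetical, the addresses are example values, and it is an assumption that the first prefill node and the first decode node are the ones to register.

```shell
# router_cmd PREFILL_IP DECODE_IP
# Prints (does not run) the router launch command for the ports used
# in the launch script: 8000 prefill, 8998 bootstrap, 8001 decode.
router_cmd() {
    local prefill_ip="$1" decode_ip="$2"
    printf '%s ' python -m sglang_router.launch_router \
        --pd-disaggregation \
        --prefill "http://${prefill_ip}:8000" 8998 \
        --decode "http://${decode_ip}:8001" \
        --host 127.0.0.1 --port 6688 --policy round_robin
    printf '\n'
}

router_cmd 10.0.0.11 10.0.0.21   # example addresses, not real nodes
```

In the launch script these arguments would typically be `"${P_IP[0]}"` and `"${D_IP[0]}"`.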

Benchmark

The benchmark uses the random dataset of sglang.bench_serving (--dataset-name random).

Command

python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 1 \
    --random-input-len 131072 \
    --random-output-len 1024 \
    --num-prompts 1 \
    --random-range-ratio 1

GLM-5.1 W4A8 1P1D 32P IN16K OUT1K 50ms

Model: GLM-5.1; Hardware: Atlas 800I A3; Cards: 32; Deploy Mode: PD Disaggregation; Quantization: W4A8 INT8; Dataset: 16k+1k; TPOT: 50ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   P_IP: prefill node IP address
#   D_IP: decode node IP address
#   ASCEND_MF_STORE_URL: prefill node IP with port
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================


echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

P_IP=('<your prefill ip1>' '<your prefill ip2>')
D_IP=('<your decode ip1>' '<your decode ip2>')

export ASCEND_MF_STORE_URL="tcp://<your prefill ip1>:24670"

MODEL_PATH=/path/to/model-weights

# First two local IP addresses; one of them must match an entry in P_IP or D_IP.
LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
        export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
        export ENABLE_PROFILING=0
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export TASK_QUEUE_ENABLE=2

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode prefill \
        --host ${P_IP[$i]} \
        --port 8000 \
        --dist-init-addr ${P_IP[0]}:5000 \
        --disaggregation-bootstrap-port 8998 \
        --node-rank $i \
        --tp-size 32 \
        --nnodes 2 \
        --mem-fraction-static 0.75 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --disaggregation-transfer-backend ascend \
        --max-running-requests 64 \
        --served-model-name glm-5 \
        --chunked-prefill-size 524288 \
        --max-prefill-tokens 180000 \
        --moe-a2a-backend deepep \
        --deepep-mode normal \
        --disable-shared-experts-fusion \
        --disable-cuda-graph \
        --dtype bfloat16 \
        --dp-size 4 \
        --enable-dp-attention \
        --load-balance-method round_robin \
        --enable-nsa-prefill-context-parallel \
        --nsa-prefill-cp-mode in-seq-split \
        --attn-cp-size 8 \
        --enable-dp-lm-head \
        --moe-dense-tp 1 \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=650
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_SPEC_ENABLE_OVERLAP_REFLOW=1
        export TASK_QUEUE_ENABLE=0

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode decode \
        --host ${D_IP[$i]} \
        --port 8001 \
        --dist-init-addr ${D_IP[0]}:5000 \
        --node-rank $i \
        --tp-size 32 \
        --nnodes 2 \
        --dp-size 32 \
        --ep-size 32 \
        --enable-dp-attention \
        --mem-fraction-static 0.87 \
        --max-running-requests 96 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --served-model-name glm-5 \
        --moe-a2a-backend deepep \
        --deepep-mode low_latency \
        --cuda-graph-bs 1 2 3 \
        --disaggregation-transfer-backend ascend \
        --watchdog-timeout 9000 \
        --context-length 180000 \
        --tokenizer-worker-num 4 \
        --prefill-round-robin-balance \
        --disable-shared-experts-fusion \
        --dtype bfloat16 \
        --load-balance-method round_robin \
        --speculative-draft-model-quantization unquant \
        --speculative-algorithm NEXTN \
        --speculative-num-steps 3 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 4 \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done

Command

# ============================================================
# Before running, replace the following placeholders:
#   <your prefill ip>: prefill node IP address
#   <your decode ip1>: first decode node IP address (decode may have distributed nodes)
# ============================================================

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://<your prefill ip>:8000 8998 \
    --decode http://<your decode ip1>:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --policy round_robin

Benchmark

The benchmark uses the random dataset of sglang.bench_serving (--dataset-name random).

Command

python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 128 \
    --random-input-len 16384 \
    --random-output-len 1024 \
    --num-prompts 512 \
    --random-range-ratio 1

GLM-5.1 W4A8 1P1D 32P IN64K OUT1K 55.2ms

Model: GLM-5.1; Hardware: Atlas 800I A3; Cards: 32; Deploy Mode: PD Disaggregation; Quantization: W4A8 INT8; Dataset: 64k+1k; TPOT: 55.2ms; TTFT: 7.58s

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   P_IP: prefill node IP address
#   D_IP: decode node IP address
#   ASCEND_MF_STORE_URL: prefill node IP with port
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================


echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=1200
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=1200
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

P_IP=('<your prefill ip1>' '<your prefill ip2>')
D_IP=('<your decode ip1>' '<your decode ip2>')

export ASCEND_MF_STORE_URL="tcp://<your prefill ip1>:24670"

MODEL_PATH=/path/to/model-weights

# First two local IP addresses; one of them must match an entry in P_IP or D_IP.
LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
        export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024
        export DEEPEP_NORMAL_LONG_SEQ_ROUND=72
        export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
        export ENABLE_PROFILING=0
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export TASK_QUEUE_ENABLE=2

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode prefill \
        --host ${P_IP[$i]} \
        --port 8000 \
        --dist-init-addr ${P_IP[0]}:5000 \
        --disaggregation-bootstrap-port 8998 \
        --node-rank $i \
        --tp-size 4 \
        --nnodes 2 \
        --mem-fraction-static 0.72 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --disaggregation-transfer-backend ascend \
        --max-running-requests 16 \
        --served-model-name glm-5 \
        --chunked-prefill-size 8192 \
        --max-prefill-tokens 180000 \
        --moe-a2a-backend deepep \
        --deepep-mode normal \
        --disable-shared-experts-fusion \
        --disable-cuda-graph \
        --dtype bfloat16 \
        --speculative-draft-model-quantization unquant \
        --enable-nsa-prefill-context-parallel \
        --nsa-prefill-cp-mode in-seq-split \
        --attn-cp-size 4 \
        --enable-dp-lm-head \
        --moe-dense-tp 1 \
        --pp-size 8 \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=16
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_SPEC_ENABLE_OVERLAP_REFLOW=1
        export TASK_QUEUE_ENABLE=0

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode decode \
        --host ${D_IP[$i]} \
        --port 8001 \
        --dist-init-addr ${D_IP[0]}:5000 \
        --node-rank $i \
        --tp-size 32 \
        --nnodes 2 \
        --dp-size 32 \
        --enable-dp-attention \
        --ep-size 32 \
        --mem-fraction-static 0.85 \
        --max-running-requests 32 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --served-model-name glm-5 \
        --moe-a2a-backend deepep \
        --deepep-mode low_latency \
        --cuda-graph-bs 1 2 3 \
        --disaggregation-transfer-backend ascend \
        --watchdog-timeout 9000 \
        --context-length 180000 \
        --tokenizer-worker-num 16 \
        --prefill-round-robin-balance \
        --disable-shared-experts-fusion \
        --dtype bfloat16 \
        --load-balance-method round_robin \
        --speculative-draft-model-quantization unquant \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done

Command

# ============================================================
# Before running, replace the following placeholders:
#   <your prefill ip>: prefill node IP address
#   <your decode ip1>: first decode node IP address (decode may have distributed nodes)
# ============================================================

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://<your prefill ip>:8000 8998 \
    --decode http://<your decode ip1>:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --policy round_robin

Benchmark

The benchmark uses the random dataset of sglang.bench_serving (--dataset-name random).

Command

python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 1 \
    --random-input-len 65536 \
    --random-output-len 1024 \
    --num-prompts 1 \
    --random-range-ratio 1

GLM-5.1 W4A8 1P1D 32P IN64K OUT1K 50ms

Model: GLM-5.1; Hardware: Atlas 800I A3; Cards: 32; Deploy Mode: PD Disaggregation; Quantization: W4A8 INT8; Dataset: 64k+1k; TPOT: 50ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   P_IP: prefill node IP address
#   D_IP: decode node IP address
#   ASCEND_MF_STORE_URL: prefill node IP with port
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================


echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=1200
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=1200
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

P_IP=('<your prefill ip1>' '<your prefill ip2>')
D_IP=('<your decode ip1>' '<your decode ip2>')

export ASCEND_MF_STORE_URL="tcp://<your prefill ip1>:24670"

MODEL_PATH=/path/to/model-weights

# First two local IP addresses; one of them must match an entry in P_IP or D_IP.
LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
        export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024
        export DEEPEP_NORMAL_LONG_SEQ_ROUND=72
        export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
        export ENABLE_PROFILING=0
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export TASK_QUEUE_ENABLE=2

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode prefill \
        --host ${P_IP[$i]} \
        --port 8000 \
        --dist-init-addr ${P_IP[0]}:5000 \
        --disaggregation-bootstrap-port 8998 \
        --node-rank $i \
        --tp-size 4 \
        --nnodes 2 \
        --mem-fraction-static 0.72 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --disaggregation-transfer-backend ascend \
        --max-running-requests 16 \
        --served-model-name glm-5 \
        --chunked-prefill-size 16384 \
        --max-prefill-tokens 180000 \
        --moe-a2a-backend deepep \
        --deepep-mode normal \
        --disable-shared-experts-fusion \
        --disable-cuda-graph \
        --dtype bfloat16 \
        --speculative-draft-model-quantization unquant \
        --enable-nsa-prefill-context-parallel \
        --nsa-prefill-cp-mode in-seq-split \
        --attn-cp-size 4 \
        --enable-dp-lm-head \
        --moe-dense-tp 1 \
        --pp-size 8 \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=16
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_SPEC_ENABLE_OVERLAP_REFLOW=1
        export TASK_QUEUE_ENABLE=0

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode decode \
        --host ${D_IP[$i]} \
        --port 8001 \
        --dist-init-addr ${D_IP[0]}:5000 \
        --node-rank $i \
        --tp-size 32 \
        --nnodes 2 \
        --dp-size 32 \
        --enable-dp-attention \
        --ep-size 32 \
        --mem-fraction-static 0.85 \
        --max-running-requests 32 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --served-model-name glm-5 \
        --moe-a2a-backend deepep \
        --deepep-mode low_latency \
        --cuda-graph-bs 1 2 3 \
        --disaggregation-transfer-backend ascend \
        --watchdog-timeout 9000 \
        --context-length 180000 \
        --tokenizer-worker-num 16 \
        --prefill-round-robin-balance \
        --disable-shared-experts-fusion \
        --dtype bfloat16 \
        --load-balance-method round_robin \
        --speculative-draft-model-quantization unquant \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done

Command

# ============================================================
# Before running, replace the following placeholders:
#   <your prefill ip>: prefill node IP address
#   <your decode ip1>: first decode node IP address (decode may have distributed nodes)
# ============================================================

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://<your prefill ip>:8000 8998 \
    --decode http://<your decode ip1>:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --policy round_robin

Benchmark

The benchmark uses the random dataset of sglang.bench_serving (--dataset-name random).

Command

python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 1 \
    --random-input-len 65536 \
    --random-output-len 1024 \
    --num-prompts 1 \
    --random-range-ratio 1

GLM-5.1 W4A8 1P1D 32P IN65K OUT1K5 PREFIX90 25ms

Model: GLM-5.1; Hardware: Atlas 800I A3; Cards: 32; Deploy Mode: PD Disaggregation; Quantization: W4A8 INT8; Dataset: 65k+1.5k (90% prefix cache hit rate); TPOT: 25ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   P_IP: prefill node IP address
#   D_IP: decode node IP address
#   ASCEND_MF_STORE_URL: prefill node IP with port
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================


echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

P_IP=('<your prefill ip1>' '<your prefill ip2>')
D_IP=('<your decode ip1>' '<your decode ip2>')

export ASCEND_MF_STORE_URL="tcp://<your prefill ip1>:24670"

MODEL_PATH=/path/to/model-weights

# First two local IP addresses; one of them must match an entry in P_IP or D_IP.
LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
        export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
        export ENABLE_PROFILING=0
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export TASK_QUEUE_ENABLE=2

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode prefill \
        --host ${P_IP[$i]} \
        --port 8000 \
        --dist-init-addr ${P_IP[0]}:5000 \
        --disaggregation-bootstrap-port 8998 \
        --node-rank $i \
        --tp-size 32 \
        --nnodes 2 \
        --mem-fraction-static 0.75 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --disaggregation-transfer-backend ascend \
        --max-running-requests 64 \
        --served-model-name glm-5 \
        --chunked-prefill-size 524288 \
        --max-prefill-tokens 180000 \
        --moe-a2a-backend deepep \
        --deepep-mode normal \
        --disable-shared-experts-fusion \
        --disable-cuda-graph \
        --dtype bfloat16 \
        --dp-size 4 \
        --enable-dp-attention \
        --load-balance-method round_robin \
        --enable-nsa-prefill-context-parallel \
        --nsa-prefill-cp-mode in-seq-split \
        --attn-cp-size 8 \
        --enable-dp-lm-head \
        --moe-dense-tp 1 \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=650
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_SPEC_ENABLE_OVERLAP_REFLOW=1
        export TASK_QUEUE_ENABLE=0

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode decode \
        --host ${D_IP[$i]} \
        --port 8001 \
        --dist-init-addr ${D_IP[0]}:5000 \
        --node-rank $i \
        --tp-size 32 \
        --nnodes 2 \
        --dp-size 32 \
        --ep-size 32 \
        --enable-dp-attention \
        --mem-fraction-static 0.87 \
        --max-running-requests 96 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --served-model-name glm-5 \
        --moe-a2a-backend deepep \
        --deepep-mode low_latency \
        --cuda-graph-bs 1 2 3 \
        --disaggregation-transfer-backend ascend \
        --watchdog-timeout 9000 \
        --context-length 180000 \
        --tokenizer-worker-num 4 \
        --prefill-round-robin-balance \
        --disable-shared-experts-fusion \
        --dtype bfloat16 \
        --load-balance-method round_robin \
        --speculative-draft-model-quantization unquant \
        --speculative-algorithm NEXTN \
        --speculative-num-steps 3 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 4 \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done
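
The loops above compare only the first two addresses reported by hostname -I against the node lists. On hosts with more interfaces, a match against every local address may be more robust. The following is a minimal sketch of that idea, not part of the original script; find_node_index is a hypothetical helper name and the addresses are documentation-only examples.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the index of the first entry in a node list
# that matches any of the given local addresses; return 1 if none match.
# Usage: find_node_index "<space-separated local IPs>" ip0 ip1 ...
find_node_index() {
    local local_ips=" $1 "   # pad with spaces so only whole addresses match
    shift
    local idx=0 ip
    for ip in "$@"; do
        if [[ "$local_ips" == *" $ip "* ]]; then
            echo "$idx"
            return 0
        fi
        idx=$((idx + 1))
    done
    return 1
}

# Example with documentation-only addresses (not real nodes).
P_IP=('192.0.2.10' '192.0.2.11')
find_node_index "10.0.0.5 192.0.2.11 172.17.0.1" "${P_IP[@]}"   # prints 1
```

In a real script the first argument would be "$(hostname -I)", and the printed index could be passed to --node-rank.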

Command

# ============================================================
# Before running, replace the following placeholders:
#   <your prefill ip>: IP address of the first prefill node (P_IP[0] in the deployment script)
#   <your decode ip1>: IP address of the first decode node (D_IP[0]); the decode instance may span several nodes
# ============================================================

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://<your prefill ip>:8000 8998 \
    --decode http://<your decode ip1>:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --policy round_robin

Benchmark

The benchmark uses the generated-shared-prefix dataset with a 90% prefix cache hit rate (repeat_rate = 0.9) and a total input length of 66560 tokens:

--gsp-system-prompt-len 59904 = int(66560 * 0.9) is the shared prefix portion.
--gsp-question-len 6656 = int(66560 * (1 - 0.9)) is the unique per-request suffix.
--gsp-num-groups 1 keeps all requests in one prefix group for maximum cache reuse.

Command

python -m sglang.bench_serving \
    --dataset-name generated-shared-prefix \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --gsp-num-groups 1 \
    --gsp-prompts-per-group 480 \
    --gsp-system-prompt-len 59904 \
    --gsp-question-len 6656 \
    --gsp-output-len 1536 \
    --max-concurrency 100 \
    --num-prompts 480 \
    --request-rate inf
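
The two length flags follow from the total input length and the target hit rate. A minimal sketch of that arithmetic is shown here, assuming integer percentages so the shell does not need floating point; gsp_split is a hypothetical helper name, not an sglang command.

```shell
#!/usr/bin/env bash
# Split a total input length into a shared-prefix part and a per-request part.
# Integer division truncates, which mirrors the int(...) in the description.
# gsp_split is a hypothetical helper, not part of sglang.
gsp_split() {
    local total=$1 hit_pct=$2
    local prefix=$(( total * hit_pct / 100 ))
    local question=$(( total * (100 - hit_pct) / 100 ))
    echo "--gsp-system-prompt-len ${prefix} --gsp-question-len ${question}"
}

gsp_split 66560 90   # prints: --gsp-system-prompt-len 59904 --gsp-question-len 6656
```

Because both parts are truncated, their sum can be one token short of the total when the total is not a multiple of 10.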

GLM-5.1 W4A8 1P1D 48P IN65K OUT1K5 PREFIX100 33ms

Model: GLM-5.1 Hardware: Atlas 800I A3 Cards: 48 Deploy Mode: PD Disaggregation Quantization: W4A8 INT8 Dataset: 65k+1.5k (100% prefix cache hit rate) TPOT: 33ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   P_IP: IP addresses of the prefill nodes
#   D_IP: IP addresses of the decode nodes
#   ASCEND_MF_STORE_URL: tcp:// URL built from the first prefill node IP and a port
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================


echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

P_IP=('<your prefill ip1>' '<your prefill ip2>')
D_IP=('<your decode ip1>' '<your decode ip2>' '<your decode ip3>' '<your decode ip4>')

export ASCEND_MF_STORE_URL="tcp://<your prefill ip1>:24670"

MODEL_PATH=/path/to/model-weights

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
        export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export TASK_QUEUE_ENABLE=2

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode prefill \
        --host ${P_IP[$i]} \
        --port 8000 \
        --dist-init-addr ${P_IP[0]}:5000 \
        --disaggregation-bootstrap-port 8998 \
        --node-rank $i \
        --tp-size 32 \
        --nnodes 2 \
        --mem-fraction-static 0.72 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --disaggregation-transfer-backend ascend \
        --max-running-requests 192 \
        --served-model-name glm-5 \
        --chunked-prefill-size 16384 \
        --moe-a2a-backend deepep \
        --deepep-mode normal \
        --disable-shared-experts-fusion \
        --disable-cuda-graph \
        --dtype bfloat16 \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=650
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=48
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1
        export SGLANG_SPEC_ENABLE_OVERLAP_REFLOW=1
        export TASK_QUEUE_ENABLE=0

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode decode \
        --host ${D_IP[$i]} \
        --port 8001 \
        --dist-init-addr ${D_IP[0]}:5000 \
        --node-rank $i \
        --tp-size 64 \
        --nnodes 4 \
        --dp-size 64 \
        --ep-size 64 \
        --enable-dp-attention \
        --mem-fraction-static 0.84 \
        --max-running-requests 192 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --served-model-name glm-5 \
        --moe-a2a-backend deepep \
        --deepep-mode low_latency \
        --enable-dp-lm-head \
        --moe-dense-tp 1 \
        --cuda-graph-bs 1 2 3 \
        --disaggregation-transfer-backend ascend \
        --watchdog-timeout 9000 \
        --context-length 180000 \
        --speculative-draft-model-quantization unquant \
        --speculative-algorithm NEXTN \
        --speculative-num-steps 3 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 4 \
        --tokenizer-worker-num 4 \
        --prefill-round-robin-balance \
        --disable-shared-experts-fusion \
        --dtype bfloat16 \
        --load-balance-method round_robin \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done
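
The --tp-size values above follow from the rule that each A3 card has 2 dies. A small sketch of the arithmetic is shown here. It assumes 8 cards per node, which is inferred from 48 cards spread over the 6 nodes in this setup rather than stated on this page; instance_dies is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Assumption: 8 cards per node (48 cards over 6 nodes in this setup).
CARDS_PER_NODE=8
DIES_PER_CARD=2   # A3: two dies per card

# Hypothetical helper: number of dies (ranks) spanned by one instance.
instance_dies() {
    local nnodes=$1
    echo $(( nnodes * CARDS_PER_NODE * DIES_PER_CARD ))
}

echo "prefill --tp-size: $(instance_dies 2)"   # prints 32
echo "decode  --tp-size: $(instance_dies 4)"   # prints 64
```

In the configurations on this page that also set --pp-size, the product of --tp-size and --pp-size matches this die count instead.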

Command

# ============================================================
# Before running, replace the following placeholders:
#   <your prefill ip>: IP address of the first prefill node (P_IP[0] in the deployment script)
#   <your decode ip1>: IP address of the first decode node (D_IP[0]); the decode instance may span several nodes
# ============================================================

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://<your prefill ip>:8000 8998 \
    --decode http://<your decode ip1>:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --policy round_robin

Benchmark

The benchmark uses the random dataset (--dataset-name random) with fixed input and output lengths (--random-range-ratio 1).

Command

python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 128 \
    --random-input-len 66560 \
    --random-output-len 1536 \
    --num-prompts 512 \
    --random-range-ratio 1

GLM-5.1 W4A8 2P1D 48P IN128K OUT1K PREFIX90 50ms

Model: GLM-5.1 Hardware: Atlas 800I A3 Cards: 48 Deploy Mode: PD Disaggregation Quantization: W4A8 INT8 Dataset: 128k+1k (90% prefix cache hit rate) TPOT: 50ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
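#   P_IP: IP addresses of the prefill nodes (every two consecutive entries form one prefill instance)
#   D_IP: IP addresses of the decode nodes
#   ASCEND_MF_STORE_URL: tcp:// URL built from the first prefill node IP and a port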
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================


echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=1200
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=1200
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

P_IP=('<your prefill ip1>' '<your prefill ip2>' '<your prefill ip3>' '<your prefill ip4>')
D_IP=('<your decode ip1>' '<your decode ip2>')

export ASCEND_MF_STORE_URL="tcp://<your prefill ip1>:24670"

MODEL_PATH=/path/to/model-weights

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
        export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024
        export DEEPEP_NORMAL_LONG_SEQ_ROUND=72
        export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
        export ENABLE_PROFILING=0
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export TASK_QUEUE_ENABLE=2

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode prefill \
        --host ${P_IP[$i]} \
        --port 8000 \
        --dist-init-addr ${P_IP[$(( $i / 2 * 2 ))]}:5000 \
        --disaggregation-bootstrap-port $((8998 + $i / 2)) \
        --node-rank $(( $i % 2 )) \
        --tp-size 4 \
        --nnodes 2 \
        --mem-fraction-static 0.72 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --disaggregation-transfer-backend ascend \
        --max-running-requests 32 \
        --served-model-name glm-5 \
        --chunked-prefill-size 16384 \
        --max-prefill-tokens 180000 \
        --moe-a2a-backend deepep \
        --deepep-mode normal \
        --disable-shared-experts-fusion \
        --disable-cuda-graph \
        --dtype bfloat16 \
        --speculative-draft-model-quantization unquant \
        --enable-nsa-prefill-context-parallel \
        --nsa-prefill-cp-mode in-seq-split \
        --attn-cp-size 4 \
        --enable-dp-lm-head \
        --moe-dense-tp 1 \
        --pp-size 8 \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=24
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_SPEC_ENABLE_OVERLAP_REFLOW=1
        export TASK_QUEUE_ENABLE=0

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode decode \
        --host ${D_IP[$i]} \
        --port 8001 \
        --dist-init-addr ${D_IP[0]}:5000 \
        --node-rank $i \
        --tp-size 32 \
        --nnodes 2 \
        --dp-size 32 \
        --ep-size 32 \
        --enable-dp-attention \
        --mem-fraction-static 0.865 \
        --max-running-requests 96 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --served-model-name glm-5 \
        --moe-a2a-backend deepep \
        --deepep-mode low_latency \
        --cuda-graph-bs 1 2 3 4 5 6 \
        --disaggregation-transfer-backend ascend \
        --watchdog-timeout 9000 \
        --context-length 180000 \
        --tokenizer-worker-num 32 \
        --prefill-round-robin-balance \
        --disable-shared-experts-fusion \
        --dtype bfloat16 \
        --load-balance-method round_robin \
        --speculative-draft-model-quantization unquant \
        --speculative-algorithm NEXTN \
        --speculative-num-steps 3 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 4 \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done
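
The prefill loop above groups every two consecutive P_IP entries into one instance using integer arithmetic on the index. The sketch below only prints the values that arithmetic yields for four nodes, to make the grouping visible; prefill_layout and the two constants are illustrative names, not part of the deployment script.

```shell
#!/usr/bin/env bash
# Print the per-node values the prefill loop derives from index i
# when every two consecutive P_IP entries form one prefill instance.
NODES_PER_INSTANCE=2
BASE_BOOTSTRAP_PORT=8998

prefill_layout() {
    local i=$1
    # Index of the node whose address is used for --dist-init-addr.
    local leader=$(( i / NODES_PER_INSTANCE * NODES_PER_INSTANCE ))
    # One bootstrap port per instance.
    local port=$(( BASE_BOOTSTRAP_PORT + i / NODES_PER_INSTANCE ))
    # Rank of this node inside its instance.
    local rank=$(( i % NODES_PER_INSTANCE ))
    echo "index=${i} leader_index=${leader} bootstrap_port=${port} node_rank=${rank}"
}

for i in 0 1 2 3; do
    prefill_layout "$i"
done
```

Nodes 0 and 1 share bootstrap port 8998 and nodes 2 and 3 share 8999, so the two instances are led by P_IP[0] and P_IP[2].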

Command

# ============================================================
# Before running, replace the following placeholders:
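#   <your prefill ip1>, <your prefill ip2>: IP address of the head node (node rank 0) of each prefill instance;
#       with the P_IP array in the deployment script these are P_IP[0] and P_IP[2]
#   <your decode ip1>: IP address of the first decode node (D_IP[0]); the decode instance may span several nodes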
# ============================================================

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://<your prefill ip1>:8000 8998 \
    --prefill http://<your prefill ip2>:8000 8999 \
    --decode http://<your decode ip1>:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --policy round_robin

Benchmark

The benchmark uses the generated-shared-prefix dataset with a 90% prefix cache hit rate (repeat_rate = 0.9) and a total input length of 131072 tokens:

--gsp-system-prompt-len 117964 = int(131072 * 0.9) is the shared prefix portion.
--gsp-question-len 13107 = int(131072 * (1 - 0.9)) is the unique per-request suffix.
--gsp-num-groups 1 keeps all requests in one prefix group for maximum cache reuse.

Command

python -m sglang.bench_serving \
    --dataset-name generated-shared-prefix \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --gsp-num-groups 1 \
    --gsp-prompts-per-group 576 \
    --gsp-system-prompt-len 117964 \
    --gsp-question-len 13107 \
    --gsp-output-len 1024 \
    --max-concurrency 144 \
    --num-prompts 576 \
    --request-rate inf

GLM-5.1 W4A8 4P1D 48P IN64K OUT1K PREFIX90 50ms

Model: GLM-5.1 Hardware: Atlas 800I A3 Cards: 48 Deploy Mode: PD Disaggregation Quantization: W4A8 INT8 Dataset: 64k+1k (90% prefix cache hit rate) TPOT: 50ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
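#   P_IP: IP addresses of the prefill nodes (each entry is one single-node prefill instance)
#   D_IP: IP addresses of the decode nodes
#   ASCEND_MF_STORE_URL: tcp:// URL built from the first prefill node IP and a port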
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================


echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=1200
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=1200
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

P_IP=('<your prefill ip1>' '<your prefill ip2>' '<your prefill ip3>' '<your prefill ip4>')
D_IP=('<your decode ip1>' '<your decode ip2>')

export ASCEND_MF_STORE_URL="tcp://<your prefill ip1>:24670"

MODEL_PATH=/path/to/model-weights

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
        export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024
        export DEEPEP_NORMAL_LONG_SEQ_ROUND=72
        export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
        export ENABLE_PROFILING=0
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export TASK_QUEUE_ENABLE=2

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode prefill \
        --host ${P_IP[$i]} \
        --port 8000 \
        --disaggregation-bootstrap-port $((8998 + $i)) \
        --node-rank 0 \
        --tp-size 4 \
        --nnodes 1 \
        --mem-fraction-static 0.72 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --disaggregation-transfer-backend ascend \
        --max-running-requests 16 \
        --served-model-name glm-5 \
        --chunked-prefill-size 16384 \
        --max-prefill-tokens 180000 \
        --moe-a2a-backend deepep \
        --deepep-mode normal \
        --disable-shared-experts-fusion \
        --disable-cuda-graph \
        --dtype bfloat16 \
        --speculative-draft-model-quantization unquant \
        --enable-nsa-prefill-context-parallel \
        --nsa-prefill-cp-mode in-seq-split \
        --attn-cp-size 4 \
        --enable-dp-lm-head \
        --moe-dense-tp 1 \
        --pp-size 4 \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=300
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=40
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_SPEC_ENABLE_OVERLAP_REFLOW=1
        export TASK_QUEUE_ENABLE=0

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode decode \
        --host ${D_IP[$i]} \
        --port 8001 \
        --dist-init-addr ${D_IP[0]}:5000 \
        --node-rank $i \
        --tp-size 32 \
        --nnodes 2 \
        --dp-size 32 \
        --ep-size 32 \
        --enable-dp-attention \
        --mem-fraction-static 0.85 \
        --max-running-requests 320 \
        --attention-backend ascend \
        --device npu \
        --quantization modelslim \
        --served-model-name glm-5 \
        --moe-a2a-backend deepep \
        --deepep-mode low_latency \
        --cuda-graph-bs 1 2 3 4 5 6 7 8 9 10 \
        --disaggregation-transfer-backend ascend \
        --watchdog-timeout 9000 \
        --context-length 180000 \
        --tokenizer-worker-num 4 \
        --prefill-round-robin-balance \
        --disable-shared-experts-fusion \
        --dtype bfloat16 \
        --load-balance-method round_robin \
        --speculative-draft-model-quantization unquant \
        --speculative-algorithm NEXTN \
        --speculative-num-steps 3 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 4 \
        --reasoning-parser glm45 \
        --tool-call-parser glm47 \
        --trust-remote-code
        NODE_RANK=$i
        break
    fi
done

Command

# ============================================================
# Before running, replace the following placeholders:
#   <your prefill ip1> ... <your prefill ip4>: IP address of each single-node prefill instance (P_IP[0] to P_IP[3])
#   <your decode ip1>: IP address of the first decode node (D_IP[0]); the decode instance may span several nodes
# ============================================================

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://<your prefill ip1>:8000 8998 \
    --prefill http://<your prefill ip2>:8000 8999 \
    --prefill http://<your prefill ip3>:8000 9000 \
    --prefill http://<your prefill ip4>:8000 9001 \
    --decode http://<your decode ip1>:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --policy round_robin
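
With four single-node prefill instances, each --prefill entry pairs a node URL with the bootstrap port that node was started with (8998 plus its index). One way to avoid mismatched pairs is to generate the entries from the same list. This is a sketch only; build_prefill_args is a hypothetical helper and the addresses are documentation-only examples.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit one "--prefill URL BOOTSTRAP_PORT" line per
# single-node prefill instance, with bootstrap ports counting up from a base.
build_prefill_args() {
    local base_port=$1
    shift
    local i=0 ip
    for ip in "$@"; do
        # "--" stops printf from treating the format string as an option.
        printf -- '--prefill http://%s:8000 %d\n' "$ip" $(( base_port + i ))
        i=$((i + 1))
    done
}

# Documentation-only addresses; replace with the real P_IP entries.
build_prefill_args 8998 192.0.2.1 192.0.2.2 192.0.2.3 192.0.2.4
```

The printed lines have the same form as the --prefill arguments in the router command above.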

Benchmark

The benchmark uses the generated-shared-prefix dataset with a 90% prefix cache hit rate (repeat_rate = 0.9) and a total input length of 65536 tokens:

--gsp-system-prompt-len 58982 = int(65536 * 0.9) is the shared prefix portion.
--gsp-question-len 6553 = int(65536 * (1 - 0.9)) is the unique per-request suffix.
--gsp-num-groups 1 keeps all requests in one prefix group for maximum cache reuse.

Command

python -m sglang.bench_serving \
    --dataset-name generated-shared-prefix \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --gsp-num-groups 1 \
    --gsp-prompts-per-group 1280 \
    --gsp-system-prompt-len 58982 \
    --gsp-question-len 6553 \
    --gsp-output-len 1024 \
    --max-concurrency 320 \
    --num-prompts 1280 \
    --request-rate inf
