DeepSeek Series Models
Low Latency
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 6K+1.6K | 20ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.9K+1K | 19ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.5K+1.5K | 19ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.5K+1K | 19ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-V3.2 | Atlas 800I A3 | 32 | PD Disaggregation | 128K+1K | 26ms | W8A8 INT8 | Optimal Configuration |
High Throughput
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 24 | PD Disaggregation | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 8 | PD Mixed | 2K+2K | 50ms | W4A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 16 | PD Disaggregation | 2K+2K | 50ms | W4A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W4A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 16 | PD Disaggregation | 3.5K+1.5K | 50ms | W4A8 INT8 | Optimal Configuration |
Qwen Series Models
Low Latency
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 11K+1K | 10ms | BF16 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 4 | PD Mixed | 6K+1.5K | 18ms | BF16 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 4 | PD Mixed | 4K+1.5K | 11ms | BF16 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 8 | PD Mixed | 18K+4K | 6ms | BF16 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 6K+1.5K | 18ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 4K+1.5K | 11ms | BF16 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 1K+0.3K | 12ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 6K+1.5K | 17ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 1K+0.3K | 7ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 6K+1.5K | 12ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 5ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-30B-A3B | Atlas 800I A3 | 1 | PD Mixed | 6K+1.5K | 10ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-30B-A3B | Atlas 800I A3 | 1 | PD Mixed | 1K+0.3K | 7ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Next-80B-A3B-Instruct | Atlas 800I A3 | 2 | PD Mixed | 1K+0.3K | 14.21ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Next-80B-A3B-Instruct | Atlas 800I A3 | 2 | PD Mixed | 6K+1.5K | 15.62ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Next-80B-A3B-Instruct | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 20ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-14B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 9ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-27B | Atlas 800I A3 | 2 | PD Mixed | 3.5K+1.5K | 20ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-27B | Atlas 800I A3 | 1 | PD Mixed | 16K+1K | 20ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-27B | Atlas 800I A3 | 1 | PD Mixed | 64K+1K | 20ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-397B-A17B | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 22ms | W4A8 | Optimal Configuration |
High Throughput
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| Qwen3-235B-A22B | Atlas 800I A3 | 24 | PD Disaggregation | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-235B-A22B | Atlas 800I A3 | 16 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-30B-A3B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Coder-480B-A35B-Instruct | Atlas 800I A3 | 24 | PD Disaggregation | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Coder-480B-A35B-Instruct | Atlas 800I A3 | 16 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Coder-480B-A35B-Instruct | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Next-80B-A3B-Instruct | Atlas 800I A3 | 2 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-14B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-27B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-27B | Atlas 800I A3 | 2 | PD Mixed | 16K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-27B | Atlas 800I A3 | 2 | PD Mixed | 64K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-397B-A17B | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W4A8 | Optimal Configuration |
MiniMax Series Models
Low Latency
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| MiniMax-M2.5 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 20ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 8 | PD Mixed | 128K+1K | 20ms | W8A8 INT8 | Optimal Configuration |
High Throughput
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| MiniMax-M2.5 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 8 | PD Mixed | 64K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 8 | PD Mixed | 128K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 4 | PD Mixed | 64K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 16 | PD Disaggregation | 64K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 16 | PD Disaggregation | 128K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
Kimi Series Models
Low Latency
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| Kimi-K2.5-w4a8 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 20ms | W4A8 INT8 | Optimal Configuration |
High Throughput
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| Kimi-K2.5-w4a8 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W4A8 INT8 | Optimal Configuration |
GLM Series Models
High Throughput
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| GLM-5.1 | Atlas 800I A3 | 16 | PD Mixed | 3.5K+1.5K | 41ms | W4A8 | Optimal Configuration |
| GLM-5.1 | Atlas 800I A3 | 32 | PD Disaggregation | 16K+1K | 23ms | W4A8 | Optimal Configuration |
| GLM-5.1 | Atlas 800I A3 | 48 | PD Disaggregation | 64K+1K+90% cache hit | 45ms | W4A8 | Optimal Configuration |
| GLM-5.1 | Atlas 800I A3 | 48 | PD Disaggregation | 128K+1K+90% cache hit | 32ms | W4A8 | Optimal Configuration |
Optimal Configuration
DeepSeek-R1 3_5K-1_5K 50ms on A3 32 Cards Disaggregation Mode
Model: DeepSeek-R1
Hardware: Atlas 800I A3, 32 cards
Deploy Mode: PD Disaggregation
Dataset: random
Input/Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669"
P_IP=('your prefill ip1' 'your prefill ip2')
D_IP=('your decode ip1' 'your decode ip2')
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export SGLANG_USE_AG_AFTER_QLORA=1
export HCCL_BUFFSIZE=800
export TASK_QUEUE_ENABLE=2
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=131072
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 16 --mem-fraction-static 0.778 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 16 --disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 60000 --moe-a2a-backend ascend_fuseep --deepep-mode normal \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--dp-size 4 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=600
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
export TASK_QUEUE_ENABLE=1
export SGLANG_NPU_FUSED_MOE_MODE=1
export SGLANG_LM_HEAD_TP=8
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 \
--mem-fraction-static 0.82 --max-running-requests 1024 --attention-backend ascend --device npu --quantization modelslim \
--moe-a2a-backend ascend_fuseep --enable-dp-attention --deepep-mode low_latency --moe-dense-tp 1 \
--cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \
--load-balance-method round_robin
NODE_RANK=$i
break
fi
done
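The script above picks which server to start by comparing the first two addresses reported by `hostname -I` with the entries of `P_IP` and `D_IP`. The following is a minimal sketch of that matching step as a standalone function, so the mapping can be dry-run before anything is launched. The function name `find_role` and the sample addresses are hypothetical and not part of SGLang; lists are passed as space-separated strings rather than bash arrays.

```shell
#!/bin/sh
# find_role LOCAL_IPS PREFILL_IPS DECODE_IPS
# Each argument is a space-separated list of addresses.
# Prints "<role> <index>" for the first match, or "none" if nothing matches.
find_role() {
    local_ips=$1
    prefill_ips=$2
    decode_ips=$3

    for role in prefill decode; do
        if [ "$role" = prefill ]; then list=$prefill_ips; else list=$decode_ips; fi
        idx=0
        for ip in $list; do
            for mine in $local_ips; do
                if [ "$mine" = "$ip" ]; then
                    printf '%s %s\n' "$role" "$idx"
                    return 0
                fi
            done
            idx=$((idx + 1))
        done
    done
    printf 'none\n'
    return 1
}

# Made-up addresses: this host is the second decode node, so the index is 1,
# which is the value the decode loop passes as --node-rank.
find_role "10.0.0.4 192.168.1.4" "10.0.0.1 10.0.0.2" "10.0.0.3 10.0.0.4"
```

For a prefill match the index is not a node rank: each prefill instance runs with `--nnodes 1 --node-rank 0`, and the index only offsets the bootstrap port (`8998+$i`).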
Command
export SGLANG_DP_ROUND_ROBIN=1
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://P_IP1:8000 8998 \
--prefill http://P_IP2:8000 8999 \
--decode http://D_IP:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
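Each `--prefill` entry in the router command pairs a prefill server URL with that server's bootstrap port, which the prefill launch loop sets to `8998` plus the server's index in `P_IP`. As a rough sketch, the entries can be generated from the same ordered list of addresses. The helper `router_args` and the addresses are hypothetical, and the single decode URL is assumed to be the first decode node.

```shell
#!/bin/sh
# router_args PREFILL_IPS DECODE_IP
# PREFILL_IPS is a space-separated list in the same order as P_IP.
# Prints one router argument group per line.
router_args() {
    idx=0
    for ip in $1; do
        # Prefill instance idx serves HTTP on 8000 and bootstraps on 8998+idx.
        printf '%s\n' "--prefill http://${ip}:8000 $((8998 + idx))"
        idx=$((idx + 1))
    done
    printf '%s\n' "--decode http://$2:8001"
}

# Two prefill nodes and one decode entry point (made-up addresses).
router_args "10.0.0.1 10.0.0.2" "10.0.0.3"
```

Keeping the list order identical to `P_IP` matters, because the bootstrap port is derived from the position in the list.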
Benchmark
We tested it on the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 1024 --random-input-len 3584 --random-output-len 1536 --num-prompts 7168 --random-range-ratio 1 --request-rate 40
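The decode server in this configuration is started with `--context-length 8192`, so each benchmark request (prompt plus generated tokens) has to fit inside that window. The sketch below is a simple pre-flight check for a chosen pair of lengths. `fits_context` is a hypothetical helper, and it ignores any extra slots that speculative decoding may reserve.

```shell
#!/bin/sh
# fits_context INPUT_LEN OUTPUT_LEN CONTEXT_LEN
# Succeeds when the prompt plus the generated tokens fit in the context window.
fits_context() {
    total=$(( $1 + $2 ))
    if [ "$total" -le "$3" ]; then
        printf 'ok: %s tokens of %s\n' "$total" "$3"
    else
        printf 'too long: %s tokens exceed %s\n' "$total" "$3"
        return 1
    fi
}

# Lengths from the benchmark command above against the decode context length.
fits_context 3584 1536 8192
```

With the values used here the request needs 5120 tokens, which leaves headroom below 8192.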
DeepSeek-R1 2K-2K 50ms on A3 24 Cards Disaggregation Mode
Model: DeepSeek-R1
Hardware: Atlas 800I A3, 24 cards
Deploy Mode: PD Disaggregation
Dataset: random
Input/Output Length: 2K+2K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669"
P_IP=('your prefill ip1')
D_IP=('your decode ip1' 'your decode ip2')
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_BUFFSIZE=1600
export TASK_QUEUE_ENABLE=2
export SGLANG_USE_AG_AFTER_QLORA=1
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 16 --mem-fraction-static 0.8 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 20 --context-length 8192 --disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--dp-size 4 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=800
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=102
export TASK_QUEUE_ENABLE=1
export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1
export SGLANG_NPU_FUSED_MOE_MODE=1
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 \
--mem-fraction-static 0.81 --max-running-requests 1088 --attention-backend ascend --device npu --quantization modelslim \
--moe-a2a-backend ascend_fuseep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \
--cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
--speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \
--tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \
--load-balance-method round_robin
NODE_RANK=$i
break
fi
done
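In every launch command on this page `--speculative-eagle-topk` is 1 and `--speculative-num-draft-tokens` is exactly one more than `--speculative-num-steps` (for example 2 and 3 in the decode command above). The sketch below checks an argument list for that pairing before it is used. `check_spec_flags` is a hypothetical helper, and treating the pairing as a rule is an assumption drawn from the configurations listed here.

```shell
#!/bin/sh
# check_spec_flags ARGS...
# Scans launch arguments and succeeds when topk is 1 and
# --speculative-num-draft-tokens equals --speculative-num-steps + 1.
check_spec_flags() {
    steps=0; topk=0; draft=0; prev=""
    for word in "$@"; do
        # The value of a flag is the word that follows it.
        case "$prev" in
            --speculative-num-steps)        steps=$word ;;
            --speculative-eagle-topk)       topk=$word ;;
            --speculative-num-draft-tokens) draft=$word ;;
        esac
        prev=$word
    done
    if [ "$topk" -eq 1 ] && [ "$draft" -eq $((steps + 1)) ]; then
        printf 'consistent: steps=%s draft=%s\n' "$steps" "$draft"
    else
        printf 'check flags: steps=%s topk=%s draft=%s\n' "$steps" "$topk" "$draft"
        return 1
    fi
}

# Speculative flags taken from the decode command above.
check_spec_flags --speculative-algorithm NEXTN --speculative-num-steps 2 \
    --speculative-eagle-topk 1 --speculative-num-draft-tokens 3
```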
Command
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://P_IP:8000 8998 \
--decode http://D_IP:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
Benchmark
We tested it on the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang \
--host 127.0.0.1 \
--port 6688 \
--max-concurrency 1088 \
--random-input-len 2048 \
--random-output-len 2048 \
--num-prompts 12800 \
--random-range-ratio 1 \
--request-rate 24
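The TPOT target and `--max-concurrency` together imply a ceiling on decode throughput: one request produces at most 1000 / TPOT tokens per second, and the aggregate is that rate times the number of concurrent requests. The sketch below does this arithmetic for the values in this configuration. `decode_ceiling` is a hypothetical helper, and the result is an upper bound derived from the targets, not a measured number; it ignores prefill time, queueing, and requests that finish early.

```shell
#!/bin/sh
# decode_ceiling TPOT_MS CONCURRENCY
# Prints the per-request and aggregate decode rates implied by a TPOT target.
# Uses integer arithmetic, so fractional rates are truncated.
decode_ceiling() {
    per_request=$(( 1000 / $1 ))
    aggregate=$(( $2 * 1000 / $1 ))
    printf '%s tok/s per request, %s tok/s aggregate\n' "$per_request" "$aggregate"
}

# 50 ms TPOT with 1088 concurrent requests, as in the benchmark above.
decode_ceiling 50 1088
```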
DeepSeek-R1 6K-1_6K 20ms on A3 32 Cards Disaggregation Mode
Model: DeepSeek-R1
Hardware: Atlas 800I A3, 32 cards
Deploy Mode: PD Disaggregation
Dataset: random
Input/Output Length: 6K+1.6K
TPOT: 20ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669"
P_IP=('your prefill ip1' 'your prefill ip2')
D_IP=('your decode ip1' 'your decode ip2')
MODEL_PATH=xxx
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_BUFFSIZE=1536
export TASK_QUEUE_ENABLE=2
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 16 --mem-fraction-static 0.81 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 4 --disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--dp-size 2 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=650
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=16
export TASK_QUEUE_ENABLE=1
export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 8 \
--mem-fraction-static 0.75 --max-running-requests 32 --attention-backend ascend --device npu --quantization modelslim \
--moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \
--cuda-graph-bs 2 4 6 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \
--load-balance-method round_robin
NODE_RANK=$i
break
fi
done
Command
export SGLANG_DP_ROUND_ROBIN=1
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://P_IP1:8000 8998 \
--prefill http://P_IP2:8000 8999 \
--decode http://D_IP:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
Benchmark
We tested it on the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang \
--host 127.0.0.1 \
--port 6688 \
--max-concurrency 32 \
--random-input-len 6000 \
--random-output-len 1600 \
--num-prompts 32 \
--random-range-ratio 1 \
--request-rate 16
DeepSeek-R1 3_9K-1K 19ms on A3 32 Cards Disaggregation Mode
Model: DeepSeek-R1
Hardware: Atlas 800I A3, 32 cards
Deploy Mode: PD Disaggregation
Dataset: random
Input/Output Length: 3.9K+1K
TPOT: 19ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669"
P_IP=('your prefill ip1' 'your prefill ip2')
D_IP=('your decode ip1' 'your decode ip2')
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_BUFFSIZE=1536
export TASK_QUEUE_ENABLE=2
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 16 --mem-fraction-static 0.81 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 4 --context-length 8192 --disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--dp-size 2 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=650
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=12
export TASK_QUEUE_ENABLE=1
export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 16 \
--mem-fraction-static 0.75 --max-running-requests 32 --attention-backend ascend --device npu --quantization modelslim \
--moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \
--cuda-graph-bs 2 4 6 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \
--load-balance-method round_robin
NODE_RANK=$i
break
fi
done
Command
export SGLANG_DP_ROUND_ROBIN=1
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://P_IP1:8000 8998 \
--prefill http://P_IP2:8000 8999 \
--decode http://D_IP:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
Benchmark
We tested it on the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang \
--host 127.0.0.1 \
--port 6688 \
--max-concurrency 32 \
--random-input-len 3900 \
--random-output-len 1024 \
--num-prompts 32 \
--random-range-ratio 1 \
--request-rate 16
DeepSeek-R1 3_5K-1_5K 19ms on A3 32 Cards Disaggregation Mode
Model: DeepSeek-R1
Hardware: Atlas 800I A3, 32 cards
Deploy Mode: PD Disaggregation
Dataset: random
Input/Output Length: 3.5K+1.5K
TPOT: 19ms
Model Deployment
Please refer to the deployment commands under "DeepSeek-R1 3_9K-1K 19ms on A3 32 Cards Disaggregation Mode"; this configuration uses the same ones.
Benchmark
We tested it on the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang \
--host 127.0.0.1 \
--port 6688 \
--max-concurrency 32 \
--random-input-len 3500 \
--random-output-len 1500 \
--num-prompts 32 \
--random-range-ratio 1 \
--request-rate 16
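The benchmark commands on this page do not expand the "K" in the dataset label the same way everywhere: this command uses 3500 and 1500 for 3.5K+1.5K, while the 50ms, 32-card configuration uses 3584 and 1536 for the same label. The sketch below converts a label under either reading, so the intended lengths can be made explicit when reproducing a row. `k_to_tokens` is a hypothetical helper, and which reading applies to a given row has to be taken from that row's own benchmark command.

```shell
#!/bin/sh
# k_to_tokens LABEL UNIT
# LABEL is a size such as "3.5K"; UNIT is 1000 (decimal) or 1024 (binary).
k_to_tokens() {
    awk -v label="$1" -v unit="$2" 'BEGIN {
        sub(/[Kk]$/, "", label)                 # drop the trailing K
        printf "%d\n", int(label * unit + 0.5)  # round to the nearest token
    }'
}

k_to_tokens 3.5K 1000   # decimal reading, as in the command above
k_to_tokens 3.5K 1024   # binary reading
```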
DeepSeek-R1 3_5K-1K 19ms on A3 32 Cards Disaggregation Mode
Model: DeepSeek-R1
Hardware: Atlas 800I A3, 32 cards
Deploy Mode: PD Disaggregation
Dataset: random
Input/Output Length: 3.5K+1K
TPOT: 19ms
Model Deployment
Please refer to the deployment commands under "DeepSeek-R1 3_9K-1K 19ms on A3 32 Cards Disaggregation Mode"; this configuration uses the same ones.
Benchmark
We tested it on the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang \
--host 127.0.0.1 \
--port 6688 \
--max-concurrency 32 \
--random-input-len 3500 \
--random-output-len 1024 \
--num-prompts 32 \
--random-range-ratio 1 \
--request-rate 16
DeepSeek-R1 2K-2K 50ms on A3 8 Cards Mixed Mode
Model: DeepSeek-R1
Hardware: Atlas 800I A3, 8 cards
Deploy Mode: PD Mixed
Dataset: random
Input/Output Length: 2K+2K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=88
export HCCL_BUFFSIZE=1600
export DEEPEP_NORMAL_LONG_SEQ_ROUND=10
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=512
MODEL_PATH=xxx
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_USE_FIA_NZ=1
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--tp 16 \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--quantization modelslim \
--watchdog-timeout 9000 \
--host 127.0.0.1 --port 6699 \
--cuda-graph-bs 4 8 20 21 22 \
--mem-fraction-static 0.78 \
--max-running-requests 352 \
--disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 1500 \
--moe-a2a-backend deepep --deepep-mode auto \
--enable-dp-attention --dp-size 16 --enable-dp-lm-head \
--speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \
--dtype bfloat16
Benchmark
We tested it on the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 352 --random-input-len 2048 --random-output-len 2048 --num-prompts 1408 --random-range-ratio 1
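In this configuration `--max-running-requests 352` divided by `--dp-size 16` gives 22, which is also the largest value in `--cuda-graph-bs`. The other configurations on this page follow the same pattern. The sketch below reproduces that arithmetic. `per_rank_batch` is a hypothetical helper, and reading the flags this way (requests spread evenly across DP ranks, with the graph batch sizes applying per rank) is an assumption rather than something the page states.

```shell
#!/bin/sh
# per_rank_batch MAX_RUNNING_REQUESTS DP_SIZE GRAPH_BS...
# Prints the requests per DP rank and whether the largest captured
# graph batch size is at least that large.
per_rank_batch() {
    per_rank=$(( $1 / $2 ))   # integer division, assumes an even spread
    shift 2
    largest=0
    for bs in "$@"; do
        if [ "$bs" -gt "$largest" ]; then largest=$bs; fi
    done
    if [ "$largest" -ge "$per_rank" ]; then
        verdict=covered
    else
        verdict="not covered"
    fi
    printf '%s requests per DP rank, largest graph batch %s: %s\n' \
        "$per_rank" "$largest" "$verdict"
}

# Values from the launch command above.
per_rank_batch 352 16 4 8 20 21 22
```

When changing `--max-running-requests` or `--dp-size`, the same check gives a quick hint on whether the `--cuda-graph-bs` list should be extended.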
DeepSeek-R1 2K-2K 50ms on A3 16 Cards Disaggregation Mode
Model: DeepSeek-R1
Hardware: Atlas 800I A3, 16 cards
Deploy Mode: PD Disaggregation
Dataset: random
Input/Output Length: 2K+2K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667"
P_IP=('your prefill ip1')
D_IP=('your decode ip1')
MODEL_PATH=xxx
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export ENABLE_MOE_NZ=1
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_BUFFSIZE=2600
export TASK_QUEUE_ENABLE=2
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 16 --mem-fraction-static 0.7 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 32 --context-length 8192 --disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 10240 --moe-a2a-backend deepep --deepep-mode normal \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--dp-size 8 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=900
export SGLANG_DP_ROUND_ROBIN=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=112
export TASK_QUEUE_ENABLE=1
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8001 --trust-remote-code --nnodes 1 --node-rank 0 --tp-size 16 --dp-size 16 \
--mem-fraction-static 0.8 --max-running-requests 448 --attention-backend ascend --device npu --quantization modelslim \
--moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head \
--cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--disable-shared-experts-fusion --dtype bfloat16 --tokenizer-worker-num 4 \
--load-balance-method round_robin
NODE_RANK=$i
break
fi
done
Command
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://P_IP:8000 8998 \
--decode http://D_IP:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
Benchmark
We tested it on the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 448 --random-input-len 2048 --random-output-len 2048 --num-prompts 1792 --random-range-ratio 1 --request-rate 32
DeepSeek-R1 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode
Model: DeepSeek-R1
Hardware: Atlas 800I A3, 8 cards
Deploy Mode: PD Mixed
Dataset: random
Input/Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=56
export HCCL_BUFFSIZE=1200
export DEEPEP_NORMAL_LONG_SEQ_ROUND=10
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=512
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_USE_FIA_NZ=1
MODEL_PATH=xxx
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--tp 16 \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--quantization modelslim \
--watchdog-timeout 9000 \
--host 127.0.0.1 --port 6699 \
--cuda-graph-bs 4 8 12 14 \
--mem-fraction-static 0.77 \
--max-running-requests 224 \
--context-length 8188 --disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 3000 \
--moe-a2a-backend deepep --deepep-mode auto \
--enable-dp-attention --dp-size 16 --enable-dp-lm-head \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 224 --random-input-len 3500 --random-output-len 1500 --num-prompts 896 --random-range-ratio 1
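The launch command above sets `--context-length 8188`, and the benchmark sends 3500-token prompts with 1500-token outputs. A minimal sketch of a pre-flight check, assuming the context length bounds the sum of prompt and generated tokens; the function name `fits_context` is hypothetical, and any extra headroom needed for speculative draft tokens is ignored here:

```shell
# Hypothetical helper: succeed when prompt + output tokens fit in the context.
# $1 = input length, $2 = output length, $3 = --context-length value.
fits_context() {
  [ $(( $1 + $2 )) -le "$3" ]
}

if fits_context 3500 1500 8188; then
  echo "3500 + 1500 tokens fit within 8188"
else
  echo "request would exceed the context length" >&2
fi
```

Running a check like this before a long benchmark avoids discovering a length mismatch only after the server has loaded the model.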
DeepSeek-R1 3_5K-1_5K 50ms on A3 16 Cards Disaggregation Mode
Model: DeepSeek-R1
Hardware: Atlas 800I A3 (16 cards)
Deploy Mode: PD Disaggregation
Dataset: random
Input/Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667"
P_IP=('your prefill ip1')
D_IP=('your decode ip1')
MODEL_PATH=xxx
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export ENABLE_MOE_NZ=1
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_BUFFSIZE=3500
export TASK_QUEUE_ENABLE=2
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 16 --mem-fraction-static 0.62 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 32 --context-length 8192 --disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 20480 --moe-a2a-backend deepep --deepep-mode normal \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--dp-size 8 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=800
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=78
export TASK_QUEUE_ENABLE=1
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8001 --trust-remote-code --nnodes 1 --node-rank 0 --tp-size 16 --dp-size 16 \
--mem-fraction-static 0.805 --max-running-requests 416 --attention-backend ascend --device npu --quantization modelslim \
--moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head \
--cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
--speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \
--disable-shared-experts-fusion --dtype bfloat16 --tokenizer-worker-num 4 \
--load-balance-method round_robin
NODE_RANK=$i
break
fi
done
Command
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://P_IP:8000 8998 \
--decode http://D_IP:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 416 --random-input-len 3500 --random-output-len 1500 --num-prompts 1664 --random-range-ratio 1
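The prefill and decode loops in the deployment script above select a node's role by comparing the first two addresses from `hostname -I` with the entries of `P_IP` and `D_IP`. The same lookup can be sketched as a small function; the name `find_rank` and the addresses in the example are made up for illustration:

```shell
# Hypothetical helper: print the index of the first address in the list
# ($3 ...) that equals either local address ($1 or $2); return 1 if none match.
find_rank() {
  host1=$1
  host2=$2
  shift 2
  i=0
  for ip in "$@"; do
    if [ -n "$ip" ] && { [ "$ip" = "$host1" ] || [ "$ip" = "$host2" ]; }; then
      echo "$i"
      return 0
    fi
    i=$((i + 1))
  done
  return 1
}

# Example with made-up addresses: the second list entry matches, so this prints 1.
find_rank 10.0.0.12 192.168.1.12 10.0.0.11 10.0.0.12
```

In the scripts above, the equivalent call would pass `"$LOCAL_HOST1" "$LOCAL_HOST2" "${P_IP[@]}"`, and the printed index corresponds to the loop variable `$i`.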
DeepSeek-V3.2 128K-1K 26ms on A3 32 Cards Disaggregation Mode
Model: DeepSeek-V3.2-W8A8
Hardware: Atlas 800I A3 (32 cards)
Deploy Mode: PD Disaggregation
Dataset: random
Input/Output Length: 128K+1K
TPOT: 26ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH}
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24670"
P_IP=('your prefill ip1' 'your prefill ip2')
D_IP=('your decode ip1' 'your decode ip2')
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_BUFFSIZE=1200
export TASK_QUEUE_ENABLE=2
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--tp 32 \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--watchdog-timeout 9000 \
--host ${P_IP[$i]} --port 8000 \
--mem-fraction-static 0.73 \
--disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 68000 \
--max-running-requests 1 \
--moe-a2a-backend deepep --deepep-mode normal \
--quantization modelslim \
--disaggregation-transfer-backend ascend \
--disaggregation-mode prefill \
--disable-cuda-graph \
--nnodes 2 --node-rank $i \
--disaggregation-bootstrap-port 8995 \
--moe-dense-tp-size 1 \
--enable-dsa-prefill-context-parallel \
--dsa-prefill-cp-mode in-seq-split \
--attn-cp-size 32 \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--dist-init-addr ${P_IP[0]}:10000
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export TASK_QUEUE_ENABLE=0
export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
DP=8
export HCCL_BUFFSIZE=400
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=8
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--tp 32 \
--dp ${DP} \
--ep 32 \
--moe-dense-tp-size 1 \
--enable-dp-attention \
--enable-dp-lm-head \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--watchdog-timeout 9000 \
--host ${D_IP[$i]} --port 8001 \
--mem-fraction-static 0.79 \
--disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 68000 \
--max-running-requests 32 \
--cuda-graph-max-bs 4 \
--moe-a2a-backend deepep \
--deepep-mode low_latency \
--quantization modelslim \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--disaggregation-transfer-backend ascend \
--disaggregation-mode decode \
--nnodes 2 --node-rank $i \
--dist-init-addr ${D_IP[0]}:10000
break
fi
done
Command
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://P_IP1:8000 8995 \
--decode http://D_IP1:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 8 --random-input-len 131076 --random-output-len 1024 --num-prompts 8 --random-range-ratio 1
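In the launch commands above, every speculative-decoding configuration uses `--speculative-eagle-topk 1` and sets `--speculative-num-draft-tokens` to `--speculative-num-steps` plus one (1 and 2 on prefill, 3 and 4 on decode). A minimal sketch that generates the flags with that relation; the function name `spec_flags` is hypothetical, and the relation is an observation from this guide rather than a documented SGLang rule:

```shell
# Hypothetical helper: print speculative-decoding flags for top-k 1,
# using draft tokens = steps + 1 as in the commands in this guide.
# $1 = algorithm (for example NEXTN or EAGLE3), $2 = number of speculative steps.
spec_flags() {
  echo "--speculative-algorithm $1 --speculative-num-steps $2" \
       "--speculative-eagle-topk 1 --speculative-num-draft-tokens $(( $2 + 1 ))"
}

spec_flags NEXTN 3
```

Keeping the two values tied together in one place makes it harder to change the step count on one node and forget the draft-token count.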
Qwen3-235B-A22B 3_5K-1_5K 50ms on A3 24 Cards Disaggregation Mode
Model: Qwen3-235B-A22B-W8A8
Hardware: Atlas 800I A3 (24 cards)
Deploy Mode: PD Disaggregation
Dataset: random
Input/Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_DP_ROUND_ROBIN=1
export SGLANG_NPU_FUSED_MOE_MODE=2
MODEL_PATH=xxx
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667"
P_IP=('your prefill ip1')
D_IP=('your decode ip1' 'your decode ip2')
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=188416
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024
export DEEPEP_NORMAL_LONG_SEQ_ROUND=16
export HCCL_BUFFSIZE=4300
export TASK_QUEUE_ENABLE=2
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export STREAMS_PER_DEVICE=32
# Prefill
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill \
--host ${P_IP[$i]} --port 8000 --disaggregation-bootstrap-port 8995 --trust-remote-code \
--nnodes 1 --node-rank $i --tp-size 16 --dp-size 16 --mem-fraction-static 0.6 \
--disable-radix-cache \
--attention-backend ascend --device npu --quantization modelslim --disaggregation-transfer-backend ascend \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--speculative-draft-model-quantization unquant \
--max-running-requests 128 --chunked-prefill-size 94208 --max-prefill-tokens 262144 \
--enable-dp-attention \
--moe-a2a-backend ascend_fuseep --dtype bfloat16
NODE_RANK=$i
break
fi
done
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export DP_ROUND_ROBIN=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=65536
export HCCL_BUFFSIZE=800
export HCCL_SOCKET_IFNAME=data0.3001
export GLOO_SOCKET_IFNAME=data0.3001
export STREAMS_PER_DEVICE=32
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode \
--host ${D_IP[$i]} --port 8001 --trust-remote-code \
--nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --mem-fraction-static 0.83 --max-running-requests 768 \
--attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \
--moe-a2a-backend ascend_fuseep --cuda-graph-bs 6 8 12 15 18 20 22 24 \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-draft-model-quantization unquant \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--dist-init-addr xxx:5000 \
--disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
--enable-dp-lm-head --dtype bfloat16 --tokenizer-worker-num 4 \
--load-balance-method round_robin
NODE_RANK=$i
break
fi
done
Command
export SGLANG_DP_ROUND_ROBIN=1
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://PIP:8000 8995 \
--decode http://DIP:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang-oai --host 127.0.0.1 --port 6688 --max-concurrency 860 --random-input-len 3500 --random-output-len 1500 --num-prompts 3440 --random-range-ratio 1
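Loading a model of this size can take a long time, so it can help to wait until the endpoint answers before starting `sglang.bench_serving`. A minimal sketch of a generic retry loop; the function name `wait_until_ready` is hypothetical, and the `/health` path in the usage comment is an assumption that should be checked against your SGLang and router versions:

```shell
# Hypothetical helper: retry a probe command until it succeeds.
# $1 = max attempts, $2 = seconds between attempts, remaining args = probe command.
wait_until_ready() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 1
}

# Demonstration with a probe that always succeeds.
wait_until_ready 3 0 true && echo "ready"

# Assumed usage against the router started above (endpoint path is an assumption):
# wait_until_ready 120 10 curl -sf http://127.0.0.1:6688/health
```

Passing the probe as arguments keeps the loop independent of any particular endpoint, so the same function works for the prefill node, the decode node, and the router.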
Qwen3-235B-A22B 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode
Model: Qwen3-235B-A22B-W8A8
Hardware: Atlas 800I A3 (8 cards)
Deploy Mode: PD Mixed
Dataset: random
Input/Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=570
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=188416
export SGLANG_NPU_FUSED_MOE_MODE=2
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 432 --context-length 8192 --dtype bfloat16 \
--chunked-prefill-size 94208 --max-prefill-tokens 458880 --sampling-backend ascend \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--disable-radix-cache --moe-a2a-backend ascend_fuseep --speculative-draft-model-quantization unquant \
--tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.8 --cuda-graph-bs 1 2 4 8 16 20 24 26 27
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 272 --random-input-len 3500 --random-output-len 1500 --num-prompts 1088 --random-range-ratio 1
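In the launch command above, `--max-running-requests 432` divided by `--dp-size 16` gives 27, which is also the largest value in `--cuda-graph-bs`. The other DP-attention configurations in this guide follow the same relation, which suggests the graph batch sizes are chosen per DP rank. A minimal sketch of that arithmetic; the function name `per_rank_batch` is hypothetical, and the relation is an observation from these commands rather than a documented rule:

```shell
# Hypothetical helper: running requests per data-parallel rank (integer division).
# $1 = --max-running-requests, $2 = --dp-size.
per_rank_batch() {
  echo $(( $1 / $2 ))
}

per_rank_batch 432 16
```

When changing `--max-running-requests` or `--dp-size`, the result is a reasonable starting point for the largest `--cuda-graph-bs` entry.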
Qwen3-235B-A22B 2K-2K 50ms on A3 8 Cards Mixed Mode
Model: Qwen3-235B-A22B-W8A8
Hardware: Atlas 800I A3 (8 cards)
Deploy Mode: PD Mixed
Dataset: random
Input/Output Length: 2K+2K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=450
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=147456
export SGLANG_NPU_FUSED_MOE_MODE=2
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 624 --context-length 8192 --dtype bfloat16 \
--chunked-prefill-size 73728 --max-prefill-tokens 458880 --speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--disable-radix-cache --moe-a2a-backend ascend_fuseep \
--tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.83 --cuda-graph-bs 4 8 16 24 28 29 30 32 34 36 37 38 39
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 480 --random-input-len 2048 --random-output-len 2048 --num-prompts 480 --random-range-ratio 1
Qwen3-235B-A22B 2K-2K 50ms on A3 16 Cards Mixed Mode
Model: Qwen3-235B-A22B-W8A8
Hardware: Atlas 800I A3 (16 cards)
Deploy Mode: PD Mixed
Dataset: random
Input/Output Length: 2K+2K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=1600
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
export HCCL_OP_EXPANSION_MODE="AIV"
MIX_IP=('IP1' 'IP2')
for i in "${!MIX_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${MIX_IP[$i]}" || "$LOCAL_HOST2" == "${MIX_IP[$i]}" ]];
then
echo "${MIX_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path ${MODEL_PATH} \
--host 127.0.0.1 --port 7439 --trust-remote-code \
--nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --mem-fraction-static 0.8 --max-running-requests 768 \
--attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \
--moe-a2a-backend deepep --deepep-mode auto --cuda-graph-bs 6 8 10 12 18 24 \
--dist-init-addr ${MIX_IP[0]}:5000 --chunked-prefill-size 131072 --max-prefill-tokens 458880 \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx --speculative-draft-model-quantization unquant \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--context-length 8192 --disable-radix-cache \
--enable-dp-lm-head --dtype bfloat16
NODE_RANK=$i
break
fi
done
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 768 --random-input-len 2000 --random-output-len 2000 --num-prompts 768 --random-range-ratio 1
Qwen3-235B-A22B 11K-1K 10ms on A3 8 Cards Mixed Mode
Model: Qwen3-235B-A22B-W8A8
Hardware: Atlas 800I A3 (8 cards)
Deploy Mode: PD Mixed
Dataset: random
Input/Output Length: 11K+1K
TPOT: 10ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=1600
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 1 --dtype bfloat16 \
--chunked-prefill-size -1 --max-prefill-tokens 16384 --speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--disable-radix-cache --enable-dp-lm-head \
--tp 16 --mem-fraction-static 0.78 --cuda-graph-bs 1
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 1 --random-input-len 11000 --random-output-len 1000 --num-prompts 1 --random-range-ratio 1
Qwen3-32B 6K-1_5K 18ms on A3 4 Cards Mixed Mode
Model: Qwen3-32B
Hardware: Atlas 800I A3 (4 cards)
Deploy Mode: PD Mixed
Dataset: random
Input/Output Length: 6K+1.5K
TPOT: 18ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--max-running-requests 32 \
--disable-radix-cache \
--chunked-prefill-size 24576 --max-prefill-tokens 65536 \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 8 16 24 32 --dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 32 --random-output-len 1500 --random-input-len 6000 --num-prompts 32 --random-range-ratio 1
Qwen3-32B 4K-1_5K 11ms on A3 4 Cards Mixed Mode
Model: Qwen3-32B
Hardware: Atlas 800I A3 (4 cards)
Deploy Mode: PD Mixed
Dataset: random
Input/Output Length: 4K+1.5K
TPOT: 11ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--max-running-requests 1 \
--disable-radix-cache \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--chunked-prefill-size 24576 --max-prefill-tokens 65536 \
--tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 1 --dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 4096 --num-prompts 4
Qwen3-32B 18K-4K 6ms on A3 8 Cards Mixed Mode
Model: Qwen3-32B
Hardware: Atlas 800I A3 (8 cards)
Deploy Mode: PD Mixed
Dataset: random
Input/Output Length: 18K+4K
TPOT: 6ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--max-running-requests 1 \
--disable-radix-cache --speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--chunked-prefill-size -1 --max-prefill-tokens 65536 \
--tp-size 16 --mem-fraction-static 0.72 --cuda-graph-bs 1 --dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-input-len 18000 --random-output-len 4000 --num-prompts 1
Qwen3-32B 3_5K-1_5K 50ms on A3 2 Cards Mixed Mode
Model: Qwen3-32B
Hardware: Atlas 800I A3 (2 cards)
Deploy Mode: PD Mixed
Dataset: random
Input/Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 78 \
--disable-radix-cache --speculative-draft-model-quantization unquant \
--chunked-prefill-size -1 --max-prefill-tokens 49152 \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--tp-size 4 --mem-fraction-static 0.72 --cuda-graph-bs 16 32 64 68 72 78 --dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 78 --random-output-len 1500 --random-input-len 3500 --num-prompts 312 --random-range-ratio 1
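The TPOT target and the concurrency together give a rough ceiling on aggregate decode throughput: at 50 ms per output token each request produces about 20 tokens per second, so 78 concurrent requests produce at most about 1560 tokens per second. A minimal sketch of that estimate, assuming every slot is decoding continuously and ignoring prefill time; the function name `decode_tps` is hypothetical:

```shell
# Hypothetical helper: upper bound on aggregate decode tokens per second.
# $1 = concurrent requests, $2 = TPOT in milliseconds (integer arithmetic).
decode_tps() {
  echo $(( $1 * 1000 / $2 ))
}

decode_tps 78 50
```

Comparing this bound with the output throughput reported by `sglang.bench_serving` gives a quick sense of how much time is spent outside steady-state decoding.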
Qwen3-32B 2K-2K 50ms on A3 2 Cards Mixed Mode
Model: Qwen3-32B
Hardware: Atlas 800I A3 (2 cards)
Deploy Mode: PD Mixed
Dataset: random
Input/Output Length: 2K+2K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 120 \
--disable-radix-cache --speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--chunked-prefill-size -1 --max-prefill-tokens 49152 \
--tp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 54 60 66 72 78 84 90 108 114 120 --dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 120 --random-output-len 2000 --random-input-len 2000 --num-prompts 480 --random-range-ratio 1
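The throughput benchmarks shown so far set --num-prompts to four times --max-concurrency (312 for 78, 480 for 120). A minimal sketch of that arithmetic, assuming a fixed number of rounds per concurrency slot; the helper name `num_prompts` and its `rounds` parameter are hypothetical and not part of `sglang.bench_serving`:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of sglang): prompts = concurrency * rounds.
num_prompts() {
    local concurrency=$1
    local rounds=${2:-4}   # assumption: 4 rounds, as in 78 -> 312 and 120 -> 480
    echo $(( concurrency * rounds ))
}

# Print the benchmark command for this configuration instead of running it.
CONCURRENCY=120
echo "python3 -m sglang.bench_serving --dataset-name random --backend sglang" \
     "--host 127.0.0.1 --port 7239 --max-concurrency ${CONCURRENCY}" \
     "--random-input-len 2000 --random-output-len 2000" \
     "--num-prompts $(num_prompts "${CONCURRENCY}") --random-range-ratio 1"
```

Passing a second argument of 1 reproduces the low-latency style runs, where the prompt count equals the concurrency.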
Qwen3-30B-A3B 3_5K-1_5K 50ms on A3 1 Card Mixed Mode
Model: Qwen3-30B-A3B-Instruct-2507
Hardware: Atlas 800I A3 1Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_SET_CPU_AFFINITY=1
export ASCEND_LAUNCH_BLOCKING=0
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 162 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--chunked-prefill-size -1 --max-prefill-tokens 35000 \
--tp-size 2 --mem-fraction-static 0.87 --cuda-graph-bs 1 5 15 40 70 100 120 130 140 146 150 154 156 158 160 162 \
--dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 156 --random-input-len 3500 --random-output-len 1500 --num-prompts 624 --random-range-ratio 1
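A benchmark started before the server has finished loading may fail to connect. A minimal sketch of a readiness poll, assuming the server answers HTTP GET on a `/health` path and that `curl` is installed; the function name `wait_for_server` and the timeout values are illustrative:

```shell
#!/usr/bin/env bash
# Illustrative helper: poll a URL until it answers or a timeout expires.
# Assumption: the server exposes a health path such as /health.
wait_for_server() {
    local url=$1
    local timeout=${2:-600}    # seconds to wait in total
    local interval=${3:-5}     # seconds between attempts
    local waited=0
    until curl -sf -o /dev/null "$url"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "server at $url not ready after ${timeout}s" >&2
            return 1
        fi
        sleep "$interval"
        waited=$(( waited + interval ))
    done
    echo "server at $url is ready"
}

# Usage (host and port taken from the launch command in this configuration):
#   wait_for_server "http://127.0.0.1:7239/health"
```

The function only reports readiness; it does not start the benchmark itself, so it can be chained in front of any of the benchmark commands.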
Qwen3-Coder-480B-A35B-Instruct 3_5K-1_5K 50ms on A3 24 Cards Disaggregation Mode
Model: Qwen3-Coder-480B-A35B-Instruct
Hardware: Atlas 800I A3 24Card
DeployMode: PD Disaggregation
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_NPU_FUSED_MOE_MODE=2
MODEL_PATH=xxx
export ASCEND_MF_STORE_URL="tcp://PIP:24667"
P_IP=('PIP')
D_IP=('DIP1' 'DIP2')
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        source /usr/local/Ascend/ascend-toolkit/set_env.sh
        source /usr/local/Ascend/nnal/atb/set_env.sh
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=327680
        export HCCL_BUFFSIZE=1550
        export TASK_QUEUE_ENABLE=2
        export HCCL_SOCKET_IFNAME=lo
        export GLOO_SOCKET_IFNAME=lo
        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill \
            --host ${P_IP[$i]} --port 8000 --disaggregation-bootstrap-port 8995 --trust-remote-code \
            --nnodes 1 --node-rank $i --tp-size 16 --dp-size 2 --mem-fraction-static 0.7 \
            --disable-radix-cache \
            --attention-backend ascend --device npu --quantization modelslim --disaggregation-transfer-backend ascend \
            --max-running-requests 16 --chunked-prefill-size 20480 --max-prefill-tokens 20480 \
            --enable-dp-attention \
            --moe-a2a-backend ascend_fuseep --dtype bfloat16 \
            --disable-overlap-schedule
        NODE_RANK=$i
        break
    fi
done
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        source /usr/local/Ascend/ascend-toolkit/set_env.sh
        source /usr/local/Ascend/nnal/atb/set_env.sh
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=65536
        export HCCL_BUFFSIZE=600
        export SGLANG_NPU_FUSED_MOE_MODE=2
        export HCCL_SOCKET_IFNAME=xxx
        export GLOO_SOCKET_IFNAME=xxx
        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode \
            --host ${D_IP[$i]} --port 8001 --trust-remote-code \
            --nnodes 2 --node-rank $i --tp-size 32 --dp-size 4 --mem-fraction-static 0.75 --max-running-requests 544 \
            --attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \
            --moe-a2a-backend ascend_fuseep --cuda-graph-bs 16 32 56 72 80 88 96 104 112 120 128 136 \
            --dist-init-addr DIP1:5000 \
            --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
            --enable-dp-lm-head --dtype bfloat16 --tokenizer-worker-num 4 --load-balance-method round_robin
        NODE_RANK=$i
        break
    fi
done
Command
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://PIP:8000 8995 \
--decode http://DIP:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 410 --random-input-len 3500 --random-output-len 1500 --num-prompts 1640 --random-range-ratio 1 --request-rate 8
Qwen3-Coder-480B-A35B-Instruct 3_5K-1_5K 50ms on A3 16 Cards Mixed Mode
Model: Qwen3-Coder-480B-A35B-Instruct
Hardware: Atlas 800I A3 16Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=72
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=1800
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
export HCCL_OP_EXPANSION_MODE="AIV"
MIX_IP=('IP1' 'IP2')
for i in "${!MIX_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${MIX_IP[$i]}" || "$LOCAL_HOST2" == "${MIX_IP[$i]}" ]];
    then
        echo "${MIX_IP[$i]}"
        python -m sglang.launch_server --model-path $MODEL_PATH \
            --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 2 --node-rank $i \
            --dist-init-addr IP1:5000 \
            --attention-backend ascend --device npu --quantization modelslim \
            --max-running-requests 288 --context-length 8192 --dtype bfloat16 \
            --chunked-prefill-size 114688 --max-prefill-tokens 458880 \
            --disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto \
            --tp 32 --dp-size 4 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.7 --cuda-graph-bs 56 64 72
        NODE_RANK=$i
        break
    fi
done
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 288 --random-input-len 3500 --random-output-len 1500 --num-prompts 1152 --random-range-ratio 1 --request-rate 20
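The launch script for this configuration derives --node-rank by comparing the first two addresses from `hostname -I` with the entries of MIX_IP. The same lookup can be sketched as a function; `find_node_rank` is a hypothetical name, the addresses in the example are made up, and unlike the script it checks every local address rather than only the first two:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the index of the first list entry that matches
# one of the local addresses; return non-zero when nothing matches.
find_node_rank() {
    local local_ips=$1   # space-separated addresses, e.g. "$(hostname -I)"
    shift
    local rank=0
    local candidate ip
    for candidate in "$@"; do
        for ip in $local_ips; do   # unquoted on purpose: split on whitespace
            if [ "$ip" = "$candidate" ]; then
                echo "$rank"
                return 0
            fi
        done
        rank=$(( rank + 1 ))
    done
    return 1
}

# Example with made-up addresses: the local host is the second node.
find_node_rank "10.0.0.2 192.168.1.5" "10.0.0.1" "10.0.0.2"   # prints 1
```

Returning a non-zero status for an unlisted host lets a wrapper script stop early instead of silently skipping the launch.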
Qwen3-Coder-480B-A35B-Instruct 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode
Model: Qwen3-Coder-480B-A35B-Instruct
Hardware: Atlas 800I A3 8Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=2100
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 80 --context-length 8192 --dtype bfloat16 \
--chunked-prefill-size 28672 --max-prefill-tokens 458880 \
--disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto --enable-dp-attention --enable-dp-lm-head \
--tp 16 --dp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 16 20 24
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 80 --random-input-len 3500 --random-output-len 1500 --num-prompts 320 --random-range-ratio 1
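Across the Atlas 800I A3 configurations shown here, --tp-size is twice the card count (1 card with 2, 2 cards with 4, 8 cards with 16, 16 cards with 32), and the 24-card disaggregated setup splits into a 16-rank prefill instance plus a 32-rank decode instance. A small sketch of that arithmetic, assuming two NPU devices per A3 card as those numbers suggest; `a3_ranks` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Hypothetical helper. Assumption: each Atlas 800I A3 card contributes two
# NPU devices, inferred from the card counts and --tp-size values in this guide.
a3_ranks() {
    local cards=$1
    echo $(( cards * 2 ))
}

echo "8 cards  -> --tp-size $(a3_ranks 8)"
echo "16 cards -> --tp-size $(a3_ranks 16)"

# 24-card PD disaggregation: prefill (tp 16) plus decode (tp 32) ranks.
echo "prefill + decode ranks: $(( 16 + 32 )), expected for 24 cards: $(a3_ranks 24)"
```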
Qwen3-Next-80B-A3B-Instruct 3_5K-1_5K 50ms on A3 2 Cards Mixed Mode
Model: Qwen3-Next-80B-A3B-Instruct
Hardware: Atlas 800I A3 2Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
export cann_path=/usr/local/Ascend/ascend-toolkit/latest
source /usr/local/Ascend/driver/bin/setenv.bash
source ${cann_path}/../set_env.sh
source ${cann_path}/../../nnal/atb/set_env.sh
source ${cann_path}/opp/vendors/customize/bin/set_env.bash
export ASCEND_HOME_PATH=${cann_path}
source /usr/local/Ascend/8.5.0/bisheng_toolkit/set_env.sh
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
export LD_LIBRARY_PATH=/usr/local/Ascend/cann-9.0.0/opp/vendors/custom_transformer/op_api/lib:${LD_LIBRARY_PATH}
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_ALGO="level0:NA;level1:ring"
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=330
export ASCEND_USE_FIA=1
export SGLANG_NPU_USE_MULTI_STREAM=0
export SGLANG_WARMUP_TIMEOUT=3600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export FORCE_DRAFT_MODEL_NON_QUANT=1
ZBAL_HCCL_OP="allreduce,_allgather_base,allgather,broadcast,scatter,reduce_scatter,_reduce_scatter_base,alltoall_base"
export HCCL_BUFFSIZE=64
export SGLANG_ZBAL_LOCAL_MEM_SIZE=59648
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export SGLANG_ZBAL_BOOTSTRAP_URL="tcp://127.0.0.1:24669"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
export ZBAL_ENABLE_GRAPH=1
MODEL_PATH=/home/weights/Qwen3-Next-80B-A3B-Instruct-W8A8
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--page-size 128 \
--tp-size 4 \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--watchdog-timeout 9000 \
--host 127.0.0.1 --port 6699 \
--mem-fraction-static 0.75 \
--disable-radix-cache --max-prefill-tokens 14080 --context-length 26384 \
--chunked-prefill-size -1 --max-running-requests 300 \
--mamba-ssm-dtype bfloat16 \
--quantization modelslim \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \
--speculative-draft-model-path /home/weights/Qwen3-Next-80B-A3B-Instruct \
--dp-size 2 --enable-dp-attention --enable-dp-lm-head \
--moe-a2a-backend deepep --deepep-mode auto \
--cuda-graph-bs 1 2 3 4 5 6 7 8 10 12 14 16 18 20 22 24 26 28 30 32 40 44 48 52 56 60 64 72 80 88 96 104 112 120 128 136 144 150
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 300 --random-output-len 1536 --random-input-len 3584 --num-prompts 300 --random-range-ratio 1
Qwen3-32B 6K-1_5K 18ms on A2 8 Cards Mixed Mode
Model: Qwen3-32B
Hardware: Atlas 800I A2 8Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 6K+1.5K
TPOT: 18ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 32 \
--disable-radix-cache \
--chunked-prefill-size 24576 --max-prefill-tokens 65536 \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 8 16 24 32 --dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 32 --random-output-len 1500 --random-input-len 6000 --num-prompts 32 --random-range-ratio 1
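The speculative decoding configurations in this guide all use --speculative-eagle-topk 1 and set --speculative-num-draft-tokens to one more than --speculative-num-steps (3 with 4, or 4 with 5). A small sketch that builds the flag string from the step count, assuming that relation is the intended one for top-k 1; `spec_flags` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the speculative decoding flags from the step count.
# Assumption: with top-k 1, draft tokens = steps + 1, as in the commands in this guide.
spec_flags() {
    local steps=$1
    echo "--speculative-num-steps ${steps} --speculative-eagle-topk 1 --speculative-num-draft-tokens $(( steps + 1 ))"
}

# Flags matching the configuration above (4 steps, 5 draft tokens).
spec_flags 4
```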
Qwen3-32B 4K-1_5K 11ms on A2 8 Cards Mixed Mode
Model: Qwen3-32B
Hardware: Atlas 800I A2 8Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 4K+1.5K
TPOT: 11ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--max-running-requests 32 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--chunked-prefill-size -1 --max-prefill-tokens 65536 \
--tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 1 4 6 12 18 24 30 32 --dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 4096 --num-prompts 4
Qwen3-32B 1K-0_3K 12ms on A3 2 Cards Mixed Mode
Model: Qwen3-32B
Hardware: Atlas 800I A3 2Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 1K+0.3K
TPOT: 12ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 16 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--chunked-prefill-size -1 --max-prefill-tokens 16384 \
--tp-size 4 --mem-fraction-static 0.843 --cuda-graph-bs 1 4 8 16 --dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 300 --random-input-len 1024 --num-prompts 16
Qwen3-32B 6K-1_5K 17ms on A3 2 Cards Mixed Mode
Model: Qwen3-32B
Hardware: Atlas 800I A3 2Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 6K+1.5K
TPOT: 17ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 16 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--chunked-prefill-size -1 --max-prefill-tokens 16384 \
--tp-size 4 --mem-fraction-static 0.843 --cuda-graph-bs 1 4 10 15 16 --dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16
Qwen3-8B 1K-0_3K 7ms on A3 1 Card Mixed Mode
Model: Qwen3-8B
Hardware: Atlas 800I A3 1Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 1K+0.3K
TPOT: 7ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 16 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--chunked-prefill-size -1 --max-prefill-tokens 16384 \
--tp-size 2 --mem-fraction-static 0.894 --cuda-graph-bs 1 2 4 6 9 10 15 16 --dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 300 --random-input-len 1024 --num-prompts 16
Qwen3-8B 6K-1_5K 12ms on A3 1 Card Mixed Mode
Model: Qwen3-8B
Hardware: Atlas 800I A3 1Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 6K+1.5K
TPOT: 12ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 16 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--chunked-prefill-size -1 --max-prefill-tokens 16384 \
--tp-size 2 --mem-fraction-static 0.894 --cuda-graph-bs 1 5 15 16 --dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16
Qwen3-32B 3_5K-1_5K 50ms on A2 8 Cards Mixed Mode
Model: Qwen3-32B
Hardware: Atlas 800I A2 8Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--max-running-requests 78 \
--disable-radix-cache --speculative-draft-model-quantization unquant \
--chunked-prefill-size -1 --max-prefill-tokens 65536 \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--tp-size 4 --mem-fraction-static 0.72 --cuda-graph-bs 1 4 8 16 32 64 68 72 78 --dtype bfloat16 --base-gpu-id 4
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 78 --random-output-len 1500 --random-input-len 3500 --num-prompts 312 --random-range-ratio 1
Qwen3-32B 2K-2K 50ms on A2 8 Cards Mixed Mode
Model: Qwen3-32B
Hardware: Atlas 800I A2 8Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 2K+2K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--max-running-requests 120 \
--disable-radix-cache \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \
--chunked-prefill-size -1 --max-prefill-tokens 49152 --base-gpu-id 4 \
--tp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 54 60 66 72 78 84 90 108 114 120 --dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 120 --random-output-len 2000 --random-input-len 2000 --num-prompts 120 --random-range-ratio 1
Qwen3-30B-A3B 6K-1_5K 10ms on A3 1 Card Mixed Mode
Model: Qwen3-30B-A3B
Hardware: Atlas 800I A3 1Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 6K+1.5K
TPOT: 10ms
Model Deployment
Command
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--max-running-requests 16 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--chunked-prefill-size -1 --max-prefill-tokens 35000 \
--tp-size 2 --mem-fraction-static 0.6 --cuda-graph-bs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 --dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16
Qwen3-30B-A3B 1K-0_3K 7ms on A3 1 Card Mixed Mode
Model: Qwen3-30B-A3B
Hardware: Atlas 800I A3 1Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 1K+0.3K
TPOT: 7ms
Model Deployment
Command
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--max-running-requests 8 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--chunked-prefill-size -1 --max-prefill-tokens 35000 \
--tp-size 2 --mem-fraction-static 0.7 --cuda-graph-bs 1 2 3 4 5 6 7 8 --dtype bfloat16
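The benchmark step assumes the server started above is already accepting requests, and graph capture plus warmup can take a while. A minimal sketch of a retry loop that waits before benchmarking; wait_for is a hypothetical helper, and the /health path in the usage comment is an assumption about the server's HTTP API:

```shell
# Hypothetical helper: run a probe command up to $1 times, sleeping $2 seconds
# between attempts. Returns 0 as soon as the probe succeeds, 1 otherwise.
wait_for() {
  tries=$1
  delay=$2
  shift 2
  n=0
  while [ "$n" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 1
}

# Demonstration with a probe that always succeeds.
wait_for 3 0 true && echo "probe succeeded"

# Possible use against the server above (the endpoint path is an assumption):
# wait_for 120 5 curl -sf http://127.0.0.1:7339/health
```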
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 8 --random-output-len 300 --random-input-len 1024 --num-prompts 8
Qwen3-Next 1K-0_3K 14_21ms on A3 2 Cards Mixed Mode
Model: Qwen3-Next-80B-A3B-Instruct
Hardware: Atlas 800I A3 2Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 1K+0.3K
TPOT: 14.21ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=330
export DEEPEP_NORMAL_LONG_SEQ_ROUND=5
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3000
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
export ASCEND_USE_FIA=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export SGLANG_WARMUP_TIMEOUT=3600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export FORCE_DRAFT_MODEL_NON_QUANT=1
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=2000
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--page-size 128 \
--tp-size 4 \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--watchdog-timeout 9000 \
--host 127.0.0.1 --port 6699 \
--mem-fraction-static 0.75 \
--disable-radix-cache --max-prefill-tokens 14080 --context-length 26384 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \
--chunked-prefill-size -1 --max-running-requests 312 \
--cuda-graph-bs 2 4 16 32 48 64 80 96 128 140 156 \
--mamba-ssm-dtype bfloat16 \
--base-gpu-id 0 \
--speculative-draft-model-path /home/weights/Qwen3-Next-80B-A3B-Instruct \
--moe-a2a-backend deepep --deepep-mode auto
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 16 --random-output-len 300 --random-input-len 1024 --num-prompts 16
Qwen3-Next 6K-1_5K 15_62ms on A3 2 Cards Mixed Mode
Model: Qwen3-Next-80B-A3B-Instruct
Hardware: Atlas 800I A3 2Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 6K+1.5K
TPOT: 15.62ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=330
export DEEPEP_NORMAL_LONG_SEQ_ROUND=5
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3000
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
export ASCEND_USE_FIA=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export SGLANG_WARMUP_TIMEOUT=3600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export FORCE_DRAFT_MODEL_NON_QUANT=1
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=2000
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--page-size 128 \
--tp-size 4 \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--watchdog-timeout 9000 \
--host 127.0.0.1 --port 6699 \
--mem-fraction-static 0.75 \
--disable-radix-cache --max-prefill-tokens 14080 --context-length 26384 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \
--chunked-prefill-size -1 --max-running-requests 312 \
--cuda-graph-bs 2 4 16 32 48 64 80 96 128 140 156 \
--mamba-ssm-dtype bfloat16 \
--base-gpu-id 0 \
--speculative-draft-model-path /home/weights/Qwen3-Next-80B-A3B-Instruct \
--moe-a2a-backend deepep --deepep-mode auto
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16
Qwen3-14B 3_5K-1_5K 9ms on A3 1 Cards Mixed Mode
Model: Qwen3-14B
Hardware: Atlas 800I A3 1Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 9ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export ASCEND_USE_FIA=0
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--disable-radix-cache --mem-fraction-static 0.8 \
--tp-size 1 --dp-size 1 \
--sampling-backend ascend --max-running-requests 8 \
--served-model-name Qwen3-14B \
--chunked-prefill-size -1 \
--cuda-graph-bs 8 \
--dtype bfloat16 \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--schedule-conservativeness 0.01
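The sysctl -w calls at the top of this script need root, and they fail when the running kernel does not expose a key (newer kernels may no longer provide kernel.sched_migration_cost_ns as a sysctl). A hedged sketch for reading the values back afterwards; it relies on the standard Linux mapping from sysctl keys to files under /proc/sys, and sysctl_path is a hypothetical helper:

```shell
# Map a sysctl key such as vm.swappiness to its file under /proc/sys.
sysctl_path() {
  printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Report the current value of each key tuned by the script above.
for key in vm.swappiness kernel.numa_balancing kernel.sched_migration_cost_ns; do
  path=$(sysctl_path "$key")
  if [ -r "$path" ]; then
    echo "$key = $(cat "$path")"
  else
    echo "$key is not exposed on this system" >&2
  fi
done
```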
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 8 --random-range-ratio 1
Qwen3-14B 3_5K-1_5K 50ms on A3 1 Cards Mixed Mode
Model: Qwen3-14B
Hardware: Atlas 800I A3 1Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export ASCEND_USE_FIA=0
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--disable-radix-cache --mem-fraction-static 0.89 \
--tp-size 1 --dp-size 2 \
--sampling-backend ascend --max-running-requests 144 \
--max-prefill-tokens 12288 \
--served-model-name Qwen3-14B \
--chunked-prefill-size -1 \
--cuda-graph-bs 8 16 32 44 48 50 52 \
--dtype bfloat16 \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--schedule-conservativeness 0.01
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 144 --random-output-len 1500 --random-input-len 3500 --num-prompts 576 --random-range-ratio 1
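This run sends 576 prompts at a concurrency of 144, which works out to four full waves of requests. A small sketch of that calculation, rounding up so that a partial final wave is counted; bench_waves is a hypothetical helper:

```shell
# Hypothetical helper: number of request waves, rounded up.
# $1 = number of prompts, $2 = max concurrency
bench_waves() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Values taken from the benchmark command above.
bench_waves 576 144
```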
Qwen3-8B 3_5K-1_5K 50ms on A3 1 Cards Mixed Mode
Model: Qwen3-8B
Hardware: Atlas 800I A3 1Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=50
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--disable-radix-cache --mem-fraction-static 0.9 \
--tp-size 1 \
--max-running-requests 70 \
--max-prefill-tokens 16384 \
--served-model-name Qwen3-8B \
--chunked-prefill-size 16384 \
--cuda-graph-bs 8 12 24 36 48 51 55 60 63 64 66 68 70 \
--dtype bfloat16 \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
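Every launch command in this section pairs --speculative-eagle-topk 1 with a --speculative-num-draft-tokens value equal to --speculative-num-steps plus one (3 and 4 here, 4 and 5 elsewhere). Treat that rule as an observation about these configurations rather than documented SGLang behavior. A minimal sketch that checks it before launching; spec_args_consistent is a hypothetical helper:

```shell
# Hypothetical check: with top-k 1, expect draft tokens == steps + 1.
# $1 = num steps, $2 = eagle top-k, $3 = num draft tokens
spec_args_consistent() {
  if [ "$2" -eq 1 ]; then
    [ "$3" -eq $(( $1 + 1 )) ]
  else
    # No rule is checked for other top-k values in this sketch.
    return 0
  fi
}

# Values taken from the launch command above.
if spec_args_consistent 3 1 4; then
  echo "speculative flags are consistent"
else
  echo "unexpected --speculative-num-draft-tokens" >&2
fi
```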
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 64 --random-output-len 1500 --random-input-len 3500 --num-prompts 256 --random-range-ratio 1
Qwen3-8B 3_5K-1_5K 5ms on A3 1 Cards Mixed Mode
Model: Qwen3-8B
Hardware: Atlas 800I A3 1Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 5ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--disable-radix-cache --mem-fraction-static 0.894 \
--tp-size 2 \
--max-running-requests 1 \
--max-prefill-tokens 16384 \
--served-model-name Qwen3-8B \
--chunked-prefill-size -1 \
--cuda-graph-bs 1 \
--dtype bfloat16 \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5
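The script above captures the first two addresses reported by hostname -I using backtick substitution, although the launch command itself binds to 127.0.0.1. A sketch of the same lookup with $(...) and a fallback for hosts that report fewer than two addresses; nth_field is a hypothetical helper, and the fallback values are assumptions:

```shell
# Hypothetical helper: print the Nth whitespace-separated field of a string,
# or a fallback value when that field is missing.
# $1 = text, $2 = field index (1-based), $3 = fallback
nth_field() {
  value=$(printf '%s\n' "$1" | awk -v n="$2" '{ print $n }')
  printf '%s\n' "${value:-$3}"
}

# hostname -I is Linux-specific; fall back to an empty list if it fails.
ADDRS=$(hostname -I 2>/dev/null || true)
LOCAL_HOST1=$(nth_field "$ADDRS" 1 127.0.0.1)
LOCAL_HOST2=$(nth_field "$ADDRS" 2 "$LOCAL_HOST1")
echo "${LOCAL_HOST1} ${LOCAL_HOST2}"
```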
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 4 --random-range-ratio 1
Qwen3-Next 3_5K-1_5K 20ms on A3 1 Cards Mixed Mode
Model: Qwen3-Next-80B-A3B-Instruct
Hardware: Atlas 800I A3 1Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 20ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=400
export DEEPEP_NORMAL_LONG_SEQ_ROUND=10
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048
export HCCL_OP_EXPANSION_MODE="AIV"
export TASK_QUEUE_ENABLE=1
export ASCEND_USE_FIA=1
export SGLANG_NPU_USE_MULTI_STREAM=0
export SGLANG_WARMUP_TIMEOUT=3600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export FORCE_DRAFT_MODEL_NON_QUANT=1
export HCCL_BUFFSIZE=2000
export ZBCCL_LOCAL_MEM_SIZE=60416
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export ZBCCL_BOOTSTRAP_URL=tcp://127.0.0.1:24669
export ZBCCL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
export ZBCCL_ENABLE_GRAPH=1
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--page-size 128 \
--tp-size 2 \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--watchdog-timeout 9000 \
--host 127.0.0.1 --port 6699 \
--mem-fraction-static 0.85 \
--disable-radix-cache --max-prefill-tokens 28672 --context-length 26384 --max-total-tokens 122304 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \
--chunked-prefill-size -1 --max-running-requests 2 \
--cuda-graph-bs 2 \
--mamba-ssm-dtype bfloat16 \
--speculative-draft-model-path /path/to/Qwen3-Next-80B-A3B-Instruct
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 1
Qwen3.5-27B 3_5K-1_5K 20ms on A3 2 Cards Mixed Mode
Model: Eco-Tech/Qwen3.5-27B-w8a8-mtp
Hardware: Atlas 800I A3 2Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 20ms
Model Deployment
Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
# on-demand set device
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export ASCEND_LAUNCH_BLOCKING=1
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=3000
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_NPU_PROFILING=0
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=3600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0
MODEL_PATH=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} \
--attention-backend ascend \
--host 127.0.0.1 --port 6699 \
--device npu \
--tp-size 4 \
--trust-remote-code \
--watchdog-timeout 9000 \
--chunked-prefill-size -1 \
--max-prefill-tokens 186000 \
--enable-prefill-delayer \
--prefill-delayer-max-delay-passes 200 \
--disable-radix-cache \
--mem-fraction-static 0.94 \
--max-total-tokens 700000 \
--max-running-requests 38 \
--max-mamba-cache-size 200 \
--quantization modelslim \
--dtype bfloat16 \
--mamba-ssm-dtype bfloat16 \
--enable-multimodal \
--mm-attention-backend ascend_attn \
--cuda-graph-bs 1 2 4 8 12 18 24 32 34 36 38 \
--speculative-algorithm NEXTN \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
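In this configuration ASCEND_RT_VISIBLE_DEVICES lists four devices and --tp-size is 4. A minimal sketch, assuming the two should agree whenever the device list is restricted this way, that catches a mismatch before launch; count_devices is a hypothetical helper:

```shell
# Hypothetical helper: count the entries in a comma-separated device list.
count_devices() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}

# Values taken from the script above.
VISIBLE_DEVICES=0,1,2,3
TP_SIZE=4

if [ "$(count_devices "$VISIBLE_DEVICES")" -ne "$TP_SIZE" ]; then
  echo "visible devices do not match --tp-size $TP_SIZE" >&2
else
  echo "device list matches --tp-size $TP_SIZE"
fi
```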
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 6699 --dataset-name random --max-concurrency 38 --num-prompts 152 --random-range-ratio 1 --random-output-len 1500 --random-input-len 3500
Qwen3.5-27B 16K-1K 20ms on A3 1 Cards Mixed Mode
Model: Eco-Tech/Qwen3.5-27B-w8a8-mtp
Hardware: Atlas 800I A3 1Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 16K+1K
TPOT: 20ms
Model Deployment
Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100
# on-demand set device
export ASCEND_RT_VISIBLE_DEVICES=8,9
MODEL_PATH=xxx
sglang serve --model-path ${MODEL_PATH} \
--attention-backend ascend \
--device npu \
--tp-size 2 --nnodes 1 --node-rank 0 \
--chunked-prefill-size -1 --max-prefill-tokens 65000 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 --max-running-requests 32 --max-mamba-cache-size 32 \
--mem-fraction-static 0.85 \
--port 8001 \
--cuda-graph-bs 2 3 4 5 6 \
--enable-multimodal \
--quantization modelslim \
--mm-attention-backend ascend_attn \
--dtype bfloat16 --mamba-ssm-dtype bfloat16 --max-total-tokens 310000 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8001 --dataset-name random --max-concurrency 32 --num-prompts 128 --random-range-ratio 1 --random-output-len 1000 --random-input-len 16000
Qwen3.5-27B 64K-1K 20ms on A3 1 Cards Mixed Mode
Model: Eco-Tech/Qwen3.5-27B-w8a8-mtp
Hardware: Atlas 800I A3 1Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 64K+1K
TPOT: 20ms
Model Deployment
Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_NPU_PROFILING=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0
# on-demand set device
export ASCEND_RT_VISIBLE_DEVICES=4,5
MODEL_PATH=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} \
--attention-backend ascend \
--device npu \
--tp-size 2 --nnodes 1 --node-rank 0 \
--chunked-prefill-size -1 --max-prefill-tokens 130000 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 --max-running-requests 32 --max-mamba-cache-size 18 \
--mem-fraction-static 0.5 \
--port 8004 \
--cuda-graph-bs 2 3 4 \
--enable-multimodal \
--quantization modelslim \
--mm-attention-backend ascend_attn \
--dtype bfloat16 --mamba-ssm-dtype bfloat16 --max-total-tokens 280000 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8004 --dataset-name random --max-concurrency 9 --num-prompts 36 --random-range-ratio 1 --random-output-len 1000 --random-input-len 64000
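For this 64K+1K workload each request occupies roughly 65,000 token slots once fully decoded, so --max-total-tokens 280000 bounds how many requests can be resident at once, whatever --max-concurrency the benchmark uses. A rough sketch of that estimate, ignoring speculative-draft slots, page rounding, and the separate mamba cache limit; fit_requests is a hypothetical helper:

```shell
# Hypothetical helper: how many fixed-length requests fit in the token pool.
# $1 = max total tokens, $2 = input length, $3 = output length
fit_requests() {
  echo $(( $1 / ($2 + $3) ))
}

# Values taken from the launch and benchmark commands above.
fit_requests 280000 64000 1000
```

The result, 4, happens to equal the largest --cuda-graph-bs value in the launch command.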
Qwen3.5-27B 3_5K-1_5K 50ms on A3 1 Cards Mixed Mode
Model: Eco-Tech/Qwen3.5-27B-w8a8-mtp
Hardware: Atlas 800I A3 1Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100
MODEL_PATH=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} \
--attention-backend ascend \
--device npu \
--tp-size 2 --nnodes 1 --node-rank 0 \
--chunked-prefill-size -1 --max-prefill-tokens 60000 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 --max-running-requests 48 --max-mamba-cache-size 60 \
--mem-fraction-static 0.7 \
--port 8000 \
--cuda-graph-bs 2 8 16 32 48 \
--enable-multimodal \
--quantization modelslim \
--mm-attention-backend ascend_attn \
--dtype bfloat16 --mamba-ssm-dtype bfloat16 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8000 --dataset-name random --max-concurrency 48 --num-prompts 192 --random-range-ratio 1 --random-output-len 1500 --random-input-len 3500
Qwen3.5-27B 16K-1K 50ms on A3 2 Cards Mixed Mode
Model: Eco-Tech/Qwen3.5-27B-w8a8-mtp
Hardware: Atlas 800I A3 2Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 16K+1K
TPOT: 50ms
Model Deployment
Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=30
# on-demand set device
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
MODEL_PATH=xxx
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--attention-backend ascend \
--device npu \
--tp-size 4 --nnodes 1 --node-rank 0 \
--chunked-prefill-size -1 --max-prefill-tokens 50000 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 --max-running-requests 28 --max-mamba-cache-size 50 \
--mem-fraction-static 0.7 \
--port 8001 \
--cuda-graph-bs 2 8 12 16 20 24 28 \
--enable-multimodal \
--quantization modelslim \
--mm-attention-backend ascend_attn \
--dtype bfloat16 --mamba-ssm-dtype bfloat16 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8001 --dataset-name random --max-concurrency 28 --num-prompts 152 --random-range-ratio 1 --random-output-len 1000 --random-input-len 16000
Qwen3.5-27B 64K-1K 50ms on A3 2 Cards Mixed Mode
Model: Eco-Tech/Qwen3.5-27B-w8a8-mtp
Hardware: Atlas 800I A3 2Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 64K+1K
TPOT: 50ms
Model Deployment
Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100
# on-demand set device
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
MODEL_PATH=xxx
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--attention-backend ascend \
--device npu \
--tp-size 4 --nnodes 1 --node-rank 0 \
--chunked-prefill-size -1 --max-prefill-tokens 200000 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 --max-running-requests 32 --max-mamba-cache-size 22 \
--mem-fraction-static 0.5 \
--port 9000 \
--cuda-graph-bs 2 4 8 11 12 13 \
--enable-multimodal \
--quantization modelslim \
--mm-attention-backend ascend_attn \
--dtype bfloat16 --mamba-ssm-dtype bfloat16 --max-total-tokens 850000 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 9000 --dataset-name random --max-concurrency 9 --num-prompts 36 --random-range-ratio 1 --random-output-len 1000 --random-input-len 64000
Qwen3.5-397B-A17B 3_5K-1_5K 22ms on A3 8 Cards Mixed Mode
Model: Qwen3.5-397B-A17B
Hardware: Atlas 800I A3 8Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 22ms
Model Deployment
Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export ASCEND_USE_FIA=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128
export HCCL_BUFFSIZE=3000
export DEEPEP_NORMAL_LONG_SEQ_ROUND=32
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3584
export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_ZBAL_LOCAL_MEM_SIZE=58624
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export SGLANG_ZBAL_BOOTSTRAP_URL="tcp://127.0.0.1:24669"
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
export ZBAL_ENABLE_GRAPH=1
MODEL_PATH=xxx
python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--tp-size 16 \
--chunked-prefill-size -1 --max-prefill-tokens 35000 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 --max-running-requests 160 \
--mem-fraction-static 0.8 \
--port 6699 \
--cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 \
--quantization modelslim \
--enable-multimodal --moe-a2a-backend deepep --deepep-mode auto \
--mm-attention-backend ascend_attn \
--dtype bfloat16 --mamba-ssm-dtype bfloat16 --max-total-tokens 128000 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--speculative-draft-model-quantization unquant \
--dp-size 8 --enable-dp-attention --enable-dp-lm-head \
--enable-prefill-delayer --prefill-delayer-max-delay-passes 100
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 120 --random-output-len 1500 --random-input-len 3500 --num-prompts 480
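The benchmark command above sends 480 prompts at a concurrency of 120, which is four prompts per concurrent slot. A minimal sketch of a helper that derives `--num-prompts` from the concurrency follows. `bench_cmd` and its `rounds` argument are hypothetical names, not part of SGLang, and the factor of 4 is only inferred from the numbers used here.

```shell
# Hypothetical helper: print a bench_serving command whose prompt count is
# ROUNDS times the concurrency (ROUNDS defaults to 4, an assumption).
# Args: port, max concurrency, input length, output length, [rounds]
bench_cmd() (
    port=$1; concurrency=$2; input_len=$3; output_len=$4; rounds=${5:-4}
    num_prompts=$((concurrency * rounds))
    echo "python3 -m sglang.bench_serving --dataset-name random --backend sglang" \
         "--host 127.0.0.1 --port $port --random-range-ratio 1" \
         "--max-concurrency $concurrency --random-input-len $input_len" \
         "--random-output-len $output_len --num-prompts $num_prompts"
)

# Print the command for the 3.5K+1.5K case at concurrency 120.
bench_cmd 6699 120 3500 1500
```

The function only prints the command, so it can be reviewed before it is run.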
Qwen3.5-397B-A17B 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode
Model: Qwen3.5-397B-A17B
Hardware: Atlas 800I A3 8Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export ASCEND_USE_FIA=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128
export HCCL_BUFFSIZE=3000
export DEEPEP_NORMAL_LONG_SEQ_ROUND=32
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3584
export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_ZBAL_LOCAL_MEM_SIZE=59648
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export SGLANG_ZBAL_BOOTSTRAP_URL="tcp://127.0.0.1:24669"
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
export ZBAL_ENABLE_GRAPH=1
MODEL_PATH=xxx
python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--tp-size 16 \
--chunked-prefill-size -1 --max-prefill-tokens 17500 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 --max-running-requests 432 \
--mem-fraction-static 0.75 \
--port 6699 \
--cuda-graph-bs 2 4 6 8 12 16 20 24 28 32 36 40 44 48 52 56 \
--quantization modelslim \
--enable-multimodal --moe-a2a-backend deepep --deepep-mode auto \
--mm-attention-backend ascend_attn \
--dtype bfloat16 --mamba-ssm-dtype bfloat16 --max-total-tokens 280000 \
--dp-size 8 --enable-dp-attention --enable-dp-lm-head \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--speculative-draft-model-quantization unquant \
--enable-prefill-delayer --prefill-delayer-max-delay-passes 200
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 352 --random-output-len 1500 --random-input-len 3500 --num-prompts 1408
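The launch command above combines `--max-running-requests 432`, `--dp-size 8` and a `--cuda-graph-bs` list that tops out at 56. The sketch below shows one way to sanity-check such a combination. It assumes that, with `--enable-dp-attention`, the request budget is split evenly across DP ranks and that the graph batch sizes apply per rank. That is an assumption about scheduler behavior, not something this guide states. `per_rank_requests` and `graph_covers` are hypothetical helper names.

```shell
# Assumption: each DP rank handles max_running_requests / dp_size requests.
# Args: --max-running-requests value, --dp-size value
per_rank_requests() {
    echo $(( $1 / $2 ))   # integer division
}

# Succeeds when the largest listed graph batch size is at least the
# per-rank request count.
# Args: max running requests, dp size, then the --cuda-graph-bs values
graph_covers() (
    need=$(per_rank_requests "$1" "$2"); shift 2
    largest=0
    for bs in "$@"; do
        if [ "$bs" -gt "$largest" ]; then
            largest=$bs
        fi
    done
    [ "$largest" -ge "$need" ]
)

# Values from the launch command above: 432 / 8 = 54 requests per rank.
per_rank_requests 432 8
```

Under the same assumption, the benchmark concurrency of 352 corresponds to 44 in-flight requests per rank, which also falls within the listed batch sizes.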
MiniMax-M2.5 3_5K-1_5K Low Latency on A3 8 Cards Mixed Mode
Model: MiniMax-M2.5
Hardware: Atlas 800I A3 8Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE=AIV
export TASK_QUEUE_ENABLE=1
export HCCL_BUFFSIZE=1500
export ASCEND_USE_FIA=1
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=224000
MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot
EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3
python -m sglang.launch_server \
--model-path $MODEL_PATH \
--host 127.0.0.1 \
--port 32001 \
--tp-size 16 \
--dp-size 16 \
--enable-dp-attention \
--mem-fraction-static 0.75 \
--max-running-requests 128 \
--disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 8192 \
--cuda-graph-bs 2 4 6 8 \
--moe-a2a-backend ascend_fuseep --deepep-mode auto --quantization modelslim \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path $EAGLE_MODEL_PATH \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-draft-model-quantization unquant \
--dtype bfloat16 \
--tokenizer-worker-num 2 \
--prefill-delayer-max-delay-passes 500 \
--enable-prefill-delayer
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 32001 --random-input-len 3500 --random-output-len 1500 --num-prompts 320 --random-range-ratio 1 --max-concurrency 80
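The benchmark only gives meaningful numbers once the server started by the launch command has finished loading. A minimal sketch of a readiness wait follows, assuming the server answers on SGLang's `/health` HTTP endpoint. `wait_until` is a hypothetical helper, and the retry count and delay in the example are arbitrary.

```shell
# Retry a probe command until it succeeds or the attempts run out.
# Args: max attempts, seconds between attempts, then the probe command.
wait_until() (
    attempts=$1; delay=$2; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            exit 0          # probe succeeded
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "probe failed after $attempts attempts: $*" >&2
    exit 1
)

# Example (not executed here): poll the server on port 32001 for up to
# 10 minutes before starting the benchmark.
# wait_until 120 5 curl -sf http://127.0.0.1:32001/health
```

Passing the probe as a command keeps the helper independent of the endpoint, so a different check can be swapped in without changing the loop.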
MiniMax-M2.5 128K-1K Low Latency on A3 8 Cards Mixed Mode
Model: MiniMax-M2.5
Hardware: Atlas 800I A3 8Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 128K+1K
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export TASK_QUEUE_ENABLE=1
export ASCEND_USE_FIA=1
export HCCL_BUFFSIZE=1600
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=640
export DEEPEP_NORMAL_LONG_SEQ_ROUND=64
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_NPU_DEEPEP_USE_FUSED_MOE_DECODE=1
export SGLANG_NPU_FUSEEP_DECODE_ONLY=1
MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot
EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3
python -m sglang.launch_server \
--model-path $MODEL_PATH \
--host 127.0.0.1 \
--port 32000 \
--tp-size 16 \
--dp-size 2 \
--enable-dp-attention \
--prefill-delayer-max-delay-passes 100 \
--enable-prefill-delayer \
--mem-fraction-static 0.65 \
--max-running-requests 8 \
--chunked-prefill-size -1 --max-prefill-tokens 130000 \
--cuda-graph-bs 1 2 4 \
--moe-a2a-backend ascend_fuseep --deepep-mode auto --quantization modelslim \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path $EAGLE_MODEL_PATH \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-draft-model-quantization unquant \
--dtype bfloat16 \
--trust-remote-code \
--tokenizer-worker-num 8
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 32000 --random-input-len 131072 --random-output-len 1024 --num-prompts 8 --random-range-ratio 1 --max-concurrency 2
MiniMax-M2.5 3_5K-1_5K High Throughput on A3 8 Cards Mixed Mode
Model: MiniMax-M2.5
Hardware: Atlas 800I A3 8Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE=AIV
export TASK_QUEUE_ENABLE=1
export HCCL_BUFFSIZE=800
export ASCEND_USE_FIA=1
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=204800
MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot
EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3
python -m sglang.launch_server \
--model-path $MODEL_PATH \
--host 127.0.0.1 \
--port 32001 \
--tp-size 16 \
--enable-dp-attention \
--dp-size 16 \
--mem-fraction-static 0.75 \
--max-running-requests 480 \
--disable-radix-cache \
--prefill-delayer-max-delay-passes 500 \
--enable-prefill-delayer \
--chunked-prefill-size -1 --max-prefill-tokens 8192 \
--cuda-graph-bs 8 16 24 32 48 64 80 \
--moe-a2a-backend ascend_fuseep --deepep-mode auto --quantization modelslim \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path $EAGLE_MODEL_PATH \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-draft-model-quantization unquant \
--dtype bfloat16
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 32001 --random-input-len 3500 --random-output-len 1500 --num-prompts 1280 --random-range-ratio 1 --max-concurrency 320
MiniMax-M2.5 64K-1K High Throughput on A3 8 Cards Mixed Mode
Model: MiniMax-M2.5
Hardware: Atlas 800I A3 8Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 64K+1K
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export TASK_QUEUE_ENABLE=1
export ASCEND_USE_FIA=1
export HCCL_BUFFSIZE=1600
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=640
export DEEPEP_NORMAL_LONG_SEQ_ROUND=64
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_NPU_DEEPEP_USE_FUSED_MOE_DECODE=1
export SGLANG_NPU_FUSEEP_DECODE_ONLY=1
MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot
EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3
python -m sglang.launch_server \
--model-path $MODEL_PATH \
--host 127.0.0.1 \
--port 32000 \
--tp-size 16 \
--dp-size 2 \
--enable-dp-attention \
--prefill-delayer-max-delay-passes 100 \
--enable-prefill-delayer \
--mem-fraction-static 0.65 \
--max-running-requests 72 \
--chunked-prefill-size -1 --max-prefill-tokens 180000 \
--cuda-graph-bs 8 16 24 32 40 \
--moe-a2a-backend ascend_fuseep --deepep-mode auto --quantization modelslim \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path $EAGLE_MODEL_PATH \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-draft-model-quantization unquant \
--dtype bfloat16 \
--trust-remote-code \
--tokenizer-worker-num 8
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 32000 --random-input-len 65536 --random-output-len 1024 --num-prompts 144 --random-range-ratio 1 --max-concurrency 36
MiniMax-M2.5 128K-1K High Throughput on A3 8 Cards Mixed Mode
Model: MiniMax-M2.5
Hardware: Atlas 800I A3 8Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 128K+1K
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export TASK_QUEUE_ENABLE=1
export ASCEND_USE_FIA=1
export HCCL_BUFFSIZE=1600
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=640
export DEEPEP_NORMAL_LONG_SEQ_ROUND=64
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_NPU_DEEPEP_USE_FUSED_MOE_DECODE=1
export SGLANG_NPU_FUSEEP_DECODE_ONLY=1
MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot
EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3
python -m sglang.launch_server \
--model-path $MODEL_PATH \
--host 127.0.0.1 \
--port 32000 \
--tp-size 16 \
--dp-size 2 \
--enable-dp-attention \
--prefill-delayer-max-delay-passes 100 \
--enable-prefill-delayer \
--mem-fraction-static 0.65 \
--max-running-requests 36 \
--chunked-prefill-size -1 --max-prefill-tokens 130000 \
--cuda-graph-bs 8 16 24 \
--moe-a2a-backend ascend_fuseep --deepep-mode auto --quantization modelslim \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path $EAGLE_MODEL_PATH \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-draft-model-quantization unquant \
--dtype bfloat16 \
--trust-remote-code \
--tokenizer-worker-num 8
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 32000 --random-input-len 131072 --random-output-len 1024 --num-prompts 128 --random-range-ratio 1 --max-concurrency 32
MiniMax-M2.5 64K-1K High Throughput on A3 4 Cards Mixed Mode
Model: MiniMax-M2.5
Hardware: Atlas 800I A3 4Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 64K+1K
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export TASK_QUEUE_ENABLE=1
export ASCEND_USE_FIA=0
export HCCL_BUFFSIZE=1600
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=640
export DEEPEP_NORMAL_LONG_SEQ_ROUND=64
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_NPU_DEEPEP_USE_FUSED_MOE_DECODE=1
export SGLANG_NPU_FUSEEP_DECODE_ONLY=1
MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot
EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3
python -m sglang.launch_server \
--model-path $MODEL_PATH \
--host 127.0.0.1 \
--port 32000 \
--tp-size 8 \
--enable-dp-attention \
--prefill-delayer-max-delay-passes 500 \
--enable-prefill-delayer \
--mem-fraction-static 0.65 \
--max-running-requests 36 \
--chunked-prefill-size -1 --max-prefill-tokens 150000 \
--cuda-graph-bs 8 16 24 32 40 \
--moe-a2a-backend ascend_fuseep --deepep-mode auto --quantization modelslim \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path $EAGLE_MODEL_PATH \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-draft-model-quantization unquant \
--dtype bfloat16 \
--trust-remote-code \
--tokenizer-worker-num 8
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 32000 --random-input-len 65536 --random-output-len 1024 --num-prompts 144 --random-range-ratio 1 --max-concurrency 36
MiniMax-M2.5 64K-1K High Throughput on A3 16 Cards Disaggregation Mode
Model: MiniMax-M2.5
Hardware: Atlas 800I A3 16Card
DeployMode: PD Disaggregation
Dataset: random
Input Output Length: 64K+1K
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export ASCEND_MF_STORE_URL="tcp://your_prefill_ip:24667"
P_IP=('your_prefill_ip')
D_IP=('your_decode_ip')
D_MASTER="${D_IP[0]}:8001"
MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot
EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_SOCKET_IFNAME=your_nic
export GLOO_SOCKET_IFNAME=your_nic
export ASCEND_USE_FIA=1
export HCCL_BUFFSIZE=2500
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export TASK_QUEUE_ENABLE=2
export DEEPEP_NORMAL_LONG_SEQ_ROUND=64
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 32000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 16 --mem-fraction-static 0.43 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 128 \
--chunked-prefill-size -1 --max-prefill-tokens 58000 --moe-a2a-backend deepep --deepep-mode normal \
--tokenizer-worker-num 16 \
--dp-size 2 --enable-dp-attention --dtype bfloat16 --load-balance-method round_robin \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path $EAGLE_MODEL_PATH \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-draft-model-quantization unquant --skip-server-warmup
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export HCCL_BUFFSIZE=1600
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=640
export HCCL_SOCKET_IFNAME=your_nic
export GLOO_SOCKET_IFNAME=your_nic
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS=96
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--cuda-graph-bs 8 16 24 32 40 \
--port 33000 --trust-remote-code \
--tp-size 16 --mem-fraction-static 0.76 --attention-backend ascend --device npu --quantization modelslim \
--nnodes 1 --node-rank $i --dist-init-addr $D_MASTER \
--disaggregation-transfer-backend ascend --max-running-requests 80 \
--chunked-prefill-size -1 --moe-a2a-backend ascend_fuseep --deepep-mode low_latency \
--tokenizer-worker-num 16 \
--dp-size 2 --enable-dp-attention --dtype bfloat16 \
--load-balance-method round_robin \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path $EAGLE_MODEL_PATH \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-draft-model-quantization unquant
NODE_RANK=$i
break
fi
done
Command
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy round_robin \
--prefill http://your_prefill_ip:32000 8998 \
--decode http://your_decode_ip:33000 \
--host 127.0.0.1 \
--mini-lb \
--port 6688
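The prefill/decode launch script above picks its branch by comparing the first two addresses from `hostname -I` with the entries of `P_IP` and `D_IP`. The same lookup can be sketched as a small function that prints the matching index. `find_node_rank` is a hypothetical helper and the addresses in the example are placeholders. Unlike the script, it checks every local address rather than only the first two.

```shell
# Print the zero-based position of the first node IP that matches one of
# the local addresses; fail if this host is not in the list.
# Args: space-separated local IPs, then the node IPs in rank order.
find_node_rank() (
    local_ips=$1; shift
    rank=0
    for node_ip in "$@"; do
        for ip in $local_ips; do
            if [ "$ip" = "$node_ip" ]; then
                echo "$rank"
                exit 0
            fi
        done
        rank=$((rank + 1))
    done
    exit 1
)

# Placeholder addresses: this host owns 10.0.0.2, the second node in the list.
find_node_rank "10.0.0.2 192.168.1.2" 10.0.0.1 10.0.0.2
```

On a real node the first argument would typically be `"$(hostname -I)"`, and a non-zero exit status means the host belongs to neither list.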
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --random-input-len 65536 --random-output-len 1024 --num-prompts 640 --random-range-ratio 1 --max-concurrency 160
MiniMax-M2.5 128K-1K High Throughput on A3 16 Cards Disaggregation Mode
Model: MiniMax-M2.5
Hardware: Atlas 800I A3 16Card
DeployMode: PD Disaggregation
Dataset: random
Input Output Length: 128K+1K
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export ASCEND_MF_STORE_URL="tcp://your_prefill_ip:24667"
P_IP=('your_prefill_ip')
D_IP=('your_decode_ip')
D_MASTER="${D_IP[0]}:8001"
MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot
EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_SOCKET_IFNAME=your_nic
export GLOO_SOCKET_IFNAME=your_nic
export ASCEND_USE_FIA=1
export HCCL_BUFFSIZE=2500
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export TASK_QUEUE_ENABLE=2
export DEEPEP_NORMAL_LONG_SEQ_ROUND=64
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 32000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 16 --mem-fraction-static 0.43 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 128 \
--chunked-prefill-size -1 --max-prefill-tokens 130000 --moe-a2a-backend deepep --deepep-mode normal \
--tokenizer-worker-num 16 \
--dp-size 2 --enable-dp-attention --dtype bfloat16 --load-balance-method round_robin \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path $EAGLE_MODEL_PATH \
--speculative-num-steps 2 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 3 \
--speculative-draft-model-quantization unquant --skip-server-warmup
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export HCCL_BUFFSIZE=1600
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=640
export HCCL_SOCKET_IFNAME=your_nic
export GLOO_SOCKET_IFNAME=your_nic
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS=96
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--cuda-graph-bs 2 4 8 \
--port 33000 --trust-remote-code \
--tp-size 16 --mem-fraction-static 0.76 --attention-backend ascend --device npu --quantization modelslim \
--nnodes 1 --node-rank $i --dist-init-addr $D_MASTER \
--disaggregation-transfer-backend ascend --max-running-requests 80 \
--chunked-prefill-size -1 --moe-a2a-backend ascend_fuseep --deepep-mode low_latency \
--tokenizer-worker-num 8 \
--dp-size 2 --enable-dp-attention --dtype bfloat16 \
--load-balance-method round_robin \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path $EAGLE_MODEL_PATH \
--speculative-num-steps 2 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 3 \
--speculative-draft-model-quantization unquant
NODE_RANK=$i
break
fi
done
Command
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy round_robin \
--prefill http://your_prefill_ip:32000 8998 \
--decode http://your_decode_ip:33000 \
--host 127.0.0.1 \
--mini-lb \
--port 6688
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --random-input-len 131072 --random-output-len 1024 --num-prompts 192 --random-range-ratio 1 --max-concurrency 48
Kimi K2.5 w4a8 3_5K-1_5K 20ms on A3 8 Cards Mixed Mode
Model: Kimi-K2.5-w4a8
Hardware: Atlas 800I A3 8Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 20ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export STREAMS_PER_DEVICE=32
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=48
export HCCL_BUFFSIZE=1200
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200
MODEL_PATH=xxx
DRAFT_PATH=xxx
python3 -m sglang.launch_server \
--model-path $MODEL_PATH --quantization modelslim --dtype bfloat16 \
--model-loader-extra-config '{"enable_multithread_load": true}' \
--host 0.0.0.0 --port 6699 \
--trust-remote-code --device npu --attention-backend ascend \
--tp-size 16 --base-gpu-id 0 --mem-fraction-static 0.78 --max-running-requests 64 \
--chunked-prefill-size 32768 --context-length 8192 --max-prefill-tokens 16384 \
--enable-multimodal --mm-attention-backend ascend_attn --sampling-backend ascend \
--enable-dp-attention --dp-size 16 \
--moe-a2a-backend deepep --deepep-mode auto \
--cuda-graph-bs 1 2 3 4 --disable-radix-cache \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path $DRAFT_PATH \
--speculative-num-steps 4 --speculative-eagle-topk 1 \
--speculative-num-draft-tokens 5 \
--speculative-draft-model-quantization unquant
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 64 --random-output-len 1500 --random-input-len 3500 --num-prompts 64
Kimi K2.5 w4a8 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode
Model: Kimi-K2.5-w4a8
Hardware: Atlas 800I A3 8Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export STREAMS_PER_DEVICE=32
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=96
export HCCL_BUFFSIZE=1200
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200
MODEL_PATH=xxx
DRAFT_PATH=xxx
python3 -m sglang.launch_server \
--model-path $MODEL_PATH --quantization modelslim --dtype bfloat16 \
--model-loader-extra-config '{"enable_multithread_load": true}' \
--host 0.0.0.0 --port 6699 \
--trust-remote-code --device npu --attention-backend ascend \
--tp-size 16 --base-gpu-id 0 --mem-fraction-static 0.7 --max-running-requests 120 \
--chunked-prefill-size 32768 --context-length 8192 --max-prefill-tokens 16384 \
--enable-multimodal --mm-attention-backend ascend_attn --sampling-backend ascend \
--enable-dp-attention --dp-size 16 \
--moe-a2a-backend deepep --deepep-mode auto \
--cuda-graph-bs 1 2 4 8 12 16 24 32 48 64 96 120 --disable-radix-cache \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path $DRAFT_PATH \
--speculative-num-steps 4 --speculative-eagle-topk 1 \
--speculative-num-draft-tokens 5 \
--speculative-draft-model-quantization unquant
Benchmark
We tested it on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 120 --random-output-len 1500 --random-input-len 3500 --num-prompts 120
GLM-5.1 3_5K-1_5K 41ms on A3 16 Cards Mixed Mode
Model: GLM-5.1
The model is quantized, with MTP layers excluded from quantization.
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PYTHONPATH=/path/to/sglang/python:$PYTHONPATH
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=your_nic
export GLOO_SOCKET_IFNAME=your_nic
MODEL_PATH=/path/to/GLM-5.1-w4a8
P_IP=('your ip1' 'your ip2')
P_MASTER="${P_IP[0]}:4567"
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_BUFFSIZE=2500
python -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--dist-init-addr ${P_IP[0]}:5000 \
--tp-size 32 --nnodes 2 --node-rank $i \
--dp-size 16 --enable-dp-attention \
--chunked-prefill-size 131072 --max-prefill-tokens 280000 \
--trust-remote-code \
--host 127.0.0.1 \
--mem-fraction-static 0.65 \
--port 8001 \
--served-model-name glm-5 \
--cuda-graph-max-bs 8 \
--max-running-requests 128 \
--quantization modelslim \
--speculative-draft-model-quantization unquant \
--moe-a2a-backend deepep --deepep-mode auto \
--load-balance-method round_robin \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
NODE_RANK=$i
break
fi
done
Quantization Configuration:
`--quantization modelslim` applies only to quantized models. `--speculative-draft-model-quantization unquant` should be set according to the model specs: enable it when the MTP layers are not quantized.
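The rule above can be sketched as a small shell helper that assembles the two flags from what is known about the checkpoint. The function name `quant_flags` and its two 0/1 arguments are hypothetical and only illustrate the decision; they are not part of SGLang.

```shell
# Hypothetical helper: print the quantization flags for sglang.launch_server.
#   $1 = 1 if the main model weights are quantized, else 0
#   $2 = 1 if the MTP (draft) layers were left unquantized, else 0
quant_flags() {
  local model_quantized="$1"
  local mtp_unquantized="$2"
  local flags=""
  if [ "$model_quantized" = "1" ]; then
    flags="--quantization modelslim"
  fi
  if [ "$mtp_unquantized" = "1" ]; then
    # Append with a separating space only when another flag is already present.
    flags="${flags:+$flags }--speculative-draft-model-quantization unquant"
  fi
  printf '%s\n' "$flags"
}

# Quantized checkpoint with unquantized MTP layers, as in the command above.
quant_flags 1 1
```

The output can be spliced into the launch command, for example `python -m sglang.launch_server ... $(quant_flags 1 1)`.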
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 8001 --random-range-ratio 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 320
GLM-5.1 16K-1K 23ms on A3 32 Cards Disaggregation Mode
Model: GLM-5.1
Hardware: Atlas 800I A3 32Card
DeployMode: PD Disaggregation
Dataset: random
Input Output Length: 16K+1K
TPOT: 23ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH}
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTHONPATH=/path/to/sglang/python:$PYTHONPATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
# Define the node lists first: ASCEND_MF_STORE_URL below expands ${P_IP[0]}.
P_IP=('your prefill ip1' 'your prefill ip2')
D_IP=('your decode ip1' 'your decode ip2')
export ASCEND_MF_STORE_URL="tcp://${P_IP[0]}:24707"
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
MODEL_PATH=/path/to/GLM-5.1-w4a8
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export TASK_QUEUE_ENABLE=2
export ENABLE_PROFILING=0
export HCCL_SOCKET_IFNAME=your_nic
export GLOO_SOCKET_IFNAME=your_nic
export HCCL_BUFFSIZE=8
unset PYTORCH_NPU_ALLOC_CONF
export SGLANG_ZBAL_LOCAL_MEM_SIZE=61184
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
export SGLANG_ZBAL_BOOTSTRAP_URL="tcp://${P_IP[0]}:24672"
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port 8998 --dist-init-addr ${P_IP[0]}:5000 --trust-remote-code --nnodes 2 --node-rank $i \
--tp-size 32 --mem-fraction-static 0.75 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 64 \
--served-model-name glm-5 --chunked-prefill-size 524288 --max-prefill-tokens 180000 --moe-a2a-backend deepep --deepep-mode normal \
--disable-shared-experts-fusion --disable-cuda-graph --dtype bfloat16 \
--dp-size 4 --enable-dp-attention \
--load-balance-method round_robin \
--enable-nsa-prefill-context-parallel \
--nsa-prefill-cp-mode in-seq-split \
--attn-cp-size 8 \
--enable-dp-lm-head --moe-dense-tp 1 \
--speculative-draft-model-quantization unquant \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_SPEC_ENABLE_OVERLAP_REFLOW=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=650
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
export TASK_QUEUE_ENABLE=0
export HCCL_SOCKET_IFNAME=your_nic
export GLOO_SOCKET_IFNAME=your_nic
export SGLANG_NPU_USE_MULTI_STREAM=1
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8003 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --ep-size 32 \
--mem-fraction-static 0.87 --max-running-requests 128 --attention-backend ascend --device npu --quantization modelslim \
--served-model-name glm-5 --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency \
--cuda-graph-bs 1 2 3 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 180000 \
--tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 --load-balance-method round_robin \
--speculative-draft-model-quantization unquant \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
NODE_RANK=$i
break
fi
done
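Every node runs the same script and picks its role by matching its own addresses against the `P_IP` and `D_IP` lists; the index of the match becomes `--node-rank`. A minimal sketch of that lookup as a reusable function is shown here. The name `find_node_rank` and the example addresses are hypothetical, and unlike the loops above, which compare only the first two fields of `hostname -I`, this version checks every local address it is given.

```shell
# Hypothetical helper: print the index of the first node-list entry that
# matches one of this host's addresses; return 1 if the host is not listed.
#   $1 = space-separated local addresses (for example the output of `hostname -I`)
#   remaining args = node list in rank order
find_node_rank() {
  local local_addrs="$1"
  shift
  local rank=0
  local node addr
  for node in "$@"; do
    # Unquoted on purpose: split the address string on whitespace.
    for addr in $local_addrs; do
      if [ "$addr" = "$node" ]; then
        echo "$rank"
        return 0
      fi
    done
    rank=$((rank + 1))
  done
  return 1
}

# Made-up addresses: the host owns 10.0.0.2, the second entry, so this prints 1.
find_node_rank "192.168.1.5 10.0.0.2" 10.0.0.1 10.0.0.2
```

In the scripts above it could be called as `find_node_rank "$(hostname -I)" "${P_IP[@]}"`, launching the prefill server only when the call succeeds.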
Command
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy round_robin \
--prefill http://your_prefill_ip1:8000 8998 \
--decode http://your_decode_ip1:8003 \
--host 127.0.0.1 \
--port 6688
Benchmark
We tested it on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --random-range-ratio 1 --random-output-len 1000 --random-input-len 16000 --num-prompts 192
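To relate the benchmark parameters to the reported TPOT, the arithmetic can be sketched in shell. This assumes that TPOT is averaged over the output tokens after the first one, and that `--random-range-ratio 1` gives every request exactly the requested input and output lengths; the function names are hypothetical.

```shell
# Rough decode time for one request in milliseconds: (output_len - 1) * TPOT.
decode_ms() {
  local output_len="$1"
  local tpot_ms="$2"
  echo $(( (output_len - 1) * tpot_ms ))
}

# Total tokens processed by a run: num_prompts * (input_len + output_len).
total_tokens() {
  local num_prompts="$1"
  local input_len="$2"
  local output_len="$3"
  echo $(( num_prompts * (input_len + output_len) ))
}

decode_ms 1000 23            # 16K+1K case at a 23 ms TPOT: prints 22977
total_tokens 192 16000 1000  # 192 prompts of 16000+1000 tokens: prints 3264000
```

Under these assumptions, each request spends roughly 23 seconds in decode, excluding prefill and queueing time.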
GLM-5.1 64K-1K-90%_cache_hit 45ms on A3 48 Cards Disaggregation Mode
Model: GLM-5.1
Hardware: Atlas 800I A3 48Card
DeployMode: PD Disaggregation
Dataset: random (90% cache hit)
Input Output Length: 64K+1K
TPOT: 45ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH}
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTHONPATH=/path/to/sglang/python:$PYTHONPATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
# Define the node lists first: ASCEND_MF_STORE_URL below expands ${P_IP[0]}.
P_IP=('your prefill ip1' 'your prefill ip2' 'your prefill ip3' 'your prefill ip4')
D_IP=('your decode ip1' 'your decode ip2')
export ASCEND_MF_STORE_URL="tcp://${P_IP[0]}:24709"
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=1200
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=1200
MODEL_PATH=/path/to/GLM-5.1-w4a8
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export TASK_QUEUE_ENABLE=2
export ENABLE_PROFILING=0
export HCCL_SOCKET_IFNAME=your_nic
export GLOO_SOCKET_IFNAME=your_nic
export ZBAL_HCCL_OP="send,recv"
export HCCL_BUFFSIZE=128
unset PYTORCH_NPU_ALLOC_CONF
export SGLANG_ZBAL_LOCAL_MEM_SIZE=61184
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
export SGLANG_ZBAL_BOOTSTRAP_URL="tcp://${P_IP[$i]}:24691"
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port $((8998 + i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 4 --mem-fraction-static 0.72 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 16 \
--served-model-name glm-5 --chunked-prefill-size 16384 --max-prefill-tokens 180000 --moe-a2a-backend deepep --deepep-mode normal \
--disable-shared-experts-fusion --disable-cuda-graph --dtype bfloat16 \
--speculative-draft-model-quantization unquant \
--enable-nsa-prefill-context-parallel \
--nsa-prefill-cp-mode in-seq-split \
--attn-cp-size 4 \
--enable-dp-lm-head --moe-dense-tp 1 \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--pp-size 4
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_SPEC_ENABLE_OVERLAP_REFLOW=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=300
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=40
export TASK_QUEUE_ENABLE=0
export HCCL_SOCKET_IFNAME=your_nic
export GLOO_SOCKET_IFNAME=your_nic
export SGLANG_NPU_USE_MULTI_STREAM=1
export SGLANG_LM_HEAD_TP=4
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8003 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --enable-dp-attention --ep-size 32 \
--mem-fraction-static 0.85 --max-running-requests 320 --attention-backend ascend --device npu --quantization modelslim \
--served-model-name glm-5 --moe-a2a-backend deepep --deepep-mode low_latency \
--cuda-graph-bs 1 2 3 4 5 6 7 8 9 10 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 180000 \
--tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 --load-balance-method round_robin \
--speculative-draft-model-quantization unquant \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--disaggregation-decode-enable-radix-cache
NODE_RANK=$i
break
fi
done
Command
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy round_robin \
--prefill http://your_prefill_ip1:8000 8998 \
--prefill http://your_prefill_ip2:8000 8999 \
--prefill http://your_prefill_ip3:8000 9000 \
--prefill http://your_prefill_ip4:8000 9001 \
--decode http://your_decode_ip1:8003 \
--host 127.0.0.1 \
--port 6688
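The router needs one `--prefill URL BOOTSTRAP_PORT` pair per prefill instance, and the bootstrap ports must match the `$((8998 + i))` numbering used when the prefill servers were launched. The sketch below generates those pairs from a host list; the helper name `prefill_router_args` is hypothetical and is shown only to make the port mapping explicit.

```shell
# Hypothetical helper: print one "--prefill URL BOOTSTRAP_PORT" pair per host,
# numbering bootstrap ports upward from a base port.
#   $1 = HTTP port of the prefill servers
#   $2 = bootstrap port of the first prefill server
#   remaining args = prefill host names or IPs, in launch order
prefill_router_args() {
  local http_port="$1"
  local base_port="$2"
  shift 2
  local i=0
  local host
  for host in "$@"; do
    printf '%s http://%s:%s %s\n' "--prefill" "$host" "$http_port" "$((base_port + i))"
    i=$((i + 1))
  done
}

prefill_router_args 8000 8998 your_prefill_ip1 your_prefill_ip2 your_prefill_ip3 your_prefill_ip4
```

For the four placeholder hosts this reproduces the four `--prefill` lines of the router command above, with bootstrap ports 8998 through 9001.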
Benchmark
We tested it on the RANDOM dataset with a 90% cache hit rate; the dataset is generated with this tool.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --random-range-ratio 1 --random-output-len 1000 --random-input-len 64000 --num-prompts 192
GLM-5.1 128K-1K-90%_cache_hit 32ms on A3 48 Cards Disaggregation Mode
Model: GLM-5.1
Hardware: Atlas 800I A3 48Card
DeployMode: PD Disaggregation
Dataset: random (90% cache hit)
Input Output Length: 128K+1K
TPOT: 32ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH}
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTHONPATH=/path/to/sglang/python:$PYTHONPATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
# Define the node lists first: ASCEND_MF_STORE_URL below expands ${P_IP[0]}.
P_IP=('your prefill ip1' 'your prefill ip2')
P1_IP=('your prefill ip3' 'your prefill ip4')
D_IP=('your decode ip1' 'your decode ip2')
export ASCEND_MF_STORE_URL="tcp://${P_IP[0]}:24709"
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=1200
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=1200
MODEL_PATH=/path/to/GLM-5.1-w4a8
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill group 1
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export TASK_QUEUE_ENABLE=2
export ENABLE_PROFILING=0
export HCCL_SOCKET_IFNAME=your_nic
export GLOO_SOCKET_IFNAME=your_nic
export ZBAL_HCCL_OP="send,recv"
export HCCL_BUFFSIZE=128
unset PYTORCH_NPU_ALLOC_CONF
export SGLANG_ZBAL_LOCAL_MEM_SIZE=61184
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
export SGLANG_ZBAL_BOOTSTRAP_URL="tcp://${P_IP[0]}:24691"
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port 8998 --trust-remote-code --nnodes 2 --node-rank $i --dist-init-addr ${P_IP[0]}:5000 \
--tp-size 4 --mem-fraction-static 0.72 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 32 \
--served-model-name glm-5 --chunked-prefill-size 16384 --max-prefill-tokens 180000 --moe-a2a-backend deepep --deepep-mode normal \
--disable-shared-experts-fusion --disable-cuda-graph --dtype bfloat16 \
--speculative-draft-model-quantization unquant \
--enable-nsa-prefill-context-parallel \
--nsa-prefill-cp-mode in-seq-split \
--attn-cp-size 4 \
--enable-dp-lm-head --moe-dense-tp 1 \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--pp-size 8
NODE_RANK=$i
break
fi
done
# prefill group 2
for i in "${!P1_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P1_IP[$i]}" || "$LOCAL_HOST2" == "${P1_IP[$i]}" ]];
then
echo "${P1_IP[$i]}"
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export TASK_QUEUE_ENABLE=2
export ENABLE_PROFILING=0
export HCCL_SOCKET_IFNAME=your_nic
export GLOO_SOCKET_IFNAME=your_nic
export ZBAL_HCCL_OP="send,recv"
export HCCL_BUFFSIZE=128
unset PYTORCH_NPU_ALLOC_CONF
export SGLANG_ZBAL_LOCAL_MEM_SIZE=61184
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
export SGLANG_ZBAL_BOOTSTRAP_URL="tcp://${P1_IP[0]}:24691"
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P1_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port 8999 --trust-remote-code --nnodes 2 --node-rank $i --dist-init-addr ${P1_IP[0]}:5000 \
--tp-size 4 --mem-fraction-static 0.72 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 32 \
--served-model-name glm-5 --chunked-prefill-size 16384 --max-prefill-tokens 180000 --moe-a2a-backend deepep --deepep-mode normal \
--disable-shared-experts-fusion --disable-cuda-graph --dtype bfloat16 \
--speculative-draft-model-quantization unquant \
--enable-nsa-prefill-context-parallel \
--nsa-prefill-cp-mode in-seq-split \
--attn-cp-size 4 \
--enable-dp-lm-head --moe-dense-tp 1 \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--pp-size 8
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_SPEC_ENABLE_OVERLAP_REFLOW=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=200
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=24
export TASK_QUEUE_ENABLE=0
export HCCL_SOCKET_IFNAME=your_nic
export GLOO_SOCKET_IFNAME=your_nic
export SGLANG_NPU_USE_MULTI_STREAM=1
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8003 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --enable-dp-attention --ep-size 32 \
--mem-fraction-static 0.865 --max-running-requests 96 --attention-backend ascend --device npu --quantization modelslim \
--served-model-name glm-5 --moe-a2a-backend deepep --deepep-mode low_latency \
--cuda-graph-bs 1 2 3 4 5 6 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 \
--tokenizer-worker-num 32 --disable-shared-experts-fusion --dtype bfloat16 --load-balance-method round_robin \
--speculative-draft-model-quantization unquant \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--disaggregation-decode-enable-radix-cache
NODE_RANK=$i
break
fi
done
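The launch commands above combine `--tp-size`, `--pp-size`, and `--nnodes` differently for prefill and decode. A quick sanity check of how many ranks each node must host can be sketched as follows, assuming the total rank count of one instance is `tp_size * pp_size` and that ranks are spread evenly over its nodes. The helper name `ranks_per_node` is hypothetical.

```shell
# Hypothetical sanity check: print the number of ranks each node hosts, or fail
# if tp_size * pp_size cannot be split evenly across the nodes.
ranks_per_node() {
  local tp="$1"
  local pp="$2"
  local nnodes="$3"
  local world=$((tp * pp))
  if [ $((world % nnodes)) -ne 0 ]; then
    echo "tp*pp=$world is not divisible by nnodes=$nnodes" >&2
    return 1
  fi
  echo $((world / nnodes))
}

ranks_per_node 4 8 2    # prefill group: --tp-size 4 --pp-size 8 --nnodes 2
ranks_per_node 32 1 2   # decode: --tp-size 32, no pipeline parallelism, --nnodes 2
```

Both calls print 16, so under this assumption the prefill groups and the decode instance place the same number of ranks on every node.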
Command
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy round_robin \
--prefill http://your_prefill_ip1:8000 8998 \
--prefill http://your_prefill_ip3:8000 8999 \
--decode http://your_decode_ip1:8003 \
--host 127.0.0.1 \
--port 6688
Benchmark
We tested it on the RANDOM dataset with a 90% cache hit rate; the dataset is generated with this tool.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --random-range-ratio 1 --random-output-len 1000 --random-input-len 131072 --num-prompts 192
