DeepSeek Series Models
Low Latency
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 6K+1.6K | 20ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.9K+1K | 19ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.5K+1.5K | 19ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.5K+1K | 19ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-V3.2 | Atlas 800I A3 | 32 | PD Disaggregation | 128K+1K | 26ms | W8A8 INT8 | Optimal Configuration |
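TPOT (time per output token) in these tables is the mean decode-phase latency per generated token. A minimal sketch of the usual computation, assuming TPOT excludes the first token (which is counted under TTFT):

```python
def tpot_ms(e2e_latency_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Mean time per output token, excluding the first token (covered by TTFT)."""
    if output_tokens <= 1:
        raise ValueError("TPOT is undefined for a single output token")
    return (e2e_latency_ms - ttft_ms) / (output_tokens - 1)

# e.g. a 1,501-token completion finishing 76s after a 1s TTFT averages 50ms/token
print(tpot_ms(76_000, 1_000, 1_501))  # → 50.0
```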
High Throughput
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 24 | PD Disaggregation | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 8 | PD Mixed | 2K+2K | 50ms | W4A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 16 | PD Disaggregation | 2K+2K | 50ms | W4A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W4A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 16 | PD Disaggregation | 3.5K+1.5K | 50ms | W4A8 INT8 | Optimal Configuration |
Qwen Series Models
Low Latency
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 11K+1K | 10ms | BF16 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 4 | PD Mixed | 6K+1.5K | 18ms | BF16 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 4 | PD Mixed | 4K+1.5K | 11ms | BF16 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 8 | PD Mixed | 18K+4K | 6ms | BF16 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 6K+1.5K | 18ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 4K+1.5K | 11ms | BF16 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 1K+0.3K | 12ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 6K+1.5K | 17ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 1K+0.3K | 7ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 6K+1.5K | 12ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 5ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-30B-A3B | Atlas 800I A3 | 1 | PD Mixed | 6K+1.5K | 10ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-30B-A3B | Atlas 800I A3 | 1 | PD Mixed | 1K+0.3K | 7ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Next-80B-A3B-Instruct | Atlas 800I A3 | 2 | PD Mixed | 1K+0.3K | 14.21ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Next-80B-A3B-Instruct | Atlas 800I A3 | 2 | PD Mixed | 6K+1.5K | 15.62ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Next-80B-A3B-Instruct | Atlas 800I A3 | 2 | PD Mixed | 3.5K+1.5K | 20ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-14B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 9ms | W8A8 INT8 | Optimal Configuration |
High Throughput
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| Qwen3-235B-A22B | Atlas 800I A3 | 24 | PD Disaggregation | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 2K+2K | 100ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-235B-A22B | Atlas 800I A3 | 16 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-30B-A3B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Coder-480B-A35B-Instruct | Atlas 800I A3 | 24 | PD Disaggregation | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Coder-480B-A35B-Instruct | Atlas 800I A3 | 16 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Coder-480B-A35B-Instruct | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Next-80B-A3B-Instruct | Atlas 800I A3 | 2 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-14B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
Optimal Configuration
DeepSeek-R1 3_5K-1_5K 50ms on A3 32 Cards Disaggregation Mode
Model: DeepSeek-R1 | Hardware: Atlas 800I A3, 32 cards | Deploy Mode: PD Disaggregation | Dataset: random | Input/Output Length: 3.5K+1.5K | TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669"
P_IP=('your prefill ip1' 'your prefill ip2')
D_IP=('your decode ip1' 'your decode ip2')
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export SGLANG_USE_AG_AFTER_QLORA=1
export HCCL_BUFFSIZE=800
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export TASK_QUEUE_ENABLE=2
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=131072
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 16 --mem-fraction-static 0.778 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 16 --disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 60000 --moe-a2a-backend ascend_fuseep --deepep-mode normal \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--dp-size 4 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=600
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
export TASK_QUEUE_ENABLE=1
export SGLANG_NPU_FUSED_MOE_MODE=1
export SGLANG_LM_HEAD_TP=8
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 \
--mem-fraction-static 0.82 --max-running-requests 1024 --attention-backend ascend --device npu --quantization modelslim \
--moe-a2a-backend ascend_fuseep --enable-dp-attention --deepep-mode low_latency --moe-dense-tp 1 \
--cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \
--load-balance-method round_robin
NODE_RANK=$i
break
fi
done
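Both prefill and decode servers enable NEXTN speculative decoding (`--speculative-num-steps 1 --speculative-num-draft-tokens 2`), which lowers effective TPOT by emitting more than one token per decode step. A rough geometric model of expected tokens per step, assuming an independent per-draft acceptance rate (an illustrative simplification, not SGLang's internal accounting):

```python
def expected_tokens_per_step(num_draft_tokens: int, acceptance_rate: float) -> float:
    # Each verify step emits at least 1 token; each additional draft token is
    # emitted only if every draft before it was accepted (geometric model).
    return sum(acceptance_rate ** i for i in range(num_draft_tokens))

# --speculative-num-draft-tokens 2 with 80% acceptance ≈ 1.8 tokens per step
print(expected_tokens_per_step(2, 0.8))
```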
Command
export SGLANG_DP_ROUND_ROBIN=1
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://P_IP1:8000 8998 \
--prefill http://P_IP2:8000 8999 \
--decode http://D_IP1:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
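Once the router is up, requests go through its native `/generate` endpoint on port 6688. A minimal client payload sketch (field names follow SGLang's native generate API; verify against your SGLang version):

```python
import json

def generate_payload(prompt: str, max_new_tokens: int = 1500) -> str:
    # SGLang's native /generate endpoint takes the prompt under "text" and
    # decoding options under "sampling_params".
    return json.dumps({
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0.0},
    })

# POST this body to http://127.0.0.1:6688/generate with Content-Type: application/json
body = generate_payload("Hello")
```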
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 768 --random-input-len 3500 --random-output-len 1500 --num-prompts 3072 --random-range-ratio 1 --request-rate 16
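Across these recipes, `--num-prompts` is consistently 4× `--max-concurrency` (3072 = 4 × 768 here), so each concurrency slot is measured over several complete requests. A small helper capturing that convention (the 4× factor is an observed pattern in these configs, not a requirement of `bench_serving`):

```python
def bench_args(max_concurrency: int, input_len: int, output_len: int, rounds: int = 4) -> dict:
    # rounds=4 mirrors the num-prompts = 4 * max-concurrency pattern used above
    return {
        "max_concurrency": max_concurrency,
        "num_prompts": max_concurrency * rounds,
        "random_input_len": input_len,
        "random_output_len": output_len,
        "random_range_ratio": 1,
    }

print(bench_args(768, 3500, 1500)["num_prompts"])  # → 3072
```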
DeepSeek-R1 2K-2K 50ms on A3 24 Cards Disaggregation Mode
Model: DeepSeek-R1 | Hardware: Atlas 800I A3, 24 cards | Deploy Mode: PD Disaggregation | Dataset: random | Input/Output Length: 2K+2K | TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669"
P_IP=('your prefill ip1')
D_IP=('your decode ip1' 'your decode ip2')
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_BUFFSIZE=1600
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export TASK_QUEUE_ENABLE=2
export SGLANG_USE_AG_AFTER_QLORA=1
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 16 --mem-fraction-static 0.8 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 20 --context-length 8192 --disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--dp-size 4 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=800
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=102
export TASK_QUEUE_ENABLE=1
export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1
export SGLANG_NPU_FUSED_MOE_MODE=1
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 \
--mem-fraction-static 0.81 --max-running-requests 1088 --attention-backend ascend --device npu --quantization modelslim \
--moe-a2a-backend ascend_fuseep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \
--cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
--speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \
--tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \
--load-balance-method round_robin
NODE_RANK=$i
break
fi
done
Command
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://P_IP1:8000 8998 \
--decode http://D_IP1:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang \
--host 127.0.0.1 \
--port 6688 \
--max-concurrency 1088 \
--random-input-len 2048 \
--random-output-len 2048 \
--num-prompts 12800 \
--random-range-ratio 1 \
--request-rate 24
DeepSeek-R1 6K-1_6K 20ms on A3 32 Cards Disaggregation Mode
Model: DeepSeek-R1 | Hardware: Atlas 800I A3, 32 cards | Deploy Mode: PD Disaggregation | Dataset: random | Input/Output Length: 6K+1.6K | TPOT: 20ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669"
P_IP=('your prefill ip1' 'your prefill ip2')
D_IP=('your decode ip1' 'your decode ip2')
MODEL_PATH=xxx
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_BUFFSIZE=1536
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export TASK_QUEUE_ENABLE=2
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 16 --mem-fraction-static 0.81 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 4 --disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--dp-size 2 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=650
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=16
export TASK_QUEUE_ENABLE=1
export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 8 \
--mem-fraction-static 0.75 --max-running-requests 32 --attention-backend ascend --device npu --quantization modelslim \
--moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \
--cuda-graph-bs 2 4 6 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \
--load-balance-method round_robin
NODE_RANK=$i
break
fi
done
Command
export SGLANG_DP_ROUND_ROBIN=1
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://P_IP1:8000 8998 \
--prefill http://P_IP2:8000 8999 \
--decode http://D_IP1:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang \
--host 127.0.0.1 \
--port 6688 \
--max-concurrency 32 \
--random-input-len 6000 \
--random-output-len 1600 \
--num-prompts 32 \
--random-range-ratio 1 \
--request-rate 16
DeepSeek-R1 3_9K-1K 19ms on A3 32 Cards Disaggregation Mode
Model: DeepSeek-R1 | Hardware: Atlas 800I A3, 32 cards | Deploy Mode: PD Disaggregation | Dataset: random | Input/Output Length: 3.9K+1K | TPOT: 19ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669"
P_IP=('your prefill ip1' 'your prefill ip2')
D_IP=('your decode ip1' 'your decode ip2')
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_BUFFSIZE=1536
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export TASK_QUEUE_ENABLE=2
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 16 --mem-fraction-static 0.81 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 4 --context-length 8192 --disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--dp-size 2 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=650
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=12
export TASK_QUEUE_ENABLE=1
export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 16 \
--mem-fraction-static 0.75 --max-running-requests 32 --attention-backend ascend --device npu --quantization modelslim \
--moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \
--cuda-graph-bs 2 4 6 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \
--load-balance-method round_robin
NODE_RANK=$i
break
fi
done
Command
export SGLANG_DP_ROUND_ROBIN=1
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://P_IP1:8000 8998 \
--prefill http://P_IP2:8000 8999 \
--decode http://D_IP1:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang \
--host 127.0.0.1 \
--port 6688 \
--max-concurrency 32 \
--random-input-len 3900 \
--random-output-len 1024 \
--num-prompts 32 \
--random-range-ratio 1 \
--request-rate 16
DeepSeek-R1 3_5K-1_5K 19ms on A3 32 Cards Disaggregation Mode
Model: DeepSeek-R1 | Hardware: Atlas 800I A3, 32 cards | Deploy Mode: PD Disaggregation | Dataset: random | Input/Output Length: 3.5K+1.5K | TPOT: 19ms
Model Deployment
For deployment, refer to DeepSeek-R1 3_9K-1K 19ms on A3 32 Cards Disaggregation Mode.
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang \
--host 127.0.0.1 \
--port 6688 \
--max-concurrency 32 \
--random-input-len 3500 \
--random-output-len 1500 \
--num-prompts 32 \
--random-range-ratio 1 \
--request-rate 16
DeepSeek-R1 3_5K-1K 19ms on A3 32 Cards Disaggregation Mode
Model: DeepSeek-R1 | Hardware: Atlas 800I A3, 32 cards | Deploy Mode: PD Disaggregation | Dataset: random | Input/Output Length: 3.5K+1K | TPOT: 19ms
Model Deployment
For deployment, refer to DeepSeek-R1 3_9K-1K 19ms on A3 32 Cards Disaggregation Mode.
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang \
--host 127.0.0.1 \
--port 6688 \
--max-concurrency 32 \
--random-input-len 3500 \
--random-output-len 1024 \
--num-prompts 32 \
--random-range-ratio 1 \
--request-rate 16
DeepSeek-R1 2K-2K 50ms on A3 8 Cards Mixed Mode
Model: DeepSeek-R1 | Hardware: Atlas 800I A3, 8 cards | Deploy Mode: PD Mixed | Dataset: random | Input/Output Length: 2K+2K | TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=88
export HCCL_BUFFSIZE=1600
export DEEPEP_NORMAL_LONG_SEQ_ROUND=10
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=512
MODEL_PATH=xxx
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_USE_FIA_NZ=1
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--tp 16 \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--quantization modelslim \
--watchdog-timeout 9000 \
--host 127.0.0.1 --port 6699 \
--cuda-graph-bs 4 8 20 21 22 \
--mem-fraction-static 0.78 \
--max-running-requests 352 \
--disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 1500 \
--moe-a2a-backend deepep --deepep-mode auto \
--enable-dp-attention --dp-size 16 --enable-dp-lm-head \
--speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \
--dtype bfloat16
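With `--enable-dp-attention --dp-size 16`, the decode batch is sharded across DP ranks, so the largest useful `--cuda-graph-bs` entry is `max-running-requests` divided by `dp-size` (352 / 16 = 22 here, matching the largest captured graph size). A sketch of that arithmetic, assuming even round-robin distribution of requests:

```python
import math

def per_rank_graph_bs(max_running_requests: int, dp_size: int) -> int:
    # Each DP rank serves its share of the global concurrent batch;
    # ceiling division covers batches that don't divide evenly.
    return math.ceil(max_running_requests / dp_size)

print(per_rank_graph_bs(352, 16))  # → 22
```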
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 352 --random-input-len 2048 --random-output-len 2048 --num-prompts 1408 --random-range-ratio 1
DeepSeek-R1 2K-2K 50ms on A3 16 Cards Disaggregation Mode
Model: DeepSeek-R1 | Hardware: Atlas 800I A3, 16 cards | Deploy Mode: PD Disaggregation | Dataset: random | Input/Output Length: 2K+2K | TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667"
P_IP=('your prefill ip1')
D_IP=('your decode ip1')
MODEL_PATH=xxx
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export ENABLE_MOE_NZ=1
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_BUFFSIZE=2600
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export TASK_QUEUE_ENABLE=2
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 16 --mem-fraction-static 0.7 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 32 --context-length 8192 --disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 10240 --moe-a2a-backend deepep --deepep-mode normal \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--dp-size 8 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=900
export SGLANG_DP_ROUND_ROBIN=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=112
export TASK_QUEUE_ENABLE=1
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8001 --trust-remote-code --nnodes 1 --node-rank 0 --tp-size 16 --dp-size 16 \
--mem-fraction-static 0.8 --max-running-requests 448 --attention-backend ascend --device npu --quantization modelslim \
--moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head \
--cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--disable-shared-experts-fusion --dtype bfloat16 --tokenizer-worker-num 4 \
--load-balance-method round_robin
NODE_RANK=$i
break
fi
done
Command
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://P_IP1:8000 8998 \
--decode http://D_IP1:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 448 --random-input-len 2048 --random-output-len 2048 --num-prompts 1792 --random-range-ratio 1 --request-rate 32
DeepSeek-R1 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode
Model: DeepSeek-R1 | Hardware: Atlas 800I A3, 8 cards | Deploy Mode: PD Mixed | Dataset: random | Input/Output Length: 3.5K+1.5K | TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=56
export HCCL_BUFFSIZE=1200
export DEEPEP_NORMAL_LONG_SEQ_ROUND=10
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=512
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_USE_FIA_NZ=1
MODEL_PATH=xxx
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--tp 16 \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--quantization modelslim \
--watchdog-timeout 9000 \
--host 127.0.0.1 --port 6699 \
--cuda-graph-bs 4 8 12 14 \
--mem-fraction-static 0.77 \
--max-running-requests 224 \
--context-length 8188 --disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 3000 \
--moe-a2a-backend deepep --deepep-mode auto \
--enable-dp-attention --dp-size 16 --enable-dp-lm-head \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--dtype bfloat16
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 224 --random-input-len 3500 --random-output-len 1500 --num-prompts 896 --random-range-ratio 1
DeepSeek-R1 3_5K-1_5K 50ms on A3 16 Cards Disaggregation Mode
Model: DeepSeek-R1 | Hardware: Atlas 800I A3, 16 cards | Deploy Mode: PD Disaggregation | Dataset: random | Input/Output Length: 3.5K+1.5K | TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667"
P_IP=('your prefill ip1')
D_IP=('your decode ip1')
MODEL_PATH=xxx
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export ENABLE_MOE_NZ=1
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_BUFFSIZE=3500
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export TASK_QUEUE_ENABLE=2
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 16 --mem-fraction-static 0.62 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 32 --context-length 8192 --disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 20480 --moe-a2a-backend deepep --deepep-mode normal \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--dp-size 8 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16
NODE_RANK=$i
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=800
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=78
export TASK_QUEUE_ENABLE=1
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8001 --trust-remote-code --nnodes 1 --node-rank 0 --tp-size 16 --dp-size 16 \
--mem-fraction-static 0.805 --max-running-requests 416 --attention-backend ascend --device npu --quantization modelslim \
--moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head \
--cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
--speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \
--disable-shared-experts-fusion --dtype bfloat16 --tokenizer-worker-num 4 \
--load-balance-method round_robin
NODE_RANK=$i
break
fi
done
Command
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://P_IP:8000 8998 \
--decode http://D_IP:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
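The router command above pairs each `--prefill` entry with the prefill server's URL followed by its bootstrap port, while `--decode` entries take only a URL. A sketch (helper names are hypothetical) that assembles the same argument list programmatically, e.g. when many P/D endpoints are generated from a host list:

```python
def router_args(prefills, decodes, host="127.0.0.1", port=6688):
    """Build the sglang_router.launch_router argument list.

    `prefills` is a list of (url, bootstrap_port) pairs; `decodes` is a
    list of URLs. Mirrors the flag layout of the command above.
    """
    args = ["--pd-disaggregation", "--policy", "cache_aware"]
    for url, bootstrap in prefills:
        args += ["--prefill", url, str(bootstrap)]
    for url in decodes:
        args += ["--decode", url]
    args += ["--host", host, "--port", str(port), "--mini-lb"]
    return args

print(" ".join(router_args([("http://P_IP:8000", 8998)],
                           ["http://D_IP:8001"])))
```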
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 416 --random-input-len 3500 --random-output-len 1500 --num-prompts 1664 --random-range-ratio 1
DeepSeek-V3.2 128K-1K 26ms on A3 32 Cards Disaggregation Mode
Model: DeepSeek-V3.2-W8A8 | Hardware: Atlas 800I A3 | Cards: 32 | Deploy Mode: PD Disaggregation | Dataset: random | Input/Output Length: 128K+1K | TPOT: 26ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH}
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24670"
P_IP=('your prefill ip1' 'your prefill ip2')
D_IP=('your decode ip1' 'your decode ip2')
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_BUFFSIZE=1200
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export TASK_QUEUE_ENABLE=2
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--tp 32 \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--watchdog-timeout 9000 \
--host ${P_IP[$i]} --port 8000 \
--mem-fraction-static 0.73 \
--disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 68000 \
--max-running-requests 1 \
--moe-a2a-backend deepep --deepep-mode normal \
--quantization modelslim \
--disaggregation-transfer-backend ascend \
--disaggregation-mode prefill \
--disable-cuda-graph \
--nnodes 2 --node-rank $i \
--disaggregation-bootstrap-port 8995 \
--moe-dense-tp-size 1 \
--enable-nsa-prefill-context-parallel \
--nsa-prefill-cp-mode in-seq-split \
--attn-cp-size 32 \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--dist-init-addr ${P_IP[0]}:10000
break
fi
done
# decode
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export TASK_QUEUE_ENABLE=0
export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
DP=8
export HCCL_BUFFSIZE=400
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=8
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--tp 32 \
--dp ${DP} \
--ep 32 \
--moe-dense-tp-size 1 \
--enable-dp-attention \
--enable-dp-lm-head \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--watchdog-timeout 9000 \
--host ${D_IP[$i]} --port 8001 \
--mem-fraction-static 0.79 \
--disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 68000 \
--max-running-requests 32 \
--cuda-graph-max-bs 4 \
--moe-a2a-backend deepep \
--deepep-mode low_latency \
--quantization modelslim \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--disaggregation-transfer-backend ascend \
--disaggregation-mode decode \
--nnodes 2 --node-rank $i \
--dist-init-addr ${D_IP[0]}:10000
break
fi
done
Command
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://P_IP1:8000 8995 \
--decode http://D_IP1:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 8 --random-input-len 131076 --random-output-len 1024 --num-prompts 8 --random-range-ratio 1
Qwen3-235B-A22B 3_5K-1_5K 50ms on A3 24 Cards Disaggregation Mode
Model: Qwen3-235B-A22B-W8A8 | Hardware: Atlas 800I A3 | Cards: 24 | Deploy Mode: PD Disaggregation | Dataset: random | Input/Output Length: 3.5K+1.5K | TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_DP_ROUND_ROBIN=1
export SGLANG_NPU_FUSED_MOE_MODE=2
MODEL_PATH=xxx
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667"
P_IP=('your prefill ip1')
D_IP=('your decode ip1' 'your decode ip2')
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=188416
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024
export DEEPEP_NORMAL_LONG_SEQ_ROUND=16
export HCCL_BUFFSIZE=4300
export TASK_QUEUE_ENABLE=2
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export STREAMS_PER_DEVICE=32
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
# Prefill node
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill \
--host ${P_IP[$i]} --port 8000 --disaggregation-bootstrap-port 8995 --trust-remote-code \
--nnodes 1 --node-rank $i --tp-size 16 --dp-size 16 --mem-fraction-static 0.6 \
--disable-radix-cache \
--attention-backend ascend --device npu --quantization modelslim --disaggregation-transfer-backend ascend \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--speculative-draft-model-quantization unquant \
--max-running-requests 128 --chunked-prefill-size 94208 --max-prefill-tokens 262144 \
--enable-dp-attention \
--moe-a2a-backend ascend_fuseep --dtype bfloat16
NODE_RANK=$i
break
fi
done
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export DP_ROUND_ROBIN=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=65536
export HCCL_BUFFSIZE=800
export HCCL_SOCKET_IFNAME=data0.3001
export GLOO_SOCKET_IFNAME=data0.3001
export STREAMS_PER_DEVICE=32
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode \
--host ${D_IP[$i]} --port 8001 --trust-remote-code \
--nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --mem-fraction-static 0.83 --max-running-requests 768 \
--attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \
--moe-a2a-backend ascend_fuseep --cuda-graph-bs 6 8 12 15 18 20 22 24 \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-draft-model-quantization unquant \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--dist-init-addr xxx:5000 \
--disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
--enable-dp-lm-head --dtype bfloat16 --tokenizer-worker-num 4 \
--load-balance-method round_robin
NODE_RANK=$i
break
fi
done
Command
export SGLANG_DP_ROUND_ROBIN=1
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://PIP:8000 8995 \
--decode http://DIP:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang-oai --host 127.0.0.1 --port 6688 --max-concurrency 860 --random-input-len 3500 --random-output-len 1500 --num-prompts 3440 --random-range-ratio 1
Qwen3-235B-A22B 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode
Model: Qwen3-235B-A22B-W8A8 | Hardware: Atlas 800I A3 | Cards: 8 | Deploy Mode: PD Mixed | Dataset: random | Input/Output Length: 3.5K+1.5K | TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=570
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=188416
export SGLANG_NPU_FUSED_MOE_MODE=2
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 432 --context-length 8192 --dtype bfloat16 \
--chunked-prefill-size 94208 --max-prefill-tokens 458880 --sampling-backend ascend \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--disable-radix-cache --moe-a2a-backend ascend_fuseep --speculative-draft-model-quantization unquant \
--tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.8 --cuda-graph-bs 1 2 4 8 16 20 24 26 27
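With `--enable-dp-attention` the running batch is sharded across DP ranks, so the largest captured graph size in `--cuda-graph-bs` should cover `max-running-requests / dp-size` (432 / 16 = 27 here, matching the last entry of the list above). A quick sanity check, assuming even sharding across ranks:

```python
import math

def check_graph_coverage(max_running_requests: int, dp_size: int,
                         cuda_graph_bs: list) -> bool:
    """True when the captured graph sizes cover the per-DP-rank batch."""
    per_rank = math.ceil(max_running_requests / dp_size)
    return max(cuda_graph_bs) >= per_rank

# Config above: 432 requests over 16 DP ranks -> 27 per rank.
print(check_graph_coverage(432, 16, [1, 2, 4, 8, 16, 20, 24, 26, 27]))  # True
```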
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 272 --random-input-len 3500 --random-output-len 1500 --num-prompts 1088 --random-range-ratio 1
Qwen3-235B-A22B 2K-2K 100ms on A3 8 Cards Mixed Mode
Model: Qwen3-235B-A22B-W8A8 | Hardware: Atlas 800I A3 | Cards: 8 | Deploy Mode: PD Mixed | Dataset: random | Input/Output Length: 2K+2K | TPOT: 100ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=1200
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=144
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 576 --context-length 8192 --dtype bfloat16 \
--chunked-prefill-size 32768 --max-prefill-tokens 458880 \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto --speculative-draft-model-quantization unquant \
--tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.84 --cuda-graph-bs 8 16 20 24 32 36
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 576 --random-input-len 2000 --random-output-len 2000 --num-prompts 576 --random-range-ratio 1
Qwen3-235B-A22B 2K-2K 50ms on A3 8 Cards Mixed Mode
Model: Qwen3-235B-A22B-W8A8 | Hardware: Atlas 800I A3 | Cards: 8 | Deploy Mode: PD Mixed | Dataset: random | Input/Output Length: 2K+2K | TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=450
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=147456
export SGLANG_NPU_FUSED_MOE_MODE=2
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 624 --context-length 8192 --dtype bfloat16 \
--chunked-prefill-size 73728 --max-prefill-tokens 458880 --speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--disable-radix-cache --moe-a2a-backend ascend_fuseep \
--tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.83 --cuda-graph-bs 4 8 16 24 28 29 30 32 34 36 37 38 39
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 480 --random-input-len 2048 --random-output-len 2048 --num-prompts 480 --random-range-ratio 1
Qwen3-235B-A22B 2K-2K 50ms on A3 16 Cards Mixed Mode
Model: Qwen3-235B-A22B-W8A8 | Hardware: Atlas 800I A3 | Cards: 16 | Deploy Mode: PD Mixed | Dataset: random | Input/Output Length: 2K+2K | TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=1600
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
export HCCL_OP_EXPANSION_MODE="AIV"
MIX_IP=('IP1' 'IP2')
for i in "${!MIX_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${MIX_IP[$i]}" || "$LOCAL_HOST2" == "${MIX_IP[$i]}" ]];
then
echo "${MIX_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path ${MODEL_PATH} \
--host 127.0.0.1 --port 7439 --trust-remote-code \
--nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --mem-fraction-static 0.8 --max-running-requests 768 \
--attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \
--moe-a2a-backend deepep --deepep-mode auto --cuda-graph-bs 6 8 10 12 18 24 \
--dist-init-addr ${MIX_IP[0]}:5000 --chunked-prefill-size 131072 --max-prefill-tokens 458880 \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx --speculative-draft-model-quantization unquant \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--context-length 8192 --disable-radix-cache \
--enable-dp-lm-head --dtype bfloat16
NODE_RANK=$i
break
fi
done
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 768 --random-input-len 2000 --random-output-len 2000 --num-prompts 768 --random-range-ratio 1
Qwen3-235B-A22B 11K-1K 10ms on A3 8 Cards Mixed Mode
Model: Qwen3-235B-A22B-W8A8 | Hardware: Atlas 800I A3 | Cards: 8 | Deploy Mode: PD Mixed | Dataset: random | Input/Output Length: 11K+1K | TPOT: 10ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=1600
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 1 --dtype bfloat16 \
--chunked-prefill-size -1 --max-prefill-tokens 16384 --speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--disable-radix-cache --enable-dp-lm-head \
--tp 16 --mem-fraction-static 0.78 --cuda-graph-bs 1
Benchmark
We tested against the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 1 --random-input-len 11000 --random-output-len 1000 --num-prompts 1 --random-range-ratio 1
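For single-request latency runs like this one, the TPOT target translates directly into per-request decode throughput and total generation time: a 10 ms TPOT means 100 tokens/s, or about 10 s to emit the 1K output above. The arithmetic, as a small sketch:

```python
def tpot_to_throughput(tpot_ms: float) -> float:
    """Per-request decode throughput (tokens/s) implied by a TPOT target."""
    return 1000.0 / tpot_ms

def decode_seconds(output_tokens: int, tpot_ms: float) -> float:
    """Wall-clock decode time for one request at the given TPOT."""
    return output_tokens * tpot_ms / 1000.0

print(tpot_to_throughput(10))    # 100.0 tokens/s
print(decode_seconds(1000, 10))  # 10.0 s for the 11K+1K run above
```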
Qwen3-32B 6K-1_5K 18ms on A3 4 Cards Mixed Mode
Model: Qwen3-32B | Hardware: Atlas 800I A3 | Cards: 4 | Deploy Mode: PD Mixed | Dataset: random | Input/Output Length: 6K+1.5K | TPOT: 18ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--max-running-requests 32 \
--disable-radix-cache \
--chunked-prefill-size 24576 --max-prefill-tokens 65536 \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 8 16 24 32 --dtype bfloat16
Benchmark
We tested against the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 32 --random-output-len 1500 --random-input-len 6000 --num-prompts 32 --random-range-ratio 1
Qwen3-32B 4K-1_5K 11ms on A3 4 Cards Mixed Mode
Model: Qwen3-32B | Hardware: Atlas 800I A3 | Cards: 4 | Deploy Mode: PD Mixed | Dataset: random | Input/Output Length: 4K+1.5K | TPOT: 11ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--max-running-requests 1 \
--disable-radix-cache \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--chunked-prefill-size 24576 --max-prefill-tokens 65536 \
--tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 1 --dtype bfloat16
Benchmark
We tested against the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 4096 --num-prompts 4
Qwen3-32B 18K-4K 6ms on A3 8 Cards Mixed Mode
Model: Qwen3-32B | Hardware: Atlas 800I A3 | Cards: 8 | Deploy Mode: PD Mixed | Dataset: random | Input/Output Length: 18K+4K | TPOT: 6ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--max-running-requests 1 \
--disable-radix-cache --speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--chunked-prefill-size -1 --max-prefill-tokens 65536 \
--tp-size 16 --mem-fraction-static 0.72 --cuda-graph-bs 1 --dtype bfloat16
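Every speculative-decoding config in this guide keeps `--speculative-num-draft-tokens` at `--speculative-num-steps + 1` with `--speculative-eagle-topk 1`: the verifier checks the drafted tokens plus one bonus token in the same pass. A helper (hypothetical) that derives the flag values, assuming that linear single-branch draft shape:

```python
def spec_flags(num_steps: int, topk: int = 1) -> dict:
    """Derive speculative-decoding flags for a linear (topk=1) draft.

    With one candidate per step, verification covers up to num_steps
    drafted tokens plus one bonus token, hence num_steps + 1.
    """
    return {
        "speculative-num-steps": num_steps,
        "speculative-eagle-topk": topk,
        "speculative-num-draft-tokens": num_steps * topk + 1,
    }

# Matches the command above: 4 steps -> 5 draft tokens.
print(spec_flags(4)["speculative-num-draft-tokens"])  # 5
```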
Benchmark
We tested against the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-input-len 18000 --random-output-len 4000 --num-prompts 1
Qwen3-32B 3_5K-1_5K 50ms on A3 2 Cards Mixed Mode
Model: Qwen3-32B | Hardware: Atlas 800I A3 | Cards: 2 | Deploy Mode: PD Mixed | Dataset: random | Input/Output Length: 3.5K+1.5K | TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 78 \
--disable-radix-cache --speculative-draft-model-quantization unquant \
--chunked-prefill-size -1 --max-prefill-tokens 49152 \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--tp-size 4 --mem-fraction-static 0.72 --cuda-graph-bs 16 32 64 68 72 78 --dtype bfloat16
Benchmark
We tested against the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 78 --random-output-len 1500 --random-input-len 3500 --num-prompts 312 --random-range-ratio 1
Qwen3-32B 2K-2K 50ms on A3 2 Cards Mixed Mode
Model: Qwen3-32B | Hardware: Atlas 800I A3 | Cards: 2 | Deploy Mode: PD Mixed | Dataset: random | Input/Output Length: 2K+2K | TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 120 \
--disable-radix-cache --speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--chunked-prefill-size -1 --max-prefill-tokens 49152 \
--tp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 54 60 66 72 78 84 90 108 114 120 --dtype bfloat16
Benchmark
We benchmarked using the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 120 --random-output-len 2000 --random-input-len 2000 --num-prompts 480 --random-range-ratio 1
Qwen3-30B-A3B 3_5K-1_5K 50ms on A3 1 Card Mixed Mode
Model: Qwen3-30B-A3B-Instruct-2507 Hardware: Atlas 800I A3 Cards: 1 Deploy Mode: PD Mixed Dataset: random Input/Output Length: 3.5K+1.5K TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_SET_CPU_AFFINITY=1
export ASCEND_LAUNCH_BLOCKING=0
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 162 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--chunked-prefill-size -1 --max-prefill-tokens 35000 \
--tp-size 2 --mem-fraction-static 0.87 --cuda-graph-bs 1 5 15 40 70 100 120 130 140 146 150 154 156 158 160 162 \
--dtype bfloat16
Benchmark
We benchmarked using the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 156 --random-input-len 3500 --random-output-len 1500 --num-prompts 624 --random-range-ratio 1
Qwen3-Coder-480B-A35B-Instruct 3_5K-1_5K 50ms on A3 24 Cards Disaggregation Mode
Model: Qwen3-Coder-480B-A35B-Instruct Hardware: Atlas 800I A3 Cards: 24 Deploy Mode: PD Disaggregation Dataset: random Input/Output Length: 3.5K+1.5K TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_NPU_FUSED_MOE_MODE=2
MODEL_PATH=xxx
export ASCEND_MF_STORE_URL="tcp://PIP:24667"
P_IP=('PIP')
D_IP=('DIP1' 'DIP2')
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=327680
export HCCL_BUFFSIZE=1550
export TASK_QUEUE_ENABLE=2
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill \
--host ${P_IP[$i]} --port 8000 --disaggregation-bootstrap-port 8995 --trust-remote-code \
--nnodes 1 --node-rank $i --tp-size 16 --dp-size 2 --mem-fraction-static 0.7 \
--disable-radix-cache \
--attention-backend ascend --device npu --quantization modelslim --disaggregation-transfer-backend ascend \
--max-running-requests 16 --chunked-prefill-size 20480 --max-prefill-tokens 20480 \
--enable-dp-attention \
--moe-a2a-backend ascend_fuseep --dtype bfloat16 \
--disable-overlap-schedule
NODE_RANK=$i
break
fi
done
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=65536
export HCCL_BUFFSIZE=600
export SGLANG_NPU_FUSED_MOE_MODE=2
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode \
--host ${D_IP[$i]} --port 8001 --trust-remote-code \
--nnodes 2 --node-rank $i --tp-size 32 --dp-size 4 --mem-fraction-static 0.75 --max-running-requests 544 \
--attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \
--moe-a2a-backend ascend_fuseep --cuda-graph-bs 16 32 56 72 80 88 96 104 112 120 128 136 \
--dist-init-addr DIP1:5000 \
--disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
--enable-dp-lm-head --dtype bfloat16 --tokenizer-worker-num 4 --load-balance-method round_robin
NODE_RANK=$i
break
fi
done
Command
python -m sglang_router.launch_router \
--pd-disaggregation \
--policy cache_aware \
--prefill http://PIP:8000 8995 \
--decode http://DIP:8001 \
--host 127.0.0.1 \
--port 6688 \
--mini-lb
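With several prefill or decode nodes (this example launches two decode nodes, DIP1 and DIP2), the router's --prefill/--decode flags can be generated from the same IP arrays used in the launch script rather than typed by hand. A hypothetical sketch with placeholder addresses and the ports from the launch commands above (the real DIP1/DIP2 values stay site-specific):

```shell
# Build sglang_router arguments from the server IP lists (placeholder addresses).
P_IP=('10.0.0.1')               # prefill nodes: server port 8000, bootstrap port 8995
D_IP=('10.0.0.2' '10.0.0.3')    # decode nodes: server port 8001
ROUTER_ARGS=""
for ip in "${P_IP[@]}"; do ROUTER_ARGS+="--prefill http://${ip}:8000 8995 "; done
for ip in "${D_IP[@]}"; do ROUTER_ARGS+="--decode http://${ip}:8001 "; done
echo "$ROUTER_ARGS"
```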
Benchmark
We benchmarked using the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 410 --random-input-len 3500 --random-output-len 1500 --num-prompts 1640 --random-range-ratio 1 --request-rate 8
Qwen3-Coder-480B-A35B-Instruct 3_5K-1_5K 50ms on A3 16 Cards Mixed Mode
Model: Qwen3-Coder-480B-A35B-Instruct Hardware: Atlas 800I A3 Cards: 16 Deploy Mode: PD Mixed Dataset: random Input/Output Length: 3.5K+1.5K TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=72
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=1800
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
export HCCL_OP_EXPANSION_MODE="AIV"
MIX_IP=('IP1' 'IP2')
for i in "${!MIX_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${MIX_IP[$i]}" || "$LOCAL_HOST2" == "${MIX_IP[$i]}" ]];
then
echo "${MIX_IP[$i]}"
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 2 --node-rank $i \
--dist-init-addr IP1:5000 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 288 --context-length 8192 --dtype bfloat16 \
--chunked-prefill-size 114688 --max-prefill-tokens 458880 \
--disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto \
--tp 32 --dp-size 4 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.7 --cuda-graph-bs 56 64 72
NODE_RANK=$i
break
fi
done
Benchmark
We benchmarked using the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 288 --random-input-len 3500 --random-output-len 1500 --num-prompts 1152 --random-range-ratio 1 --request-rate 20
Qwen3-Coder-480B-A35B-Instruct 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode
Model: Qwen3-Coder-480B-A35B-Instruct Hardware: Atlas 800I A3 Cards: 8 Deploy Mode: PD Mixed Dataset: random Input/Output Length: 3.5K+1.5K TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=2100
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 80 --context-length 8192 --dtype bfloat16 \
--chunked-prefill-size 28672 --max-prefill-tokens 458880 \
--disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto --enable-dp-attention --enable-dp-lm-head \
--tp 16 --dp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 16 20 24
Benchmark
We benchmarked using the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 80 --random-input-len 3500 --random-output-len 1500 --num-prompts 320 --random-range-ratio 1
Qwen3-Next-80B-A3B-Instruct 3_5K-1_5K 50ms on A3 2 Cards Mixed Mode
Model: Qwen3-Next-80B-A3B-Instruct Hardware: Atlas 800I A3 Cards: 2 Deploy Mode: PD Mixed Dataset: random Input/Output Length: 3.5K+1.5K TPOT: 50ms
Model Deployment
Command
export cann_path=/usr/local/Ascend/ascend-toolkit/latest
source /usr/local/Ascend/driver/bin/setenv.bash
source ${cann_path}/../set_env.sh
source ${cann_path}/../../nnal/atb/set_env.sh
source ${cann_path}/opp/vendors/customize/bin/set_env.bash
export ASCEND_HOME_PATH=${cann_path}
source /usr/local/Ascend/8.5.0/bisheng_toolkit/set_env.sh
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_ALGO="level0:NA;level1:ring"
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=20
export HCCL_BUFFSIZE=2000
python -m sglang.launch_server \
--model-path /path/to/Qwen3-Next-80B-A3B-Instruct-W8A8-3 \
--host 127.0.0.1 \
--port 6699 \
--tp-size 4 \
--device npu \
--attention-backend ascend \
--mem-fraction-static 0.685 \
--max-running-requests 80 \
--watchdog-timeout 3600 \
--disable-radix-cache \
--cuda-graph-bs 80 \
--max-prefill-tokens 28672 --max-total-tokens 450560 \
--moe-a2a-backend deepep --deepep-mode auto \
--quantization modelslim \
--chunked-prefill-size -1
Benchmark
We benchmarked using the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 80 --random-output-len 1536 --random-input-len 3584 --num-prompts 160 --random-range-ratio 1
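Note that this section sizes the request lengths in binary-K tokens (3584 and 1536) where most other sections round to 3500/1500; both correspond to the same "3.5K+1.5K" dataset label. A quick check of the arithmetic:

```shell
# The "3.5K+1.5K" label expressed in binary-K (1K = 1024 tokens).
INPUT_LEN=$((1024 * 7 / 2))    # 3.5K
OUTPUT_LEN=$((1024 * 3 / 2))   # 1.5K
echo "${INPUT_LEN}+${OUTPUT_LEN}"   # 3584+1536
```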
Qwen3-32B 6K-1_5K 18ms on A2 8 Cards Mixed Mode
Model: Qwen3-32B Hardware: Atlas 800I A2 Cards: 8 Deploy Mode: PD Mixed Dataset: random Input/Output Length: 6K+1.5K TPOT: 18ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 32 \
--disable-radix-cache \
--chunked-prefill-size 24576 --max-prefill-tokens 65536 \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 8 16 24 32 --dtype bfloat16
Benchmark
We benchmarked using the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 32 --random-output-len 1500 --random-input-len 6000 --num-prompts 32 --random-range-ratio 1
Qwen3-32B 4K-1_5K 11ms on A2 8 Cards Mixed Mode
Model: Qwen3-32B Hardware: Atlas 800I A2 Cards: 8 Deploy Mode: PD Mixed Dataset: random Input/Output Length: 4K+1.5K TPOT: 11ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu \
--max-running-requests 32 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--chunked-prefill-size -1 --max-prefill-tokens 65536 \
--tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 1 4 6 12 18 24 30 32 --dtype bfloat16
Benchmark
We benchmarked using the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 4096 --num-prompts 4
Qwen3-32B 1K-0_3K 12ms on A3 2 Cards Mixed Mode
Model: Qwen3-32B Hardware: Atlas 800I A3 Cards: 2 Deploy Mode: PD Mixed Dataset: random Input/Output Length: 1K+0.3K TPOT: 12ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 16 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--chunked-prefill-size -1 --max-prefill-tokens 16384 \
--tp-size 4 --mem-fraction-static 0.843 --cuda-graph-bs 1 4 8 16 --dtype bfloat16
Benchmark
We benchmarked using the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 300 --random-input-len 1024 --num-prompts 16
Qwen3-32B 6K-1_5K 17ms on A3 2 Cards Mixed Mode
Model: Qwen3-32B Hardware: Atlas 800I A3 Cards: 2 Deploy Mode: PD Mixed Dataset: random Input/Output Length: 6K+1.5K TPOT: 17ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 16 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--chunked-prefill-size -1 --max-prefill-tokens 16384 \
--tp-size 4 --mem-fraction-static 0.843 --cuda-graph-bs 1 4 10 15 16 --dtype bfloat16
Benchmark
We benchmarked using the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16
Qwen3-8B 1K-0_3K 7ms on A3 1 Card Mixed Mode
Model: Qwen3-8B Hardware: Atlas 800I A3 Cards: 1 Deploy Mode: PD Mixed Dataset: random Input/Output Length: 1K+0.3K TPOT: 7ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 16 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--chunked-prefill-size -1 --max-prefill-tokens 16384 \
--tp-size 2 --mem-fraction-static 0.894 --cuda-graph-bs 1 2 4 6 9 10 15 16 --dtype bfloat16
Benchmark
We benchmarked using the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 300 --random-input-len 1024 --num-prompts 16
Qwen3-8B 6K-1_5K 12ms on A3 1 Card Mixed Mode
Model: Qwen3-8B Hardware: Atlas 800I A3 Cards: 1 Deploy Mode: PD Mixed Dataset: random Input/Output Length: 6K+1.5K TPOT: 12ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 16 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--chunked-prefill-size -1 --max-prefill-tokens 16384 \
--tp-size 2 --mem-fraction-static 0.894 --cuda-graph-bs 1 5 15 16 --dtype bfloat16
Benchmark
We benchmarked using the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16
Qwen3-32B 3_5K-1_5K 50ms on A2 8 Cards Mixed Mode
Model: Qwen3-32B Hardware: Atlas 800I A2 Cards: 8 Deploy Mode: PD Mixed Dataset: random Input/Output Length: 3.5K+1.5K TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 78 \
--disable-radix-cache --speculative-draft-model-quantization unquant \
--chunked-prefill-size -1 --max-prefill-tokens 65536 \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--tp-size 4 --mem-fraction-static 0.72 --cuda-graph-bs 1 4 8 16 32 64 68 72 78 --dtype bfloat16 --base-gpu-id 4
Benchmark
We benchmarked using the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 78 --random-output-len 1500 --random-input-len 3500 --num-prompts 312 --random-range-ratio 1
Qwen3-32B 2K-2K 50ms on A2 8 Cards Mixed Mode
Model: Qwen3-32B Hardware: Atlas 800I A2 Cards: 8 Deploy Mode: PD Mixed Dataset: random Input/Output Length: 2K+2K TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 120 \
--disable-radix-cache \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \
--chunked-prefill-size -1 --max-prefill-tokens 49152 --base-gpu-id 4 \
--tp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 54 60 66 72 78 84 90 108 114 120 --dtype bfloat16
Benchmark
We benchmarked using the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 120 --random-output-len 2000 --random-input-len 2000 --num-prompts 120 --random-range-ratio 1
Qwen3-30B-A3B 6K-1_5K 10ms on A3 1 Card Mixed Mode
Model: Qwen3-30B-A3B Hardware: Atlas 800I A3 Cards: 1 Deploy Mode: PD Mixed Dataset: random Input/Output Length: 6K+1.5K TPOT: 10ms
Model Deployment
Command
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 16 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--chunked-prefill-size -1 --max-prefill-tokens 35000 \
--tp-size 2 --mem-fraction-static 0.6 --cuda-graph-bs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 --dtype bfloat16
Benchmark
We benchmarked using the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16
Qwen3-30B-A3B 1K-0_3K 7ms on A3 1 Card Mixed Mode
Model: Qwen3-30B-A3B Hardware: Atlas 800I A3 Cards: 1 Deploy Mode: PD Mixed Dataset: random Input/Output Length: 1K+0.3K TPOT: 7ms
Model Deployment
Command
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--max-running-requests 8 \
--disable-radix-cache \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
--chunked-prefill-size -1 --max-prefill-tokens 35000 \
--tp-size 2 --mem-fraction-static 0.7 --cuda-graph-bs 1 2 3 4 5 6 7 8 --dtype bfloat16
Benchmark
We benchmarked using the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 8 --random-output-len 300 --random-input-len 1024 --num-prompts 8
Qwen3-Next 1K-0_3K 14_21ms on A3 2 Cards Mixed Mode
Model: Qwen3-Next-80B-A3B-Instruct Hardware: Atlas 800I A3 2 Cards Deploy Mode: PD Mixed Dataset: random Input/Output Length: 1K+0.3K TPOT: 14.21ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=330
export DEEPEP_NORMAL_LONG_SEQ_ROUND=5
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3000
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
export ASCEND_USE_FIA=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export SGLANG_WARMUP_TIMEOUT=3600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export FORCE_DRAFT_MODEL_NON_QUANT=1
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=2000
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--page-size 128 \
--tp-size 4 \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--watchdog-timeout 9000 \
--host 127.0.0.1 --port 6699 \
--mem-fraction-static 0.75 \
--disable-radix-cache --max-prefill-tokens 14080 --context-length 26384 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \
--chunked-prefill-size -1 --max-running-requests 312 \
--cuda-graph-bs 2 4 16 32 48 64 80 96 128 140 156 \
--mamba-ssm-dtype bfloat16 \
--base-gpu-id 0 \
--speculative-draft-model-path /home/weights/Qwen3-Next-80B-A3B-Instruct \
--quantization modelslim \
--moe-a2a-backend deepep --deepep-mode auto
Benchmark
We benchmarked against the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 16 --random-output-len 300 --random-input-len 1024 --num-prompts 16
Qwen3-Next 6K-1_5K 15_62ms on A3 2 Cards Mixed Mode
Model: Qwen3-Next-80B-A3B-Instruct Hardware: Atlas 800I A3 2 Cards Deploy Mode: PD Mixed Dataset: random Input/Output Length: 6K+1.5K TPOT: 15.62ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=330
export DEEPEP_NORMAL_LONG_SEQ_ROUND=5
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3000
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
export ASCEND_USE_FIA=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export SGLANG_WARMUP_TIMEOUT=3600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export FORCE_DRAFT_MODEL_NON_QUANT=1
MODEL_PATH=xxx
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=2000
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--page-size 128 \
--tp-size 4 \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--watchdog-timeout 9000 \
--host 127.0.0.1 --port 6699 \
--mem-fraction-static 0.75 \
--disable-radix-cache --max-prefill-tokens 14080 --context-length 26384 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \
--chunked-prefill-size -1 --max-running-requests 312 \
--cuda-graph-bs 2 4 16 32 48 64 80 96 128 140 156 \
--mamba-ssm-dtype bfloat16 \
--base-gpu-id 0 \
--speculative-draft-model-path /home/weights/Qwen3-Next-80B-A3B-Instruct \
--quantization modelslim \
--moe-a2a-backend deepep --deepep-mode auto
Benchmark
We benchmarked against the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16
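As a quick sanity check, the request shape driven by the benchmark must fit within the --context-length set in the launch command. A minimal sketch using the values from the commands above (the draft-token headroom is an assumption for illustration):

```shell
# Illustrative check: input + output (+ draft headroom) must fit the context window.
awk 'BEGIN {
  ctx = 26384   # --context-length from the launch command
  in_len = 6144 # --random-input-len
  out_len = 1500 # --random-output-len
  draft = 4     # assumed headroom for --speculative-num-draft-tokens
  need = in_len + out_len + draft
  printf "%s: need %d of %d tokens\n", (need <= ctx ? "OK" : "TOO LONG"), need, ctx
}'
```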
Qwen3-14B 3_5K-1_5K 9ms on A3 1 Card Mixed Mode
Model: Qwen3-14B Hardware: Atlas 800I A3 1 Card Deploy Mode: PD Mixed Dataset: random Input/Output Length: 3.5K+1.5K TPOT: 9ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export ASCEND_USE_FIA=0
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--disable-radix-cache --mem-fraction-static 0.8 \
--tp-size 1 --dp-size 1 \
--sampling-backend ascend --max-running-requests 8 \
--served-model-name Qwen3-14B \
--chunked-prefill-size -1 \
--cuda-graph-bs 8 \
--dtype bfloat16 \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--schedule-conservativeness 0.01
Benchmark
We benchmarked against the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 8 --random-range-ratio 1
Qwen3-14B 3_5K-1_5K 50ms on A3 1 Card Mixed Mode
Model: Qwen3-14B Hardware: Atlas 800I A3 1 Card Deploy Mode: PD Mixed Dataset: random Input/Output Length: 3.5K+1.5K TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export ASCEND_USE_FIA=0
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--disable-radix-cache --mem-fraction-static 0.89 \
--tp-size 1 --dp-size 2 \
--sampling-backend ascend --max-running-requests 144 \
--max-prefill-tokens 12288 \
--served-model-name Qwen3-14B \
--chunked-prefill-size -1 \
--cuda-graph-bs 8 16 32 44 48 50 52 \
--dtype bfloat16 \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--schedule-conservativeness 0.01
Benchmark
We benchmarked against the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 144 --random-output-len 1500 --random-input-len 3500 --num-prompts 576 --random-range-ratio 1
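At the 50ms TPOT target, aggregate decode throughput can be estimated from concurrency alone: at steady state each concurrent request emits roughly one token per TPOT. A back-of-envelope sketch using the values above (illustrative, not a measured result):

```shell
# Rough aggregate decode throughput: tokens/s ~= concurrency * 1000 / TPOT_ms
awk 'BEGIN {
  conc = 144    # --max-concurrency from the benchmark command
  tpot_ms = 50  # TPOT target for this configuration
  printf "%.0f tokens/s\n", conc * 1000 / tpot_ms
}'
```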
Qwen3-8B 3_5K-1_5K 50ms on A3 1 Card Mixed Mode
Model: Qwen3-8B Hardware: Atlas 800I A3 1 Card Deploy Mode: PD Mixed Dataset: random Input/Output Length: 3.5K+1.5K TPOT: 50ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=50
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--disable-radix-cache --mem-fraction-static 0.9 \
--tp-size 1 \
--max-running-requests 70 \
--max-prefill-tokens 16384 \
--served-model-name Qwen3-8B \
--chunked-prefill-size 16384 \
--cuda-graph-bs 8 12 24 36 48 51 55 60 63 64 66 68 70 \
--dtype bfloat16 \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
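The --cuda-graph-bs list above is chosen so its largest entry covers --max-running-requests. A rough check of that invariant, assuming batch sizes above the largest captured graph would fall back to eager execution:

```shell
# Illustrative check using the values from the launch command above.
MAX_RUNNING=70
GRAPH_BS="8 12 24 36 48 51 55 60 63 64 66 68 70"
max=0
for b in $GRAPH_BS; do
  if [ "$b" -gt "$max" ]; then max=$b; fi
done
if [ "$max" -ge "$MAX_RUNNING" ]; then
  echo "OK: graphs captured up to batch $max"
else
  echo "WARNING: batches above $max would run eager"
fi
```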
Benchmark
We benchmarked against the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 64 --random-output-len 1500 --random-input-len 3500 --num-prompts 256 --random-range-ratio 1
Qwen3-8B 3_5K-1_5K 5ms on A3 1 Card Mixed Mode
Model: Qwen3-8B Hardware: Atlas 800I A3 1 Card Deploy Mode: PD Mixed Dataset: random Input/Output Length: 3.5K+1.5K TPOT: 5ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
MODEL_PATH=xxx
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
python -m sglang.launch_server --model-path $MODEL_PATH \
--host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization modelslim \
--disable-radix-cache --mem-fraction-static 0.894 \
--tp-size 2 \
--max-running-requests 1 \
--max-prefill-tokens 16384 \
--served-model-name Qwen3-8B \
--chunked-prefill-size -1 \
--cuda-graph-bs 1 \
--dtype bfloat16 \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
--speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5
Benchmark
We benchmarked against the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 4 --random-range-ratio 1
Qwen3-Next 3_5K-1_5K 20ms on A3 2 Cards Mixed Mode
Model: Qwen3-Next-80B-A3B-Instruct Hardware: Atlas 800I A3 2 Cards Deploy Mode: PD Mixed Dataset: random Input/Output Length: 3.5K+1.5K TPOT: 20ms
Model Deployment
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=400
export DEEPEP_NORMAL_LONG_SEQ_ROUND=10
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048
export HCCL_OP_EXPANSION_MODE="AIV"
export TASK_QUEUE_ENABLE=1
export ASCEND_USE_FIA=1
export SGLANG_NPU_USE_MULTI_STREAM=0
export SGLANG_WARMUP_TIMEOUT=3600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export FORCE_DRAFT_MODEL_NON_QUANT=1
export HCCL_BUFFSIZE=2000
export ZBCCL_LOCAL_MEM_SIZE=60416
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export ZBCCL_BOOTSTRAP_URL=tcp://127.0.0.1:24669
export ZBCCL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
export ZBCCL_ENABLE_GRAPH=1
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
MODEL_PATH=xxx
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--page-size 128 \
--tp-size 4 --dp-size 2 \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--quantization modelslim \
--watchdog-timeout 9000 \
--host 127.0.0.1 --port 6699 \
--mem-fraction-static 0.85 \
--disable-radix-cache --max-prefill-tokens 28672 --context-length 26384 --max-total-tokens 122304 \
--enable-dp-attention --enable-dp-lm-head \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \
--chunked-prefill-size -1 --max-running-requests 16 \
--cuda-graph-bs 2 4 8 \
--mamba-ssm-dtype bfloat16 \
--speculative-draft-model-path /path/to/Qwen3-Next-80B-A3B-Instruct
Benchmark
We benchmarked against the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 1
