This section lists best-practice configurations, and the performance they reach, for mainstream LLMs such as DeepSeek and Qwen on Ascend NPUs. If you run into problems or have questions, please open an issue.

DeepSeek Series Models

Low Latency

| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 6K+1.6K | 20ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.9K+1K | 19ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.5K+1.5K | 19ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.5K+1K | 19ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-V3.2 | Atlas 800I A3 | 32 | PD Disaggregation | 128K+1K | 26ms | W8A8 INT8 | Optimal Configuration |

High Throughput

| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1 | Atlas 800I A3 | 32 | PD Disaggregation | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 24 | PD Disaggregation | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 8 | PD Mixed | 2K+2K | 50ms | W4A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 16 | PD Disaggregation | 2K+2K | 50ms | W4A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W4A8 INT8 | Optimal Configuration |
| DeepSeek-R1 | Atlas 800I A3 | 16 | PD Disaggregation | 3.5K+1.5K | 50ms | W4A8 INT8 | Optimal Configuration |

Qwen Series Models

Low Latency

| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 11K+1K | 10ms | BF16 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 4 | PD Mixed | 6K+1.5K | 18ms | BF16 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 4 | PD Mixed | 4K+1.5K | 11ms | BF16 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 8 | PD Mixed | 18K+4K | 6ms | BF16 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 6K+1.5K | 18ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 4K+1.5K | 11ms | BF16 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 1K+0.3K | 12ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 6K+1.5K | 17ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 1K+0.3K | 7ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 6K+1.5K | 12ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 5ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-30B-A3B | Atlas 800I A3 | 1 | PD Mixed | 6K+1.5K | 10ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-30B-A3B | Atlas 800I A3 | 1 | PD Mixed | 1K+0.3K | 7ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Next-A3B-Instruct | Atlas 800I A3 | 2 | PD Mixed | 1K+0.3K | 14.21ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Next-A3B-Instruct | Atlas 800I A3 | 2 | PD Mixed | 6K+1.5K | 15.62ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Next-A3B-Instruct | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 20ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-14B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 9ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-27B | Atlas 800I A3 | 2 | PD Mixed | 3.5K+1.5K | 20ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-27B | Atlas 800I A3 | 1 | PD Mixed | 16K+1K | 20ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-27B | Atlas 800I A3 | 1 | PD Mixed | 64K+1K | 20ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-397B-A17B | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 22ms | W4A8 | Optimal Configuration |

High Throughput

| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-235B-A22B | Atlas 800I A3 | 24 | PD Disaggregation | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-235B-A22B | Atlas 800I A3 | 8 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-235B-A22B | Atlas 800I A3 | 16 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A3 | 2 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-30B-A3B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Coder-480B-A35B-Instruct | Atlas 800I A3 | 24 | PD Disaggregation | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Coder-480B-A35B-Instruct | Atlas 800I A3 | 16 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Coder-480B-A35B-Instruct | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-Next-80B-A3B-Instruct | Atlas 800I A3 | 2 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-32B | Atlas 800I A2 | 8 | PD Mixed | 2K+2K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-14B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-27B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-27B | Atlas 800I A3 | 2 | PD Mixed | 16K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-27B | Atlas 800I A3 | 2 | PD Mixed | 64K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.5-397B-A17B | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W4A8 | Optimal Configuration |

MiniMax Series Models

Low Latency

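| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MiniMax-M2.5 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 20ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 8 | PD Mixed | 128K+1K | 20ms | W8A8 INT8 | Optimal Configuration |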

High Throughput

| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MiniMax-M2.5 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 8 | PD Mixed | 64K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 8 | PD Mixed | 128K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 4 | PD Mixed | 64K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 16 | PD Disaggregation | 64K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 16 | PD Disaggregation | 128K+1K | 50ms | W8A8 INT8 | Optimal Configuration |

Kimi Series Models

Low Latency

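| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Kimi-K2.5-w4a8 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 20ms | W4A8 INT8 | Optimal Configuration |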

High Throughput

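| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Kimi-K2.5-w4a8 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W4A8 INT8 | Optimal Configuration |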

GLM Series Models

High Throughput

| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GLM-5.1 | Atlas 800I A3 | 16 | PD Mixed | 3.5K+1.5K | 41ms | W4A8 | Optimal Configuration |
| GLM-5.1 | Atlas 800I A3 | 32 | PD Disaggregation | 16K+1K | 23ms | W4A8 | Optimal Configuration |
| GLM-5.1 | Atlas 800I A3 | 48 | PD Disaggregation | 64K+1K+90% cache hit | 45ms | W4A8 | Optimal Configuration |
| GLM-5.1 | Atlas 800I A3 | 48 | PD Disaggregation | 128K+1K+90% cache hit | 32ms | W4A8 | Optimal Configuration |

Optimal Configuration

DeepSeek-R1 3_5K-1_5K 50ms on A3 32 Cards Disaggregation Mode

Model: DeepSeek-R1; Hardware: Atlas 800I A3, 32 cards; Deploy mode: PD Disaggregation; Dataset: random; Input/output length: 3.5K+1.5K; TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export SGLANG_NPU_USE_MULTI_STREAM=1

export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669"

P_IP=('your prefill ip1' 'your prefill ip2')

D_IP=('your decode ip1' 'your decode ip2')

MODEL_PATH=xxx

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export SGLANG_USE_AG_AFTER_QLORA=1
        export HCCL_BUFFSIZE=800
        export TASK_QUEUE_ENABLE=2
        export SGLANG_NPU_FUSED_MOE_MODE=2
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=131072

        export HCCL_SOCKET_IFNAME=lo
        export GLOO_SOCKET_IFNAME=lo
        python -m sglang.launch_server --model-path ${MODEL_PATH}  --disaggregation-mode prefill --host ${P_IP[$i]} \
        --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
        --tp-size 16 --mem-fraction-static 0.778 --attention-backend ascend --device npu --quantization modelslim \
        --disaggregation-transfer-backend ascend --max-running-requests 16 --disable-radix-cache \
        --chunked-prefill-size -1 --max-prefill-tokens 60000 --moe-a2a-backend ascend_fuseep --deepep-mode normal \
        --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2  \
        --dp-size 4 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export HCCL_BUFFSIZE=600
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
        export TASK_QUEUE_ENABLE=1
        export SGLANG_NPU_FUSED_MOE_MODE=1
        export SGLANG_LM_HEAD_TP=8
        export HCCL_SOCKET_IFNAME=xxx
        export GLOO_SOCKET_IFNAME=xxx
        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
        --port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 \
        --mem-fraction-static 0.82 --max-running-requests 1024 --attention-backend ascend --device npu --quantization modelslim \
        --moe-a2a-backend ascend_fuseep --enable-dp-attention --deepep-mode low_latency --moe-dense-tp 1 \
        --cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
        --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2  \
        --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \
        --load-balance-method round_robin
        NODE_RANK=$i
        break
    fi
done
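
The two loops above choose a role by comparing the first two addresses reported by `hostname -I` with the entries of `P_IP` and `D_IP`. The sketch below isolates that lookup in a function so it can be checked before anything is launched. The function name `find_node_rank` and the sample addresses are illustrative assumptions; they are not part of SGLang.

```shell
#!/usr/bin/env bash
# find_node_rank LOCAL_IPS... -- CANDIDATES...
# Prints the index of the first candidate that equals one of the local IPs.
# Returns 1 when this host does not appear in the candidate list.
find_node_rank() {
    local -a local_ips=() candidates=()
    local seen_sep=0 arg
    for arg in "$@"; do
        if [[ "$arg" == "--" && $seen_sep -eq 0 ]]; then
            seen_sep=1
        elif (( seen_sep )); then
            candidates+=("$arg")
        else
            local_ips+=("$arg")
        fi
    done
    local i ip
    for i in "${!candidates[@]}"; do
        for ip in "${local_ips[@]}"; do
            if [[ "$ip" == "${candidates[$i]}" ]]; then
                echo "$i"
                return 0
            fi
        done
    done
    return 1
}

# Example with made-up addresses; on a real node pass the output of `hostname -I`.
D_IP=('192.0.2.10' '192.0.2.11')
if rank=$(find_node_rank 10.0.0.5 192.0.2.11 -- "${D_IP[@]}"); then
    echo "decode node rank: ${rank}"
else
    echo "this host is not a decode node"
fi
```

A host that matches neither list gets a non-zero return code, which makes it easy to stop with a clear message instead of silently launching nothing.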

Command
export SGLANG_DP_ROUND_ROBIN=1
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://P_IP:8000 8998 \
    --prefill http://P_IP:8000 8999 \
    --decode http://D_IP:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --mini-lb
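
Loading the model can take a while, so it may help to wait until the router answers before starting a benchmark. The following is a minimal sketch of such a wait loop. The helper names `wait_until` and `probe_health` are hypothetical, and the `/health` path is an assumption about the endpoint being probed; adjust it if your deployment exposes a different readiness URL.

```shell
#!/usr/bin/env bash
# wait_until TRIES DELAY CMD...
# Reruns CMD until it succeeds or TRIES attempts have been used.
wait_until() {
    local tries=$1 delay=$2
    shift 2
    local n
    for (( n = 1; n <= tries; n++ )); do
        if "$@"; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}

# probe_health HOST PORT
# Succeeds when the endpoint returns a 2xx status (the /health path is assumed).
probe_health() {
    curl -sf -o /dev/null "http://$1:$2/health"
}

# Usage (not executed here): wait up to 60 x 5 s for the router from the command above.
#   wait_until 60 5 probe_health 127.0.0.1 6688 && echo "router is up"
```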

Benchmark

We benchmarked this configuration with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 1024 --random-input-len 3584 --random-output-len 1536 --num-prompts 7168 --random-range-ratio 1 --request-rate 40

DeepSeek-R1 2K-2K 50ms on A3 24 Cards Disaggregation Mode

Model: DeepSeek-R1; Hardware: Atlas 800I A3, 24 cards; Deploy mode: PD Disaggregation; Dataset: random; Input/output length: 2K+2K; TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1

export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669"

P_IP=('your prefill ip1')
D_IP=('your decode ip1' 'your decode ip2')

MODEL_PATH=xxx

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export HCCL_BUFFSIZE=1600
        export TASK_QUEUE_ENABLE=2
        export SGLANG_USE_AG_AFTER_QLORA=1
        export HCCL_SOCKET_IFNAME=lo
        export GLOO_SOCKET_IFNAME=lo

        python -m sglang.launch_server --model-path ${MODEL_PATH}  --disaggregation-mode prefill --host ${P_IP[$i]} \
        --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
        --tp-size 16 --mem-fraction-static 0.8 --attention-backend ascend --device npu --quantization modelslim \
        --disaggregation-transfer-backend ascend --max-running-requests 20 --context-length 8192 --disable-radix-cache \
        --chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \
        --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2  \
        --dp-size 4 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export HCCL_BUFFSIZE=800
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=102
        export TASK_QUEUE_ENABLE=1
        export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1
        export SGLANG_NPU_FUSED_MOE_MODE=1
        export HCCL_SOCKET_IFNAME=xxx
        export GLOO_SOCKET_IFNAME=xxx

        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
        --port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 \
        --mem-fraction-static 0.81 --max-running-requests 1088 --attention-backend ascend --device npu --quantization modelslim \
        --moe-a2a-backend ascend_fuseep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \
        --cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
        --speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3  \
        --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \
        --load-balance-method round_robin
        NODE_RANK=$i
        break
    fi
done
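
The decode settings above appear to be related by simple arithmetic: `--max-running-requests 1088` spread over `--dp-size 32` gives 34 requests per DP rank, which equals the largest `--cuda-graph-bs` value, and 34 requests times `--speculative-num-draft-tokens 3` gives 102, the value of `SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK`. This is an observation about the numbers in this script, not a documented rule, and the helper name `per_rank_batch` is hypothetical. A small sketch of the calculation:

```shell
#!/usr/bin/env bash
# per_rank_batch MAX_RUNNING_REQUESTS DP_SIZE
# Ceiling division: requests each DP rank must be able to hold.
per_rank_batch() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Values taken from the decode command above.
max_running=1088
dp_size=32
draft_tokens=3

batch=$(per_rank_batch "$max_running" "$dp_size")
echo "requests per DP rank: ${batch}"
echo "tokens per rank per decode step: $(( batch * draft_tokens ))"
```

When changing the concurrency or the DP size, recomputing these two numbers is a reasonable first check before tuning further.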

Command
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://P_IP:8000 8998 \
    --decode http://D_IP:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --mini-lb

Benchmark

We benchmarked this configuration with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang \
--host 127.0.0.1 \
--port 6688 \
--max-concurrency 1088 \
--random-input-len 2048 \
--random-output-len 2048 \
--num-prompts 12800 \
--random-range-ratio 1 \
--request-rate 24

DeepSeek-R1 6K-1_6K 20ms on A3 32 Cards Disaggregation Mode

Model: DeepSeek-R1; Hardware: Atlas 800I A3, 32 cards; Deploy mode: PD Disaggregation; Dataset: random; Input/output length: 6K+1.6K; TPOT: 20ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669"

P_IP=('your prefill ip1' 'your prefill ip2')

D_IP=('your decode ip1' 'your decode ip2')

MODEL_PATH=xxx

export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export HCCL_BUFFSIZE=1536
        export TASK_QUEUE_ENABLE=2

        export HCCL_SOCKET_IFNAME=lo
        export GLOO_SOCKET_IFNAME=lo
        python -m sglang.launch_server --model-path ${MODEL_PATH}  --disaggregation-mode prefill --host ${P_IP[$i]} \
        --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
        --tp-size 16 --mem-fraction-static 0.81 --attention-backend ascend --device npu --quantization modelslim \
        --disaggregation-transfer-backend ascend --max-running-requests 4 --disable-radix-cache \
        --chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \
        --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2  \
        --dp-size 2 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export HCCL_BUFFSIZE=650
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=16
        export TASK_QUEUE_ENABLE=1
        export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1
        export HCCL_SOCKET_IFNAME=xxx
        export GLOO_SOCKET_IFNAME=xxx
        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
        --port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 8 \
        --mem-fraction-static 0.75 --max-running-requests 32 --attention-backend ascend --device npu --quantization modelslim \
        --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \
        --cuda-graph-bs 2 4 6 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 \
        --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4  \
        --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \
        --load-balance-method round_robin
        NODE_RANK=$i
        break
    fi
done
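
The script above starts by setting the CPU governor and three kernel parameters. A read-only check such as the sketch below can confirm that those values are actually in effect on each node before the servers start. The function names are hypothetical, and a parameter that the running kernel does not expose is reported as `unavailable` rather than guessed.

```shell
#!/usr/bin/env bash
# check_setting NAME EXPECTED ACTUAL
# Prints one status line and returns 1 on a mismatch.
check_setting() {
    if [[ "$2" == "$3" ]]; then
        echo "ok   $1=$3"
    else
        echo "FAIL $1=$3 (expected $2)"
        return 1
    fi
}

# read_sysctl KEY
# Prints the current value, or "unavailable" if it cannot be read.
read_sysctl() {
    sysctl -n "$1" 2>/dev/null || echo "unavailable"
}

# verify_host_tuning
# Compares the live values with the ones the deployment script sets.
verify_host_tuning() {
    local status=0
    check_setting vm.swappiness 0 "$(read_sysctl vm.swappiness)" || status=1
    check_setting kernel.numa_balancing 0 "$(read_sysctl kernel.numa_balancing)" || status=1
    check_setting kernel.sched_migration_cost_ns 50000 \
        "$(read_sysctl kernel.sched_migration_cost_ns)" || status=1
    local gov=/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    if [[ -r "$gov" ]]; then
        check_setting cpu0.scaling_governor performance "$(cat "$gov")" || status=1
    fi
    return "$status"
}

# Usage (not executed here):
#   verify_host_tuning || echo "host tuning differs from the documented values"
```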

Command
export SGLANG_DP_ROUND_ROBIN=1
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://P_IP:8000 8998 \
    --prefill http://P_IP:8000 8999 \
    --decode http://D_IP:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --mini-lb

Benchmark

We benchmarked this configuration with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 32 \
    --random-input-len 6000 \
    --random-output-len 1600 \
    --num-prompts 32 \
    --random-range-ratio 1 \
    --request-rate 16

DeepSeek-R1 3_9K-1K 19ms on A3 32 Cards Disaggregation Mode

Model: DeepSeek-R1; Hardware: Atlas 800I A3, 32 cards; Deploy mode: PD Disaggregation; Dataset: random; Input/output length: 3.9K+1K; TPOT: 19ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24669"

P_IP=('your prefill ip1' 'your prefill ip2')
D_IP=('your decode ip1' 'your decode ip2')

MODEL_PATH=xxx

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export HCCL_BUFFSIZE=1536
        export TASK_QUEUE_ENABLE=2
        export HCCL_SOCKET_IFNAME=lo
        export GLOO_SOCKET_IFNAME=lo

        python -m sglang.launch_server --model-path ${MODEL_PATH}  --disaggregation-mode prefill --host ${P_IP[$i]} \
        --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
        --tp-size 16 --mem-fraction-static 0.81 --attention-backend ascend --device npu --quantization modelslim \
        --disaggregation-transfer-backend ascend --max-running-requests 4 --context-length 8192 --disable-radix-cache \
        --chunked-prefill-size -1 --max-prefill-tokens 28680 --moe-a2a-backend deepep --deepep-mode normal \
        --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2  \
        --dp-size 2 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16 --enable-attn-tp-input-scattered
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export HCCL_BUFFSIZE=650
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=12
        export TASK_QUEUE_ENABLE=1
        export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1
        export HCCL_SOCKET_IFNAME=xxx
        export GLOO_SOCKET_IFNAME=xxx
        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
        --port 8001 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 16 \
        --mem-fraction-static 0.75 --max-running-requests 32 --attention-backend ascend --device npu --quantization modelslim \
        --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head --moe-dense-tp 1 \
        --cuda-graph-bs 2 4 6 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
        --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4  \
        --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16 \
        --load-balance-method round_robin
        NODE_RANK=$i
        break
    fi
done
Command
export SGLANG_DP_ROUND_ROBIN=1
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://P_IP:8000 8998 \
    --prefill http://P_IP:8000 8999 \
    --decode http://D_IP:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --mini-lb

Benchmark

We benchmarked this configuration with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 32 \
    --random-input-len 3900 \
    --random-output-len 1024 \
    --num-prompts 32 \
    --random-range-ratio 1 \
    --request-rate 16

DeepSeek-R1 3_5K-1_5K 19ms on A3 32 Cards Disaggregation Mode

Model: DeepSeek-R1; Hardware: Atlas 800I A3, 32 cards; Deploy mode: PD Disaggregation; Dataset: random; Input/output length: 3.5K+1.5K; TPOT: 19ms

Model Deployment

Use the same deployment as DeepSeek-R1 3_9K-1K 19ms on A3 32 Cards Disaggregation Mode.

Benchmark

We benchmarked this configuration with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 32 \
    --random-input-len 3500 \
    --random-output-len 1500 \
    --num-prompts 32 \
    --random-range-ratio 1 \
    --request-rate 16

DeepSeek-R1 3_5K-1K 19ms on A3 32 Cards Disaggregation Mode

Model: DeepSeek-R1; Hardware: Atlas 800I A3, 32 cards; Deploy mode: PD Disaggregation; Dataset: random; Input/output length: 3.5K+1K; TPOT: 19ms

Model Deployment

Use the same deployment as DeepSeek-R1 3_9K-1K 19ms on A3 32 Cards Disaggregation Mode.

Benchmark

We benchmarked this configuration with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 32 \
    --random-input-len 3500 \
    --random-output-len 1024 \
    --num-prompts 32 \
    --random-range-ratio 1 \
    --request-rate 16

DeepSeek-R1 2K-2K 50ms on A3 8 Cards Mixed Mode

Model: DeepSeek-R1; Hardware: Atlas 800I A3, 8 cards; Deploy mode: PD Mixed; Dataset: random; Input/output length: 2K+2K; TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200

export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo

export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=88
export HCCL_BUFFSIZE=1600
export DEEPEP_NORMAL_LONG_SEQ_ROUND=10
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=512

MODEL_PATH=xxx

export SGLANG_NPU_USE_MLAPO=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_USE_FIA_NZ=1

python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--tp 16 \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--quantization modelslim \
--watchdog-timeout 9000 \
--host 127.0.0.1 --port 6699 \
--cuda-graph-bs 4 8 20 21 22 \
--mem-fraction-static 0.78 \
--max-running-requests 352 \
--disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 1500 \
--moe-a2a-backend deepep --deepep-mode auto \
--enable-dp-attention --dp-size 16 --enable-dp-lm-head \
--speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 \
--dtype bfloat16
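
Once the server started by the command above is listening on 127.0.0.1:6699, a single request is a quick way to confirm that it generates text before running the full benchmark. The sketch below builds a body for SGLang's native `/generate` endpoint and sends it with `curl`. The helper names `build_payload` and `smoke_test` and the sample prompt are illustrative assumptions, and the payload builder does no JSON escaping, so keep the prompt free of quotes and backslashes.

```shell
#!/usr/bin/env bash
# build_payload PROMPT MAX_NEW_TOKENS
# Prints a JSON request body; PROMPT must not contain double quotes or backslashes.
build_payload() {
    printf '{"text": "%s", "sampling_params": {"max_new_tokens": %d, "temperature": 0}}' "$1" "$2"
}

# smoke_test HOST PORT
# Sends one short generation request and prints the raw response.
smoke_test() {
    local host=$1 port=$2
    curl -s "http://${host}:${port}/generate" \
        -H 'Content-Type: application/json' \
        -d "$(build_payload 'The capital of France is' 16)"
}

# Usage (not executed here):
#   smoke_test 127.0.0.1 6699
```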

Benchmark

We benchmarked this configuration with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 352  --random-input-len 2048 --random-output-len 2048 --num-prompts 1408 --random-range-ratio 1

DeepSeek-R1 2K-2K 50ms on A3 16 Cards Disaggregation Mode

Model: DeepSeek-R1; Hardware: Atlas 800I A3, 16 cards; Deploy mode: PD Disaggregation; Dataset: random; Input/output length: 2K+2K; TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32

export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667"

P_IP=('your prefill ip1')

D_IP=('your decode ip1')

MODEL_PATH=xxx

export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export ENABLE_MOE_NZ=1

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export HCCL_BUFFSIZE=2600
        export TASK_QUEUE_ENABLE=2

        export HCCL_SOCKET_IFNAME=lo
        export GLOO_SOCKET_IFNAME=lo
        python -m sglang.launch_server --model-path ${MODEL_PATH}  --disaggregation-mode prefill --host ${P_IP[$i]} \
        --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
        --tp-size 16 --mem-fraction-static 0.7 --attention-backend ascend --device npu --quantization modelslim \
        --disaggregation-transfer-backend ascend --max-running-requests 32 --context-length 8192  --disable-radix-cache \
        --chunked-prefill-size -1 --max-prefill-tokens 10240 --moe-a2a-backend deepep --deepep-mode normal \
        --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2  \
        --dp-size 8 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export HCCL_BUFFSIZE=900
        export SGLANG_DP_ROUND_ROBIN=1
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=112
        export TASK_QUEUE_ENABLE=1
        export HCCL_SOCKET_IFNAME=xxx
        export GLOO_SOCKET_IFNAME=xxx
        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
        --port 8001 --trust-remote-code --nnodes 1 --node-rank 0 --tp-size 16 --dp-size 16 \
        --mem-fraction-static 0.8 --max-running-requests 448 --attention-backend ascend --device npu --quantization modelslim \
        --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head \
        --cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
        --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4  \
        --disable-shared-experts-fusion --dtype bfloat16 --tokenizer-worker-num 4 \
        --load-balance-method round_robin
        NODE_RANK=$i
        break
    fi
done

Command
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://P_IP:8000 8998 \
    --decode http://D_IP:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --mini-lb

Benchmark

We benchmarked this configuration with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 448  --random-input-len 2048 --random-output-len 2048 --num-prompts 1792 --random-range-ratio 1 --request-rate 32

DeepSeek-R1 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode

Model: DeepSeek-R1; Hardware: Atlas 800I A3, 8 cards; Deploy mode: PD Mixed; Dataset: random; Input/output length: 3.5K+1.5K; TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=56
export HCCL_BUFFSIZE=1200
export DEEPEP_NORMAL_LONG_SEQ_ROUND=10
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=512
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_USE_FIA_NZ=1

MODEL_PATH=xxx

python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
--tp 16 \
--trust-remote-code \
--attention-backend ascend \
--device npu \
--quantization modelslim \
--watchdog-timeout 9000 \
--host 127.0.0.1 --port 6699 \
--cuda-graph-bs 4 8 12 14 \
--mem-fraction-static 0.77 \
--max-running-requests 224 \
--context-length 8188  --disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 3000 \
--moe-a2a-backend deepep --deepep-mode auto \
--enable-dp-attention --dp-size 16 --enable-dp-lm-head \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--dtype bfloat16

Benchmark

We tested it based on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 224  --random-input-len 3500 --random-output-len 1500 --num-prompts 896 --random-range-ratio 1
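
As a rough aid for reading the TPOT target, the sketch below converts a per-token latency into tokens per second. The helper names `tpot_to_tps` and `aggregate_tps` are hypothetical and not part of SGLang; the aggregate figure is an idealized upper bound that assumes every concurrent request decodes at exactly the target TPOT and ignores prefill time.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of SGLang), integer arithmetic only.

# Tokens per second for a single request at a given TPOT in milliseconds.
tpot_to_tps() {
    local tpot_ms=$1
    echo $(( 1000 / tpot_ms ))
}

# Idealized aggregate decode throughput if all concurrent requests
# decode at exactly the TPOT target.
aggregate_tps() {
    local tpot_ms=$1 concurrency=$2
    echo $(( concurrency * 1000 / tpot_ms ))
}

tpot_to_tps 50          # per-request tokens/s at a 50 ms TPOT
aggregate_tps 50 224    # bound for --max-concurrency 224
```

Measured throughput from `sglang.bench_serving` will normally be lower than this bound, since prefill work and scheduling gaps also consume time.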

DeepSeek-R1 3_5K-1_5K 50ms on A3 16 Cards Disaggregation Mode

Model: Deepseek R1 Hardware: Atlas 800I A3 16Card DeployMode: PD Disaggregation Dataset: random Input Output Length: 3.5K+1.5K TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32

export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667"

P_IP=('your prefill ip1')

D_IP=('your decode ip1')

MODEL_PATH=xxx

export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export ENABLE_MOE_NZ=1

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export HCCL_BUFFSIZE=3500
        export TASK_QUEUE_ENABLE=2

        export HCCL_SOCKET_IFNAME=lo
        export GLOO_SOCKET_IFNAME=lo
        python -m sglang.launch_server --model-path ${MODEL_PATH}  --disaggregation-mode prefill --host ${P_IP[$i]} \
        --port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
        --tp-size 16 --mem-fraction-static 0.62 --attention-backend ascend --device npu --quantization modelslim \
        --disaggregation-transfer-backend ascend --max-running-requests 32 --context-length 8192  --disable-radix-cache \
        --chunked-prefill-size -1 --max-prefill-tokens 20480 --moe-a2a-backend deepep --deepep-mode normal \
        --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2  \
        --dp-size 8 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export HCCL_BUFFSIZE=800
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=78
        export TASK_QUEUE_ENABLE=1
        export HCCL_SOCKET_IFNAME=xxx
        export GLOO_SOCKET_IFNAME=xxx
        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
        --port 8001 --trust-remote-code --nnodes 1 --node-rank 0 --tp-size 16 --dp-size 16 \
        --mem-fraction-static 0.805 --max-running-requests 416 --attention-backend ascend --device npu --quantization modelslim \
        --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head \
        --cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
        --speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3  \
        --disable-shared-experts-fusion --dtype bfloat16 --tokenizer-worker-num 4 \
        --load-balance-method round_robin
        NODE_RANK=$i
        break
    fi
done
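
The two loops in the script above decide which role a host plays by comparing its first two addresses from `hostname -I` against the `P_IP` and `D_IP` arrays. A minimal sketch of that matching logic, factored into a hypothetical `find_rank` helper (not part of the original script) so it can be tried without launching a server; the addresses are documentation-only examples:

```shell
#!/usr/bin/env bash
# find_rank is a hypothetical helper, not part of the original script.
# Usage: find_rank LOCAL_IP1 LOCAL_IP2 NODE_IP...
# Prints the index of the first NODE_IP equal to either local address and
# returns 1 if this host does not appear in the list.
find_rank() {
    local local1=$1 local2=$2
    shift 2
    local nodes=("$@")
    local i
    for i in "${!nodes[@]}"; do
        if [[ "$local1" == "${nodes[$i]}" || "$local2" == "${nodes[$i]}" ]]; then
            echo "$i"
            return 0
        fi
    done
    return 1
}

# Example with placeholder addresses.
D_IP=('192.0.2.10' '192.0.2.11')
find_rank '192.0.2.11' '10.0.0.5' "${D_IP[@]}"   # index of this host in D_IP
```

If neither local address matches an entry, neither loop launches anything, so a non-zero return from a helper like this is a quick way to spot a wrong IP list.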

Command
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://P_IP:8000 8998 \
    --decode http://D_IP:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --mini-lb
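
The `P_IP` and `D_IP` placeholders in the router command correspond to the arrays defined in the deployment script. The sketch below assembles the `--prefill` and `--decode` arguments from those arrays and prints the resulting command instead of running it. It assumes, as in the script above, that each array entry is an independent single-node server, that prefill servers use port 8000 with bootstrap port 8998+i, and that decode servers use port 8001; `router_args` is a hypothetical helper, and repeating `--prefill` or `--decode` for several servers is an assumption to verify against your router version.

```shell
#!/usr/bin/env bash
# Placeholder addresses; replace with the real prefill and decode IPs.
P_IP=('192.0.2.1')
D_IP=('192.0.2.2')

# router_args is a hypothetical helper, not part of sglang_router.
# It prints the --prefill/--decode arguments as one space-separated line.
router_args() {
    local args=() i
    for i in "${!P_IP[@]}"; do
        # Bootstrap port follows the prefill loop: 8998 + index.
        args+=(--prefill "http://${P_IP[$i]}:8000" "$((8998 + i))")
    done
    for i in "${!D_IP[@]}"; do
        args+=(--decode "http://${D_IP[$i]}:8001")
    done
    printf '%s\n' "${args[*]}"
}

# Print the full command for inspection rather than executing it.
echo python -m sglang_router.launch_router --pd-disaggregation \
    --policy cache_aware $(router_args) --host 127.0.0.1 --port 6688 --mini-lb
```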

Benchmark

We tested it based on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 416  --random-input-len 3500 --random-output-len 1500 --num-prompts 1664 --random-range-ratio 1

DeepSeek-V3.2 128K-1K 26ms on A3 32 Cards Disaggregation Mode

Model: DeepSeek-V3.2-W8A8 Hardware: Atlas 800I A3 32Card DeployMode: PD Disaggregation Dataset: random Input Output Length: 128K+1K TPOT: 26ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH}
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24670"

P_IP=('your prefill ip1' 'your prefill ip2')
D_IP=('your decode ip1' 'your decode ip2')
MODEL_PATH=xxx

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export HCCL_BUFFSIZE=1200
        export TASK_QUEUE_ENABLE=2
        export HCCL_SOCKET_IFNAME=xxx
        export GLOO_SOCKET_IFNAME=xxx

        python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
        --tp 32 \
        --trust-remote-code \
        --attention-backend ascend \
        --device npu \
        --watchdog-timeout 9000 \
        --host ${P_IP[$i]} --port 8000 \
        --mem-fraction-static 0.73 \
        --disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 68000 \
        --max-running-requests 1 \
        --moe-a2a-backend deepep --deepep-mode normal \
        --quantization modelslim \
        --disaggregation-transfer-backend ascend \
        --disaggregation-mode prefill \
        --disable-cuda-graph \
        --nnodes 2 --node-rank $i \
        --disaggregation-bootstrap-port 8995 \
        --moe-dense-tp-size 1 \
        --enable-dsa-prefill-context-parallel \
        --dsa-prefill-cp-mode in-seq-split \
        --attn-cp-size 32 \
        --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
        --dist-init-addr ${P_IP[0]}:10000
        break
    fi
done


# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1

        export TASK_QUEUE_ENABLE=0
        export SGLANG_SCHEDULER_SKIP_ALL_GATHER=1

        export HCCL_SOCKET_IFNAME=xxx
        export GLOO_SOCKET_IFNAME=xxx

        DP=8
        export HCCL_BUFFSIZE=400
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=8

        python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
        --tp 32 \
        --dp ${DP} \
        --ep 32 \
        --moe-dense-tp-size 1 \
        --enable-dp-attention \
        --enable-dp-lm-head \
        --trust-remote-code \
        --attention-backend ascend \
        --device npu \
        --watchdog-timeout 9000 \
        --host ${D_IP[$i]} --port 8001 \
        --mem-fraction-static 0.79 \
        --disable-radix-cache \
        --chunked-prefill-size -1 --max-prefill-tokens 68000 \
        --max-running-requests 32 \
        --cuda-graph-max-bs 4 \
        --moe-a2a-backend deepep \
        --deepep-mode low_latency \
        --quantization modelslim \
        --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
        --disaggregation-transfer-backend ascend \
        --disaggregation-mode decode \
        --nnodes 2 --node-rank $i \
        --dist-init-addr ${D_IP[0]}:10000
        break
    fi
done

Command
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://P_IP1:8000 8995 \
    --decode http://D_IP1:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --mini-lb

Benchmark

We tested it based on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 8  --random-input-len 131076 --random-output-len 1024 --num-prompts 8 --random-range-ratio 1

Qwen3-235B-A22B 3_5K-1_5K 50ms on A3 24 Cards Disaggregation Mode

Model: Qwen3-235B-A22B-W8A8 Hardware: Atlas 800I A3 24Card DeployMode: PD Disaggregation Dataset: random Input Output Length: 3.5K+1.5K TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_DP_ROUND_ROBIN=1
export SGLANG_NPU_FUSED_MOE_MODE=2

MODEL_PATH=xxx
export ASCEND_MF_STORE_URL="tcp://your prefill ip1:24667"
P_IP=('your prefill ip1')
D_IP=('your decode ip1' 'your decode ip2')

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"


for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        source /usr/local/Ascend/ascend-toolkit/set_env.sh
        source /usr/local/Ascend/nnal/atb/set_env.sh
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=188416
        export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024
        export DEEPEP_NORMAL_LONG_SEQ_ROUND=16
        export HCCL_BUFFSIZE=4300
        export TASK_QUEUE_ENABLE=2
        export HCCL_SOCKET_IFNAME=lo
        export GLOO_SOCKET_IFNAME=lo
        export STREAMS_PER_DEVICE=32

        # Prefill
        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill \
        --host ${P_IP[$i]} --port 8000 --disaggregation-bootstrap-port 8995 --trust-remote-code \
        --nnodes 1 --node-rank $i --tp-size 16 --dp-size 16 --mem-fraction-static 0.6 \
        --disable-radix-cache \
        --attention-backend ascend --device npu --quantization modelslim --disaggregation-transfer-backend ascend \
        --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
        --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
        --speculative-draft-model-quantization unquant \
        --max-running-requests 128 --chunked-prefill-size 94208 --max-prefill-tokens 262144 \
        --enable-dp-attention  \
        --moe-a2a-backend ascend_fuseep --dtype bfloat16
        NODE_RANK=$i
        break
    fi
done


for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        source /usr/local/Ascend/ascend-toolkit/set_env.sh
        source /usr/local/Ascend/nnal/atb/set_env.sh
        export DP_ROUND_ROBIN=1
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=65536
        export HCCL_BUFFSIZE=800
        export HCCL_SOCKET_IFNAME=data0.3001
        export GLOO_SOCKET_IFNAME=data0.3001
        export STREAMS_PER_DEVICE=32

        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode \
        --host ${D_IP[$i]} --port 8001 --trust-remote-code \
        --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --mem-fraction-static 0.83 --max-running-requests 768 \
        --attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \
        --moe-a2a-backend ascend_fuseep --cuda-graph-bs 6 8 12 15 18 20 22 24 \
        --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
        --speculative-draft-model-quantization unquant \
        --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
        --dist-init-addr xxx:5000 \
        --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
        --enable-dp-lm-head --dtype bfloat16 --tokenizer-worker-num 4 \
        --load-balance-method round_robin
        NODE_RANK=$i
        break
    fi
done

Command
export SGLANG_DP_ROUND_ROBIN=1
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://PIP:8000 8995 \
    --decode http://DIP:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --mini-lb

Benchmark

We tested it based on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang-oai --host 127.0.0.1 --port 6688 --max-concurrency 860 --random-input-len 3500 --random-output-len 1500 --num-prompts 3440 --random-range-ratio 1

Qwen3-235B-A22B 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode

Model: Qwen3-235B-A22B-W8A8 Hardware: Atlas 800I A3 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=570
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100

export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=188416
export SGLANG_NPU_FUSED_MOE_MODE=2

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu --quantization modelslim  \
    --max-running-requests 432 --context-length 8192 --dtype bfloat16 \
    --chunked-prefill-size 94208 --max-prefill-tokens 458880 --sampling-backend ascend \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
    --disable-radix-cache --moe-a2a-backend ascend_fuseep --speculative-draft-model-quantization unquant \
    --tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.8 --cuda-graph-bs 1 2 4 8 16 20 24 26 27

Benchmark

We tested it based on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 272 --random-input-len 3500 --random-output-len 1500 --num-prompts 1088 --random-range-ratio 1
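
In the launch command above, `--max-running-requests 432` divided by `--dp-size 16` gives 27, which equals the largest value in `--cuda-graph-bs`. The same relation holds for the other DP-attention configurations on this page, so it can serve as a quick consistency check when adapting a configuration. This is a pattern observed in these examples rather than a documented rule, and the helper names below are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a sanity check; not part of SGLang.

# Requests each DP rank must hold when the server is fully loaded.
per_rank_batch() {
    local max_running=$1 dp_size=$2
    echo $(( max_running / dp_size ))
}

# Largest batch size in a --cuda-graph-bs list.
max_graph_bs() {
    local max=0 bs
    for bs in "$@"; do
        if (( bs > max )); then
            max=$bs
        fi
    done
    echo "$max"
}

# Values taken from the launch command above.
per_rank_batch 432 16
max_graph_bs 1 2 4 8 16 20 24 26 27
```

If the two printed numbers differ after you change `--max-running-requests` or `--dp-size`, the largest captured graph may no longer cover a fully loaded rank.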

Qwen3-235B-A22B 2K-2K 50ms on A3 8 Cards Mixed Mode

Model: Qwen3-235B-A22B-W8A8 Hardware: Atlas 800I A3 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 2K+2K TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

MODEL_PATH=xxx

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=450
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=147456
export SGLANG_NPU_FUSED_MOE_MODE=2

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu --quantization modelslim  \
    --max-running-requests 624 --context-length 8192 --dtype bfloat16 \
    --chunked-prefill-size 73728 --max-prefill-tokens 458880 --speculative-draft-model-quantization unquant  \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
    --disable-radix-cache --moe-a2a-backend ascend_fuseep \
    --tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.83 --cuda-graph-bs 4 8 16 24 28 29 30 32 34 36 37 38 39

Benchmark

We tested it based on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 480 --random-input-len 2048 --random-output-len 2048 --num-prompts 480 --random-range-ratio 1

Qwen3-235B-A22B 2K-2K 50ms on A3 16 Cards Mixed Mode

Model: Qwen3-235B-A22B-W8A8 Hardware: Atlas 800I A3 16Card DeployMode: PD Mixed Dataset: random Input Output Length: 2K+2K TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

MODEL_PATH=xxx

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=1600
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
export HCCL_OP_EXPANSION_MODE="AIV"

MIX_IP=('IP1' 'IP2')

for i in "${!MIX_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${MIX_IP[$i]}" || "$LOCAL_HOST2" == "${MIX_IP[$i]}" ]];
    then
        echo "${MIX_IP[$i]}"
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1

        python -m sglang.launch_server --model-path ${MODEL_PATH} \
        --host 127.0.0.1 --port 7439 --trust-remote-code \
        --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --mem-fraction-static 0.8 --max-running-requests 768 \
        --attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \
        --moe-a2a-backend deepep --deepep-mode auto --cuda-graph-bs 6 8 10 12 18 24 \
        --dist-init-addr ${MIX_IP[0]}:5000 --chunked-prefill-size 131072 --max-prefill-tokens 458880 \
        --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx --speculative-draft-model-quantization unquant \
        --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
        --context-length 8192 --disable-radix-cache \
        --enable-dp-lm-head --dtype bfloat16
        NODE_RANK=$i
        break
    fi
done

Benchmark

We tested it based on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 768 --random-input-len 2000 --random-output-len 2000 --num-prompts 768 --random-range-ratio 1

Qwen3-235B-A22B 11K-1K 10ms on A3 8 Cards Mixed Mode

Model: Qwen3-235B-A22B-W8A8 Hardware: Atlas 800I A3 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 11K+1K TPOT: 10ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=1600
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu --quantization modelslim  \
    --max-running-requests 1  --dtype bfloat16 \
    --chunked-prefill-size -1 --max-prefill-tokens 16384 --speculative-draft-model-quantization unquant  \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
    --disable-radix-cache --enable-dp-lm-head \
    --tp 16 --mem-fraction-static 0.78 --cuda-graph-bs 1

Benchmark

We tested it based on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 1 --random-input-len 11000 --random-output-len 1000 --num-prompts 1 --random-range-ratio 1
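
Every speculative decoding configuration on this page uses `--speculative-eagle-topk 1` and sets `--speculative-num-draft-tokens` to `--speculative-num-steps` plus one (here 4 steps and 5 draft tokens). A minimal sketch of a check for that relation, assuming it is the intended pairing when top-k is 1; `check_spec_args` is a hypothetical helper and not an SGLang command.

```shell
#!/usr/bin/env bash
# check_spec_args is a hypothetical helper, not part of SGLang.
# Usage: check_spec_args NUM_STEPS EAGLE_TOPK NUM_DRAFT_TOKENS
# With topk == 1 it expects NUM_DRAFT_TOKENS == NUM_STEPS + 1, the
# pairing used by the configurations on this page.
check_spec_args() {
    local steps=$1 topk=$2 draft=$3
    if (( topk == 1 )) && (( draft != steps + 1 )); then
        echo "mismatch: expected $(( steps + 1 )) draft tokens, got $draft"
        return 1
    fi
    echo "ok"
}

# Values from the launch command above.
check_spec_args 4 1 5
```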

Qwen3-32B 6K-1_5K 18ms on A3 4 Cards Mixed Mode

Model: Qwen3-32B Hardware: Atlas 800I A3 4Card DeployMode: PD Mixed Dataset: random Input Output Length: 6K+1.5K TPOT: 18ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu \
    --max-running-requests 32 \
    --disable-radix-cache \
    --chunked-prefill-size 24576 --max-prefill-tokens 65536 \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
    --tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 8 16 24 32  --dtype bfloat16

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 32 --random-output-len 1500 --random-input-len 6000 --num-prompts 32 --random-range-ratio 1

Qwen3-32B 4K-1_5K 11ms on A3 4 Cards Mixed Mode

Model: Qwen3-32B Hardware: Atlas 800I A3 4Card DeployMode: PD Mixed Dataset: random Input Output Length: 4K+1.5K TPOT: 11ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu   \
    --max-running-requests 1 \
    --disable-radix-cache \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
    --chunked-prefill-size 24576 --max-prefill-tokens 65536  \
    --tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 1 --dtype bfloat16

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 4096 --num-prompts 4

Qwen3-32B 18K-4K 6ms on A3 8 Cards Mixed Mode

Model: Qwen3-32B Hardware: Atlas 800I A3 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 18K+4K TPOT: 6ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu   \
    --max-running-requests 1 \
    --disable-radix-cache --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
    --chunked-prefill-size -1 --max-prefill-tokens 65536  \
    --tp-size 16 --mem-fraction-static 0.72 --cuda-graph-bs 1 --dtype bfloat16

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-input-len 18000 --random-output-len 4000 --num-prompts 1

Qwen3-32B 3_5K-1_5K 50ms on A3 2 Cards Mixed Mode

Model: Qwen3-32B Hardware: Atlas 800I A3 2Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH


MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu  --quantization modelslim  \
    --max-running-requests 78 \
    --disable-radix-cache --speculative-draft-model-quantization unquant \
    --chunked-prefill-size -1 --max-prefill-tokens 49152  \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
    --tp-size 4  --mem-fraction-static 0.72 --cuda-graph-bs 16 32 64 68 72 78 --dtype bfloat16

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 78 --random-output-len 1500 --random-input-len 3500 --num-prompts 312 --random-range-ratio 1
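
The benchmark above sends 312 prompts at a concurrency of 78, that is, four prompts per concurrent slot. A minimal sketch of that sizing rule follows; `num_prompts` is a hypothetical helper rather than part of sglang, and the factor of 4 is only the pattern observed in this command, not a documented requirement.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of sglang): size a benchmark run as a fixed
# number of "rounds" per concurrent slot. The default of 4 rounds mirrors the
# command above (78 concurrent requests, 312 prompts).
num_prompts() {
    local concurrency=$1
    local rounds=${2:-4}
    echo $(( concurrency * rounds ))
}

num_prompts 78     # prints 312
num_prompts 120    # prints 480
```

The result can be passed to `--num-prompts` when the concurrency level is changed, so that every slot still processes several requests.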

Qwen3-32B 2K-2K 50ms on A3 2 Cards Mixed Mode

Model: Qwen3-32B Hardware: Atlas 800I A3 2Card DeployMode: PD Mixed Dataset: random Input Output Length: 2K+2K TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu  --quantization modelslim  \
    --max-running-requests 120 \
    --disable-radix-cache --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
    --chunked-prefill-size -1 --max-prefill-tokens 49152 \
    --tp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 54 60 66 72 78 84 90 108 114 120 --dtype bfloat16
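
The launch command above pairs `--speculative-num-steps 3` with `--speculative-num-draft-tokens 4` under `--speculative-eagle-topk 1`; other commands in this document pair 4 steps with 5 draft tokens. A small pre-flight check for that pairing might look like the sketch below. `check_spec_args` is a hypothetical name, and the "steps + 1" rule is inferred from the values used here rather than taken from the sglang reference.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not an sglang command): with top-k 1 the
# draft is a single chain, and the launch commands in this document pair
# N draft steps with N + 1 draft tokens.
check_spec_args() {
    local steps=$1 topk=$2 draft_tokens=$3
    if (( topk == 1 && draft_tokens != steps + 1 )); then
        echo "mismatch: expected $(( steps + 1 )) draft tokens for ${steps} steps" >&2
        return 1
    fi
    echo "ok"
}

check_spec_args 3 1 4    # values from the command above; prints "ok"
```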

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 120 --random-output-len 2000 --random-input-len 2000 --num-prompts 480 --random-range-ratio 1

Qwen3-30B-A3B 3_5K-1_5K 50ms on A3 1 Card Mixed Mode

Model: Qwen3-30B-A3B-Instruct-2507 Hardware: Atlas 800I A3 1Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

MODEL_PATH=xxx

export SGLANG_SET_CPU_AFFINITY=1
export ASCEND_LAUNCH_BLOCKING=0
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu  --quantization modelslim  \
    --max-running-requests 162 \
    --disable-radix-cache \
    --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
    --chunked-prefill-size -1 --max-prefill-tokens 35000 \
    --tp-size 2 --mem-fraction-static 0.87 --cuda-graph-bs 1 5 15 40 70 100 120 130 140 146 150 154 156 158 160 162 \
    --dtype bfloat16
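
The command above uses `--tp-size 2` for a 1-card A3 deployment, and the other A3 configurations in this document follow the same ratio (2 cards with `--tp-size 4`, 8 cards with `--tp 16`). The sketch below encodes that relationship. The figure of 2 devices per A3 card and 1 per A2 card is an assumption inferred from these configurations, not taken from a hardware specification, and `total_devices` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Assumption inferred from the configurations in this document: each
# Atlas 800I A3 card is addressed as 2 NPU devices, each A2 card as 1.
total_devices() {
    local hardware=$1 cards=$2
    case "$hardware" in
        A3) echo $(( cards * 2 )) ;;
        A2) echo "$cards" ;;
        *)  echo "unknown hardware: $hardware" >&2; return 1 ;;
    esac
}

total_devices A3 1    # prints 2, matching --tp-size 2 above
total_devices A3 2    # prints 4
```

The value is an upper bound for the product of the parallel sizes on one deployment; a launch may use fewer devices than are available.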

Benchmark

We tested it based on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 156 --random-input-len 3500 --random-output-len 1500 --num-prompts 624 --random-range-ratio 1

Qwen3-Coder-480B-A35B-Instruct 3_5K-1_5K 50ms on A3 24 Cards Disaggregation Mode

Model: Qwen3-Coder-480B-A35B-Instruct Hardware: Atlas 800I A3 24Card DeployMode: PD Disaggregation Dataset: random Input Output Length: 3.5K+1.5K TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_NPU_FUSED_MOE_MODE=2

MODEL_PATH=xxx
export ASCEND_MF_STORE_URL="tcp://PIP:24667"
P_IP=('PIP')
D_IP=('DIP1' 'DIP2')
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"


for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        source /usr/local/Ascend/ascend-toolkit/set_env.sh
        source /usr/local/Ascend/nnal/atb/set_env.sh
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=327680
        export HCCL_BUFFSIZE=1550
        export TASK_QUEUE_ENABLE=2
        export HCCL_SOCKET_IFNAME=lo
        export GLOO_SOCKET_IFNAME=lo

        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill \
        --host ${P_IP[$i]} --port 8000 --disaggregation-bootstrap-port 8995 --trust-remote-code \
        --nnodes 1 --node-rank $i --tp-size 16 --dp-size 2 --mem-fraction-static 0.7 \
        --disable-radix-cache \
        --attention-backend ascend --device npu --quantization modelslim --disaggregation-transfer-backend ascend \
        --max-running-requests 16 --chunked-prefill-size 20480 --max-prefill-tokens 20480 \
        --enable-dp-attention  \
        --moe-a2a-backend ascend_fuseep --dtype bfloat16 \
        --disable-overlap-schedule
        NODE_RANK=$i
        break
    fi
done

for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        source /usr/local/Ascend/ascend-toolkit/set_env.sh
        source /usr/local/Ascend/nnal/atb/set_env.sh
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=65536
        export HCCL_BUFFSIZE=600
        export SGLANG_NPU_FUSED_MOE_MODE=2
        export HCCL_SOCKET_IFNAME=xxx
        export GLOO_SOCKET_IFNAME=xxx

        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode \
        --host ${D_IP[$i]} --port 8001 --trust-remote-code \
        --nnodes 2 --node-rank $i --tp-size 32 --dp-size 4 --mem-fraction-static 0.75 --max-running-requests 544 \
        --attention-backend ascend --device npu --quantization modelslim --enable-dp-attention \
        --moe-a2a-backend ascend_fuseep --cuda-graph-bs 16 32 56 72 80 88 96 104 112 120 128 136 \
        --dist-init-addr DIP1:5000 \
        --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
        --enable-dp-lm-head --dtype bfloat16 --tokenizer-worker-num 4 --load-balance-method round_robin
        NODE_RANK=$i
        break
    fi
done
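
Both loops above pick the launch command by matching the host's addresses against an IP list and using the matching index as `--node-rank`. The same lookup can be sketched as a reusable function. `find_node_rank` is a hypothetical name, and the `192.0.2.x` addresses are documentation placeholders rather than values from this deployment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the index of the first listed node IP that is
# one of this host's addresses, or return 1 if the host is not listed.
# Usage: find_node_rank "<space-separated local addresses>" ip0 ip1 ...
find_node_rank() {
    local local_addrs=" $1 "
    shift
    local rank=0 ip
    for ip in "$@"; do
        if [[ "$local_addrs" == *" $ip "* ]]; then
            echo "$rank"
            return 0
        fi
        rank=$(( rank + 1 ))
    done
    return 1
}

# Placeholder addresses; on a real node pass "$(hostname -I)" instead.
D_IP=('192.0.2.10' '192.0.2.11')
find_node_rank "10.0.0.5 192.0.2.11" "${D_IP[@]}"    # prints 1
```

Unlike comparing only the first two fields of `hostname -I`, this checks every local address, which may help on hosts with more than two interfaces.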

Command
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://PIP:8000 8995 \
    --decode http://DIP1:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --mini-lb

Benchmark

We tested it based on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --max-concurrency 410 --random-input-len 3500 --random-output-len 1500 --num-prompts 1640 --random-range-ratio 1 --request-rate 8

Qwen3-Coder-480B-A35B-Instruct 3_5K-1_5K 50ms on A3 16 Cards Mixed Mode

Model: Qwen3-Coder-480B-A35B-Instruct Hardware: Atlas 800I A3 16Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=72
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

MODEL_PATH=xxx

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=1800
export HCCL_SOCKET_IFNAME=xxx
export GLOO_SOCKET_IFNAME=xxx
export HCCL_OP_EXPANSION_MODE="AIV"

MIX_IP=('IP1' 'IP2')

for i in "${!MIX_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${MIX_IP[$i]}" || "$LOCAL_HOST2" == "${MIX_IP[$i]}" ]];
    then
        echo "${MIX_IP[$i]}"

        python -m sglang.launch_server --model-path $MODEL_PATH \
        --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 2 --node-rank $i  \
        --dist-init-addr IP1:5000 \
        --attention-backend ascend --device npu --quantization modelslim  \
        --max-running-requests 288 --context-length 8192 --dtype bfloat16  \
        --chunked-prefill-size 114688 --max-prefill-tokens 458880  \
        --disable-radix-cache --moe-a2a-backend deepep  --deepep-mode auto  \
        --tp 32 --dp-size 4 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.7 --cuda-graph-bs 56 64 72
        NODE_RANK=$i
        break
    fi
done
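
In the launch command above, `--max-running-requests 288` with `--dp-size 4` works out to 72 requests per data-parallel rank, which is also the largest `--cuda-graph-bs` value. The arithmetic is sketched below. `per_rank_batch` is a hypothetical helper, and it assumes requests are spread evenly across DP ranks, which simplifies real scheduling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: requests per data-parallel rank, rounded up.
# Assumes an even spread of running requests across DP ranks.
per_rank_batch() {
    local max_running=$1 dp_size=$2
    echo $(( (max_running + dp_size - 1) / dp_size ))
}

per_rank_batch 288 4    # prints 72, the largest --cuda-graph-bs value above
```

When either flag is changed, the captured graph batch sizes probably need to be extended to cover the new per-rank value.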

Benchmark

We tested it based on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 288 --random-input-len 3500 --random-output-len 1500 --num-prompts 1152 --random-range-ratio 1 --request-rate 20

Qwen3-Coder-480B-A35B-Instruct 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode

Model: Qwen3-Coder-480B-A35B-Instruct Hardware: Atlas 800I A3 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

MODEL_PATH=xxx

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=2100
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu --quantization modelslim  \
    --max-running-requests 80 --context-length 8192 --dtype bfloat16 \
    --chunked-prefill-size 28672 --max-prefill-tokens 458880  \
    --disable-radix-cache --moe-a2a-backend deepep  --deepep-mode auto --enable-dp-attention --enable-dp-lm-head \
    --tp 16 --dp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 16 20 24

Benchmark

We tested it based on the RANDOM dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 80 --random-input-len 3500 --random-output-len 1500 --num-prompts 320 --random-range-ratio 1
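
The server in this section is started with `--context-length 8192`, and each benchmark request uses 3500 input tokens plus 1500 output tokens. A quick sanity check that a chosen workload fits the context window could look like this sketch; `fits_context` is a hypothetical helper, not an sglang utility.

```shell
#!/usr/bin/env bash
# Hypothetical check: a request's prompt plus generated tokens must fit in
# the server's --context-length.
fits_context() {
    local input_len=$1 output_len=$2 context_len=$3
    (( input_len + output_len <= context_len ))
}

if fits_context 3500 1500 8192; then
    echo "3500 + 1500 tokens fit within 8192"
fi
```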

Qwen3-Next-80B-A3B-Instruct 3_5K-1_5K 50ms on A3 2 Cards Mixed Mode

Model: Qwen3-Next-80B-A3B-Instruct Hardware: Atlas 800I A3 2Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 50ms

Model Deployment

Command
export cann_path=/usr/local/Ascend/ascend-toolkit/latest
source /usr/local/Ascend/driver/bin/setenv.bash
source ${cann_path}/../set_env.sh
source ${cann_path}/../../nnal/atb/set_env.sh
source ${cann_path}/opp/vendors/customize/bin/set_env.bash
export ASCEND_HOME_PATH=${cann_path}
source /usr/local/Ascend/8.5.0/bisheng_toolkit/set_env.sh

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1
export LD_LIBRARY_PATH=/usr/local/Ascend/cann-9.0.0/opp/vendors/custom_transformer/op_api/lib:${LD_LIBRARY_PATH}

export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo

export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_ALGO="level0:NA;level1:ring"

export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=330
export ASCEND_USE_FIA=1
export SGLANG_NPU_USE_MULTI_STREAM=0
export SGLANG_WARMUP_TIMEOUT=3600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export FORCE_DRAFT_MODEL_NON_QUANT=1

ZBAL_HCCL_OP="allreduce,_allgather_base,allgather,broadcast,scatter,reduce_scatter,_reduce_scatter_base,alltoall_base"
export HCCL_BUFFSIZE=64
export SGLANG_ZBAL_LOCAL_MEM_SIZE=59648
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export SGLANG_ZBAL_BOOTSTRAP_URL="tcp://127.0.0.1:24669"

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
export ZBAL_ENABLE_GRAPH=1
MODEL_PATH=/home/weights/Qwen3-Next-80B-A3B-Instruct-W8A8

python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
    --page-size 128 \
    --tp-size 4 \
    --trust-remote-code \
    --attention-backend ascend \
    --device npu \
    --watchdog-timeout 9000 \
    --host 127.0.0.1 --port 6699 \
    --mem-fraction-static 0.75 \
    --disable-radix-cache --max-prefill-tokens 14080 --context-length 26384 \
    --chunked-prefill-size -1 --max-running-requests 300 \
    --mamba-ssm-dtype bfloat16 \
    --quantization modelslim \
    --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4  --speculative-draft-model-quantization  unquant \
    --speculative-draft-model-path /home/weights/Qwen3-Next-80B-A3B-Instruct \
    --dp-size 2 --enable-dp-attention --enable-dp-lm-head \
    --moe-a2a-backend deepep --deepep-mode auto \
    --cuda-graph-bs 1 2 3 4 5 6 7 8 10 12 14 16 18 20 22 24 26 28 30 32 40 44 48 52 56 60 64 72 80 88 96 104 112 120 128 136 144 150

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --max-concurrency 300 --random-output-len 1536 --random-input-len 3584 --num-prompts 300 --random-range-ratio 1

Qwen3-32B 6K-1_5K 18ms on A2 8 Cards Mixed Mode

Model: Qwen3-32B Hardware: Atlas 800I A2 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 6K+1.5K TPOT: 18ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7439 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu  --quantization modelslim  \
    --max-running-requests 32 \
    --disable-radix-cache \
    --chunked-prefill-size 24576 --max-prefill-tokens 65536 \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
    --tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 8 16 24 32 --dtype bfloat16

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7439 --max-concurrency 32 --random-output-len 1500 --random-input-len 6000 --num-prompts 32 --random-range-ratio 1
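
A TPOT (time per output token) target also implies a rough ceiling on aggregate decode throughput: with 32 concurrent streams at 18 ms per token, the server emits at most about 1000 × 32 / 18 tokens per second. The sketch below does that arithmetic. `decode_tokens_per_sec` is a hypothetical helper, and the estimate ignores prefill time and scheduling gaps, so measured throughput will be lower.

```shell
#!/usr/bin/env bash
# Rough upper bound on aggregate decode throughput (tokens/s) implied by a
# TPOT target: each of `concurrency` streams emits one token every `tpot_ms`.
# Ignores prefill time and scheduling gaps; integer division truncates.
decode_tokens_per_sec() {
    local tpot_ms=$1 concurrency=$2
    echo $(( 1000 * concurrency / tpot_ms ))
}

decode_tokens_per_sec 18 32    # prints 1777
```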

Qwen3-32B 4K-1_5K 11ms on A2 8 Cards Mixed Mode

Model: Qwen3-32B Hardware: Atlas 800I A2 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 4K+1.5K TPOT: 11ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu   \
    --max-running-requests 32 \
    --disable-radix-cache \
    --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx  \
    --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
    --chunked-prefill-size -1 --max-prefill-tokens 65536  \
    --tp-size 8 --mem-fraction-static 0.72 --cuda-graph-bs 1 4 6 12 18 24 30 32 --dtype bfloat16

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving  --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 4096 --num-prompts 4

Qwen3-32B 1K-0_3K 12ms on A3 2 Cards Mixed Mode

Model: Qwen3-32B Hardware: Atlas 800I A3 2Card DeployMode: PD Mixed Dataset: random Input Output Length: 1K+0.3K TPOT: 12ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu --quantization modelslim \
    --max-running-requests 16 \
    --disable-radix-cache \
    --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
    --chunked-prefill-size -1 --max-prefill-tokens 16384  \
    --tp-size 4 --mem-fraction-static 0.843 --cuda-graph-bs 1 4 8 16 --dtype bfloat16

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving  --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 300 --random-input-len 1024 --num-prompts 16

Qwen3-32B 6K-1_5K 17ms on A3 2 Cards Mixed Mode

Model: Qwen3-32B Hardware: Atlas 800I A3 2Card DeployMode: PD Mixed Dataset: random Input Output Length: 6K+1.5K TPOT: 17ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu --quantization modelslim \
    --max-running-requests 16 \
    --disable-radix-cache \
    --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
    --chunked-prefill-size -1 --max-prefill-tokens 16384  \
    --tp-size 4 --mem-fraction-static 0.843 --cuda-graph-bs 1 4 10 15 16 --dtype bfloat16

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving  --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16

Qwen3-8B 1K-0_3K 7ms on A3 1 Card Mixed Mode

Model: Qwen3-8B Hardware: Atlas 800I A3 1Card DeployMode: PD Mixed Dataset: random Input Output Length: 1K+0.3K TPOT: 7ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu --quantization modelslim \
    --max-running-requests 16 \
    --disable-radix-cache \
    --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
    --chunked-prefill-size -1 --max-prefill-tokens 16384  \
    --tp-size 2 --mem-fraction-static 0.894 --cuda-graph-bs 1 2 4 6 9 10 15 16 --dtype bfloat16

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving  --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 300 --random-input-len 1024 --num-prompts 16

Qwen3-8B 6K-1_5K 12ms on A3 1 Card Mixed Mode

Model: Qwen3-8B Hardware: Atlas 800I A3 1Card DeployMode: PD Mixed Dataset: random Input Output Length: 6K+1.5K TPOT: 12ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu --quantization modelslim \
    --max-running-requests 16 \
    --disable-radix-cache \
    --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
    --chunked-prefill-size -1 --max-prefill-tokens 16384  \
    --tp-size 2 --mem-fraction-static 0.894 --cuda-graph-bs 1 5 15 16 --dtype bfloat16

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving  --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16

Qwen3-32B 3_5K-1_5K 50ms on A2 8 Cards Mixed Mode

Model: Qwen3-32B Hardware: Atlas 800I A2 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu \
    --max-running-requests 78 \
    --disable-radix-cache --speculative-draft-model-quantization unquant \
    --chunked-prefill-size -1 --max-prefill-tokens 65536  \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
    --tp-size 4  --mem-fraction-static 0.72 --cuda-graph-bs 1 4 8 16 32 64 68 72 78 --dtype bfloat16 --base-gpu-id 4

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 78 --random-output-len 1500 --random-input-len 3500 --num-prompts 312 --random-range-ratio 1

Qwen3-32B 2K-2K 50ms on A2 8 Cards Mixed Mode

Model: Qwen3-32B Hardware: Atlas 800I A2 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 2K+2K TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu \
    --max-running-requests 120 \
    --disable-radix-cache \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \
    --chunked-prefill-size -1 --max-prefill-tokens 49152 --base-gpu-id 4 \
    --tp-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 54 60 66 72 78 84 90 108 114 120 --dtype bfloat16

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 120 --random-output-len 2000 --random-input-len 2000 --num-prompts 120 --random-range-ratio 1
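
To compare a run against the 50ms TPOT target, the benchmark output can be saved to a file and checked afterwards. This is a minimal sketch, assuming the output contains a line of the form `Mean TPOT (ms): <value>`. The helper name `check_tpot` and the log file are hypothetical.

```shell
# Hypothetical helper: succeed when the mean TPOT in a saved benchmark log
# is at or below the target (both in milliseconds).
# Assumption: the log has a line such as "Mean TPOT (ms):    48.52".
check_tpot() {
    log_file="$1"; target_ms="$2"
    awk -v target="$target_ms" '
        /Mean TPOT \(ms\)/ { value = $NF; found = 1 }   # last field is the number
        END {
            if (!found) exit 2                          # no TPOT line in the log
            exit (value + 0 <= target + 0) ? 0 : 1
        }' "$log_file"
}
```

If the benchmark output was redirected to a file such as `bench.log`, `check_tpot bench.log 50` returns 0 when the target is met, 1 when it is missed, and 2 when no TPOT line is found.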

Qwen3-30B-A3B 6K-1_5K 10ms on A3 1 Cards Mixed Mode

Model: Qwen3-30B-A3B Hardware: Atlas 800I A3 1Card DeployMode: PD Mixed Dataset: random Input Output Length: 6K+1.5K TPOT: 10ms

Model Deployment

Command
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu \
    --max-running-requests 16 \
    --disable-radix-cache \
    --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
    --chunked-prefill-size -1 --max-prefill-tokens 35000  \
    --tp-size 2 --mem-fraction-static 0.6 --cuda-graph-bs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 --dtype bfloat16

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving  --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16

Qwen3-30B-A3B 1K-0_3K 7ms on A3 1 Cards Mixed Mode

Model: Qwen3-30B-A3B Hardware: Atlas 800I A3 1Card DeployMode: PD Mixed Dataset: random Input Output Length: 1K+0.3K TPOT: 7ms

Model Deployment

Command
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=400
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7339 --trust-remote-code --nnodes 1 --node-rank 0  \
    --attention-backend ascend --device npu \
    --max-running-requests 8 \
    --disable-radix-cache \
    --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \
    --chunked-prefill-size -1 --max-prefill-tokens 35000  \
    --tp-size 2 --mem-fraction-static 0.7 --cuda-graph-bs 1 2 3 4 5 6 7 8 --dtype bfloat16

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving  --dataset-name random --backend sglang --host 127.0.0.1 --port 7339 --random-range-ratio 1 --max-concurrency 8 --random-output-len 300 --random-input-len 1024 --num-prompts 8

Qwen3-Next 1K-0_3K 14_21ms on A3 2 Cards Mixed Mode

Model: Qwen3-Next-80B-A3B-Instruct Hardware: Atlas 800I A3 2Card DeployMode: PD Mixed Dataset: random Input Output Length: 1K+0.3K TPOT: 14.21ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=330
export DEEPEP_NORMAL_LONG_SEQ_ROUND=5
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3000
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1

export ASCEND_USE_FIA=1
export SGLANG_NPU_USE_MULTI_STREAM=1

export SGLANG_WARMUP_TIMEOUT=3600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export FORCE_DRAFT_MODEL_NON_QUANT=1

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=2000
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"

python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
    --page-size 128 \
    --tp-size 4 \
    --trust-remote-code \
    --attention-backend ascend \
    --device npu \
    --watchdog-timeout 9000 \
    --host 127.0.0.1 --port 6699 \
    --mem-fraction-static 0.75 \
    --disable-radix-cache --max-prefill-tokens 14080 --context-length 26384 \
    --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4  --speculative-draft-model-quantization  unquant \
    --chunked-prefill-size -1 --max-running-requests 312 \
    --cuda-graph-bs 2 4 16 32 48 64 80 96 128 140 156 \
    --mamba-ssm-dtype bfloat16 \
    --base-gpu-id 0 \
    --speculative-draft-model-path /home/weights/Qwen3-Next-80B-A3B-Instruct \
    --moe-a2a-backend deepep --deepep-mode auto

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving  --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 16 --random-output-len 300 --random-input-len 1024 --num-prompts 16

Qwen3-Next 6K-1_5K 15_62ms on A3 2 Cards Mixed Mode

Model: Qwen3-Next-80B-A3B-Instruct Hardware: Atlas 800I A3 2Card DeployMode: PD Mixed Dataset: random Input Output Length: 6K+1.5K TPOT: 15.62ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=330
export DEEPEP_NORMAL_LONG_SEQ_ROUND=5
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3000
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1

export ASCEND_USE_FIA=1
export SGLANG_NPU_USE_MULTI_STREAM=1

export SGLANG_WARMUP_TIMEOUT=3600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export FORCE_DRAFT_MODEL_NON_QUANT=1

MODEL_PATH=xxx

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export HCCL_BUFFSIZE=2000
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export HCCL_OP_EXPANSION_MODE="AIV"

python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
    --page-size 128 \
    --tp-size 4 \
    --trust-remote-code \
    --attention-backend ascend \
    --device npu \
    --watchdog-timeout 9000 \
    --host 127.0.0.1 --port 6699 \
    --mem-fraction-static 0.75 \
    --disable-radix-cache --max-prefill-tokens 14080 --context-length 26384 \
    --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4  --speculative-draft-model-quantization  unquant \
    --chunked-prefill-size -1 --max-running-requests 312 \
    --cuda-graph-bs 2 4 16 32 48 64 80 96 128 140 156 \
    --mamba-ssm-dtype bfloat16 \
    --base-gpu-id 0 \
    --speculative-draft-model-path /home/weights/Qwen3-Next-80B-A3B-Instruct \
    --moe-a2a-backend deepep --deepep-mode auto

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving  --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 16 --random-output-len 1500 --random-input-len 6144 --num-prompts 16

Qwen3-14B 3_5K-1_5K 9ms on A3 1 Cards Mixed Mode

Model: Qwen3-14B Hardware: Atlas 800I A3 1Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 9ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

MODEL_PATH=xxx

export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export ASCEND_USE_FIA=0
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
    --attention-backend ascend --device npu \
    --disable-radix-cache --mem-fraction-static 0.8 \
    --tp-size 1 --dp-size 1 \
    --sampling-backend ascend --max-running-requests 8 \
    --served-model-name Qwen3-14B \
    --chunked-prefill-size -1 \
    --cuda-graph-bs 8 \
    --dtype bfloat16 \
    --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
    --schedule-conservativeness 0.01

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 8 --random-range-ratio 1

Qwen3-14B 3_5K-1_5K 50ms on A3 1 Cards Mixed Mode

Model: Qwen3-14B Hardware: Atlas 800I A3 1Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

MODEL_PATH=xxx

export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export ASCEND_USE_FIA=0
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
    --attention-backend ascend --device npu \
    --disable-radix-cache --mem-fraction-static 0.89 \
    --tp-size 1 --dp-size 2 \
    --sampling-backend ascend --max-running-requests 144 \
    --max-prefill-tokens 12288 \
    --served-model-name Qwen3-14B \
    --chunked-prefill-size -1 \
    --cuda-graph-bs 8 16 32 44 48 50 52 \
    --dtype bfloat16 \
    --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
    --schedule-conservativeness 0.01

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 144 --random-output-len 1500 --random-input-len 3500 --num-prompts 576 --random-range-ratio 1
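
The size of this benchmark run follows from the flags above. The sketch below works through the arithmetic, assuming that `--random-range-ratio 1` makes every request use the full input and output lengths. The variable names are illustrative.

```shell
# Values copied from the bench_serving command above.
max_concurrency=144
num_prompts=576
input_len=3500
output_len=1500

# Roughly how many full waves of concurrent requests the run issues.
waves=$(( num_prompts / max_concurrency ))
# Tokens handled over the whole run (prompt plus generated tokens).
total_tokens=$(( num_prompts * (input_len + output_len) ))

echo "waves=${waves} total_tokens=${total_tokens}"
# prints: waves=4 total_tokens=2880000
```

Requests are not issued in lockstep, so the wave count is only an approximation of how many times the server is filled to its concurrency limit.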

Qwen3-8B 3_5K-1_5K 50ms on A3 1 Cards Mixed Mode

Model: Qwen3-8B Hardware: Atlas 800I A3 1Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

MODEL_PATH=xxx

export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=50
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
    --attention-backend ascend --device npu \
    --disable-radix-cache --mem-fraction-static 0.9 \
    --tp-size 1 \
    --max-running-requests 70 \
    --max-prefill-tokens 16384 \
    --served-model-name Qwen3-8B \
    --chunked-prefill-size 16384 \
    --cuda-graph-bs 8 12 24 36 48 51 55 60 63 64 66 68 70 \
    --dtype bfloat16 \
    --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 64 --random-output-len 1500 --random-input-len 3500 --num-prompts 256 --random-range-ratio 1

Qwen3-8B 3_5K-1_5K 5ms on A3 1 Cards Mixed Mode

Model: Qwen3-8B Hardware: Atlas 800I A3 1Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 5ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

MODEL_PATH=xxx

export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

python -m sglang.launch_server --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 7239 --trust-remote-code --nnodes 1 --node-rank 0 \
    --attention-backend ascend --device npu \
    --disable-radix-cache --mem-fraction-static 0.894 \
    --tp-size 2 \
    --max-running-requests 1 \
    --max-prefill-tokens 16384 \
    --served-model-name Qwen3-8B \
    --chunked-prefill-size -1 \
    --cuda-graph-bs 1 \
    --dtype bfloat16 \
    --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 --speculative-draft-model-path xxx \
    --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 7239 --max-concurrency 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 4 --random-range-ratio 1

Qwen3-Next 3_5K-1_5K 20ms on A3 1 Cards Mixed Mode

Model: Qwen3-Next-80B-A3B-Instruct Hardware: Atlas 800I A3 1Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 20ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=400
export DEEPEP_NORMAL_LONG_SEQ_ROUND=10
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048
export HCCL_OP_EXPANSION_MODE="AIV"
export TASK_QUEUE_ENABLE=1
export ASCEND_USE_FIA=1
export SGLANG_NPU_USE_MULTI_STREAM=0
export SGLANG_WARMUP_TIMEOUT=3600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export FORCE_DRAFT_MODEL_NON_QUANT=1
export HCCL_BUFFSIZE=2000
export ZBCCL_LOCAL_MEM_SIZE=60416
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0

export ZBCCL_BOOTSTRAP_URL=tcp://127.0.0.1:24669
export ZBCCL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
export ZBCCL_ENABLE_GRAPH=1

export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo

MODEL_PATH=xxx

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
    --page-size 128 \
    --tp-size 2 \
    --trust-remote-code \
    --attention-backend ascend \
    --device npu \
    --watchdog-timeout 9000 \
    --host 127.0.0.1 --port 6699 \
    --mem-fraction-static 0.85 \
    --disable-radix-cache --max-prefill-tokens 28672 --context-length 26384 --max-total-tokens 122304 \
    --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-quantization unquant \
    --chunked-prefill-size -1 --max-running-requests 2 \
    --cuda-graph-bs 2 \
    --mamba-ssm-dtype bfloat16 \
    --speculative-draft-model-path /path/to/Qwen3-Next-80B-A3B-Instruct

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 1

Qwen3.5-27B 3_5K-1_5K 20ms on A3 2 Cards Mixed Mode

Model: Eco-Tech/Qwen3.5-27B-w8a8-mtp Hardware: Atlas 800I A3 2Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 20ms

Model Deployment

Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY

# on-demand set device
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

export ASCEND_LAUNCH_BLOCKING=1
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=3000
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_NPU_PROFILING=0
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=3600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0

MODEL_PATH=xxx

python -m sglang.launch_server --model-path ${MODEL_PATH} \
    --attention-backend ascend \
    --host 127.0.0.1 --port 6699 \
    --device npu \
    --tp-size 4 \
    --trust-remote-code \
    --watchdog-timeout 9000 \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 186000 \
    --enable-prefill-delayer \
    --prefill-delayer-max-delay-passes 200 \
    --disable-radix-cache \
    --mem-fraction-static 0.94 \
    --max-total-tokens 700000 \
    --max-running-requests 38 \
    --max-mamba-cache-size 200 \
    --quantization modelslim \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --cuda-graph-bs 1 2 4 8 12 18 24 32 34 36 38 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
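
The script above restricts the visible devices and sets `--tp-size` separately, so the two can drift apart when either is edited. A minimal sketch of a pre-launch check follows, assuming the launch needs at least as many visible devices as the tensor-parallel size. The variable names `tp_size` and `device_count` are illustrative.

```shell
# Values copied from the script above.
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
tp_size=4

# Count the comma-separated device IDs (one non-empty line per ID).
device_count=$(printf '%s\n' "$ASCEND_RT_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .)

if [ "$device_count" -lt "$tp_size" ]; then
    echo "only ${device_count} visible devices for --tp-size ${tp_size}" >&2
else
    echo "ok: ${device_count} visible devices for --tp-size ${tp_size}"
fi
```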

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 6699 --dataset-name random  --max-concurrency 38 --num-prompts 152 --random-range-ratio 1  --random-output-len 1500 --random-input-len 3500

Qwen3.5-27B 16K-1K 20ms on A3 1 Cards Mixed Mode

Model: Eco-Tech/Qwen3.5-27B-w8a8-mtp Hardware: Atlas 800I A3 1Card DeployMode: PD Mixed Dataset: random Input Output Length: 16K+1K TPOT: 20ms

Model Deployment

Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100

# on-demand set device
export ASCEND_RT_VISIBLE_DEVICES=8,9

MODEL_PATH=xxx

sglang serve --model-path ${MODEL_PATH} \
    --attention-backend ascend \
    --device npu \
    --tp-size 2 --nnodes 1 --node-rank 0 \
    --chunked-prefill-size -1 --max-prefill-tokens 65000 \
    --disable-radix-cache \
    --trust-remote-code \
    --host 127.0.0.1 --max-running-requests 32 --max-mamba-cache-size 32 \
    --mem-fraction-static 0.85 \
    --port 8001 \
    --cuda-graph-bs 2 3 4 5 6 \
    --enable-multimodal \
    --quantization modelslim \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 --mamba-ssm-dtype bfloat16 --max-total-tokens 310000 \
    --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8001 --dataset-name random  --max-concurrency 32 --num-prompts 128 --random-range-ratio 1  --random-output-len 1000 --random-input-len 16000

Qwen3.5-27B 64K-1K 20ms on A3 1 Cards Mixed Mode

Model: Eco-Tech/Qwen3.5-27B-w8a8-mtp Hardware: Atlas 800I A3 1Card DeployMode: PD Mixed Dataset: random Input Output Length: 64K+1K TPOT: 20ms

Model Deployment

Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY

# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_NPU_PROFILING=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0
# on-demand set device
export ASCEND_RT_VISIBLE_DEVICES=4,5

MODEL_PATH=xxx

python -m sglang.launch_server --model-path ${MODEL_PATH} \
        --attention-backend ascend \
        --device npu \
        --tp-size 2 --nnodes 1 --node-rank 0 \
        --chunked-prefill-size -1 --max-prefill-tokens 130000 \
        --disable-radix-cache \
        --trust-remote-code \
        --host 127.0.0.1 --max-running-requests 32 --max-mamba-cache-size 18 \
        --mem-fraction-static 0.5 \
        --port 8004 \
        --cuda-graph-bs 2 3 4 \
        --enable-multimodal \
        --quantization modelslim \
        --mm-attention-backend ascend_attn \
        --dtype bfloat16 --mamba-ssm-dtype bfloat16 --max-total-tokens 280000 \
        --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8004 --dataset-name random  --max-concurrency 9 --num-prompts 36 --random-range-ratio 1  --random-output-len 1000 --random-input-len 64000

Qwen3.5-27B 3_5K-1_5K 50ms on A3 1 Cards Mixed Mode

Model: Eco-Tech/Qwen3.5-27B-w8a8-mtp Hardware: Atlas 800I A3 1Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 50ms

Model Deployment

Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY

# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100

MODEL_PATH=xxx

python -m sglang.launch_server --model-path ${MODEL_PATH} \
    --attention-backend ascend \
    --device npu \
    --tp-size 2 --nnodes 1 --node-rank 0 \
    --chunked-prefill-size -1 --max-prefill-tokens 60000 \
    --disable-radix-cache \
    --trust-remote-code \
    --host 127.0.0.1 --max-running-requests 48 --max-mamba-cache-size 60 \
    --mem-fraction-static 0.7 \
    --port 8000 \
    --cuda-graph-bs 2 8 16 32 48 \
    --enable-multimodal \
    --quantization modelslim \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 --mamba-ssm-dtype bfloat16 \
    --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8000 --dataset-name random  --max-concurrency 48 --num-prompts 192 --random-range-ratio 1  --random-output-len 1500 --random-input-len 3500

Qwen3.5-27B 16K-1K 50ms on A3 2 Cards Mixed Mode

Model: Eco-Tech/Qwen3.5-27B-w8a8-mtp Hardware: Atlas 800I A3 2Card DeployMode: PD Mixed Dataset: random Input Output Length: 16K+1K TPOT: 50ms

Model Deployment

Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY

# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=30
# on-demand set device
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

MODEL_PATH=xxx

python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
    --attention-backend ascend \
    --device npu \
    --tp-size 4 --nnodes 1 --node-rank 0 \
    --chunked-prefill-size -1 --max-prefill-tokens 50000 \
    --disable-radix-cache \
    --trust-remote-code \
    --host 127.0.0.1 --max-running-requests 28 --max-mamba-cache-size 50 \
    --mem-fraction-static 0.7 \
    --port 8001 \
    --cuda-graph-bs 2 8 12 16 20 24 28 \
    --enable-multimodal \
    --quantization modelslim \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 --mamba-ssm-dtype bfloat16 \
    --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

Benchmark

We tested it based on the RANDOM dataset.
Command
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8001 --dataset-name random  --max-concurrency 28 --num-prompts 152 --random-range-ratio 1  --random-output-len 1000 --random-input-len 16000

Qwen3.5-27B 64K-1K 50ms on A3 2 Cards Mixed Mode

Model: Eco-Tech/Qwen3.5-27B-w8a8-mtp Hardware: Atlas 800I A3 2Card DeployMode: PD Mixed Dataset: random Input Output Length: 64K+1K TPOT: 50ms

Model Deployment

Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY

# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100
# set the visible devices as needed
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7

MODEL_PATH=xxx

python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
    --attention-backend ascend \
    --device npu \
    --tp-size 4 --nnodes 1 --node-rank 0 \
    --chunked-prefill-size -1 --max-prefill-tokens 200000 \
    --disable-radix-cache \
    --trust-remote-code \
    --host 127.0.0.1 --max-running-requests 32 --max-mamba-cache-size 22 \
    --mem-fraction-static 0.5 \
    --port 9000 \
    --cuda-graph-bs 2 4 8 11 12 13 \
    --enable-multimodal \
    --quantization modelslim \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 --mamba-ssm-dtype bfloat16 --max-total-tokens 850000 \
    --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

Benchmark

We benchmarked with the random dataset.
Command
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 9000 --dataset-name random  --max-concurrency 9 --num-prompts 36 --random-range-ratio 1  --random-output-len 1000 --random-input-len 64000
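This configuration targets a TPOT of 50ms. The following is a minimal sketch of how a benchmark summary could be checked against that target. It assumes the summary contains a line of the form "Mean TPOT (ms): <value>"; the helper names and the sample value are hypothetical and are not measured results.

```shell
#!/usr/bin/env bash
# Sketch: compare a mean TPOT value against a target budget.
# Assumption: the benchmark summary has a line like "Mean TPOT (ms):   48.21".

# Print the numeric value from the first "Mean TPOT (ms):" line read on stdin.
extract_mean_tpot() {
    awk -F':' '/Mean TPOT \(ms\)/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Return 0 when value <= target; awk does the floating-point comparison.
tpot_within_target() {
    local value="$1" target="$2"
    awk -v v="$value" -v t="$target" 'BEGIN { exit !(v + 0 <= t + 0) }'
}

# Hypothetical sample line, not a measured result.
sample='Mean TPOT (ms):                          48.21'
tpot="$(printf '%s\n' "$sample" | extract_mean_tpot)"

if tpot_within_target "$tpot" 50; then
    echo "TPOT ${tpot} ms is within the 50 ms target"
else
    echo "TPOT ${tpot} ms exceeds the 50 ms target"
fi
```

In practice the sample line would be replaced by the captured output of the benchmark command, for example by piping it through tee into a log file first.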

Qwen3.5-397B-A17B 3_5K-1_5K 22ms on A3 8 Cards Mixed Mode

Model: Qwen3.5-397B-A17B Hardware: Atlas 800I A3 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 22ms

Model Deployment

Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export ASCEND_USE_FIA=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128
export HCCL_BUFFSIZE=3000
export DEEPEP_NORMAL_LONG_SEQ_ROUND=32
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3584
export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_NPU_USE_MULTI_STREAM=1

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_ZBAL_LOCAL_MEM_SIZE=58624
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export SGLANG_ZBAL_BOOTSTRAP_URL="tcp://127.0.0.1:24669"
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
export ZBAL_ENABLE_GRAPH=1

MODEL_PATH=xxx

python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--tp-size 16 \
--chunked-prefill-size -1 --max-prefill-tokens 35000 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 --max-running-requests 160 \
--mem-fraction-static 0.8 \
--port 6699 \
--cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 \
--quantization modelslim \
--enable-multimodal --moe-a2a-backend deepep --deepep-mode auto \
--mm-attention-backend ascend_attn \
--dtype bfloat16 --mamba-ssm-dtype bfloat16 --max-total-tokens 128000 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--speculative-draft-model-quantization unquant \
--dp-size 8 --enable-dp-attention --enable-dp-lm-head \
--enable-prefill-delayer --prefill-delayer-max-delay-passes 100

Benchmark

We benchmarked with the random dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 120 --random-output-len 1500 --random-input-len 3500 --num-prompts 480
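The benchmark above sends 480 prompts at a concurrency of 120 with fixed input and output lengths. A small sketch of the resulting workload size follows. It assumes that --random-range-ratio 1 makes every request use exactly the configured lengths, and it treats "waves" as a rough estimate only; workload_summary is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: estimate the size of a fixed-length random benchmark run.
# Assumption: every request uses exactly input_len and output_len tokens.
workload_summary() {
    local num_prompts="$1" concurrency="$2" input_len="$3" output_len="$4"
    local total_input=$((num_prompts * input_len))
    local total_output=$((num_prompts * output_len))
    # Ceiling division: number of full-concurrency batches needed.
    local waves=$(( (num_prompts + concurrency - 1) / concurrency ))
    printf 'input_tokens=%d output_tokens=%d waves=%d\n' \
        "$total_input" "$total_output" "$waves"
}

# Values taken from the benchmark command above.
workload_summary 480 120 3500 1500
```

For this run the sketch prints input_tokens=1680000 output_tokens=720000 waves=4.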

Qwen3.5-397B-A17B 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode

Model: Qwen3.5-397B-A17B Hardware: Atlas 800I A3 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 50ms

Model Deployment

Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export ASCEND_USE_FIA=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128
export HCCL_BUFFSIZE=3000
export DEEPEP_NORMAL_LONG_SEQ_ROUND=32
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3584
export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_NPU_USE_MULTI_STREAM=1

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_ZBAL_LOCAL_MEM_SIZE=59648
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export SGLANG_ZBAL_BOOTSTRAP_URL="tcp://127.0.0.1:24669"
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
export ZBAL_ENABLE_GRAPH=1

MODEL_PATH=xxx

python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--tp-size 16 \
--chunked-prefill-size -1 --max-prefill-tokens 17500 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 --max-running-requests 432 \
--mem-fraction-static 0.75 \
--port 6699 \
--cuda-graph-bs 2 4 6 8 12 16 20 24 28 32 36 40 44 48 52 56 \
--quantization modelslim \
--enable-multimodal --moe-a2a-backend deepep --deepep-mode auto \
--mm-attention-backend ascend_attn \
--dtype bfloat16 --mamba-ssm-dtype bfloat16 --max-total-tokens 280000 \
--dp-size 8 --enable-dp-attention --enable-dp-lm-head \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--speculative-draft-model-quantization unquant \
--enable-prefill-delayer --prefill-delayer-max-delay-passes 200

Benchmark

We benchmarked with the random dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 352 --random-output-len 1500 --random-input-len 3500 --num-prompts 1408

MiniMax-M2.5 3_5K-1_5K Low Latency on A3 8 Cards Mixed Mode

Model: MiniMax-M2.5 Hardware: Atlas 800I A3 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo

export HCCL_OP_EXPANSION_MODE=AIV
export TASK_QUEUE_ENABLE=1

export HCCL_BUFFSIZE=1500
export ASCEND_USE_FIA=1
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=224000

MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot
EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3

python -m sglang.launch_server \
   --model-path $MODEL_PATH \
   --host 127.0.0.1 \
   --port 32001 \
   --tp-size 16 \
   --dp-size 16 \
   --enable-dp-attention \
   --mem-fraction-static 0.75 \
   --max-running-requests 128 \
   --disable-radix-cache \
   --chunked-prefill-size -1 --max-prefill-tokens 8192 \
   --cuda-graph-bs 2 4 6 8 \
   --moe-a2a-backend ascend_fuseep --deepep-mode auto --quantization modelslim \
   --speculative-algorithm EAGLE3 \
   --speculative-draft-model-path $EAGLE_MODEL_PATH \
   --speculative-num-steps 3 \
   --speculative-eagle-topk 1 \
   --speculative-num-draft-tokens 4 \
   --speculative-draft-model-quantization unquant \
   --dtype bfloat16 \
   --tokenizer-worker-num 2 \
   --prefill-delayer-max-delay-passes 500 \
   --enable-prefill-delayer

Benchmark

We benchmarked with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 32001 --random-input-len 3500 --random-output-len 1500 --num-prompts 320 --random-range-ratio 1 --max-concurrency 80

MiniMax-M2.5 128K-1K Low Latency on A3 8 Cards Mixed Mode

Model: MiniMax-M2.5 Hardware: Atlas 800I A3 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 128K+1K

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo

export TASK_QUEUE_ENABLE=1

export ASCEND_USE_FIA=1
export HCCL_BUFFSIZE=1600
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=640
export DEEPEP_NORMAL_LONG_SEQ_ROUND=64
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_NPU_DEEPEP_USE_FUSED_MOE_DECODE=1
export SGLANG_NPU_FUSEEP_DECODE_ONLY=1

MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot
EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3

python -m sglang.launch_server \
   --model-path $MODEL_PATH \
   --host 127.0.0.1 \
   --port 32000 \
   --tp-size 16 \
   --dp-size 2 \
   --enable-dp-attention \
   --prefill-delayer-max-delay-passes 100 \
   --enable-prefill-delayer \
   --mem-fraction-static 0.65 \
   --max-running-requests 8 \
   --chunked-prefill-size -1 --max-prefill-tokens 130000 \
   --cuda-graph-bs 1 2 4 \
   --moe-a2a-backend ascend_fuseep --deepep-mode auto --quantization modelslim \
   --speculative-algorithm EAGLE3 \
   --speculative-draft-model-path $EAGLE_MODEL_PATH \
   --speculative-num-steps 3 \
   --speculative-eagle-topk 1 \
   --speculative-num-draft-tokens 4 \
   --speculative-draft-model-quantization unquant \
   --dtype bfloat16 \
   --trust-remote-code \
   --tokenizer-worker-num 8

Benchmark

We benchmarked with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 32000 --random-input-len 131072 --random-output-len 1024 --num-prompts 8 --random-range-ratio 1 --max-concurrency 2

MiniMax-M2.5 3_5K-1_5K High Throughput on A3 8 Cards Mixed Mode

Model: MiniMax-M2.5 Hardware: Atlas 800I A3 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo

export HCCL_OP_EXPANSION_MODE=AIV
export TASK_QUEUE_ENABLE=1

export HCCL_BUFFSIZE=800
export ASCEND_USE_FIA=1
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=204800

MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot
EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3

python -m sglang.launch_server \
   --model-path $MODEL_PATH \
   --host 127.0.0.1 \
   --port 32001 \
   --tp-size 16 \
   --enable-dp-attention \
   --dp-size 16 \
   --mem-fraction-static 0.75 \
   --max-running-requests 480 \
   --disable-radix-cache \
   --prefill-delayer-max-delay-passes 500 \
   --enable-prefill-delayer \
   --chunked-prefill-size -1 --max-prefill-tokens 8192 \
   --cuda-graph-bs 8 16 24 32 48 64 80 \
   --moe-a2a-backend ascend_fuseep --deepep-mode auto --quantization modelslim \
   --speculative-algorithm EAGLE3 \
   --speculative-draft-model-path $EAGLE_MODEL_PATH \
   --speculative-num-steps 3 \
   --speculative-eagle-topk 1 \
   --speculative-num-draft-tokens 4 \
   --speculative-draft-model-quantization unquant \
   --dtype bfloat16

Benchmark

We benchmarked with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 32001 --random-input-len 3500 --random-output-len 1500 --num-prompts 1280 --random-range-ratio 1 --max-concurrency 320

MiniMax-M2.5 64K-1K High Throughput on A3 8 Cards Mixed Mode

Model: MiniMax-M2.5 Hardware: Atlas 800I A3 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 64K+1K

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo

export TASK_QUEUE_ENABLE=1

export ASCEND_USE_FIA=1
export HCCL_BUFFSIZE=1600
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=640
export DEEPEP_NORMAL_LONG_SEQ_ROUND=64
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_NPU_DEEPEP_USE_FUSED_MOE_DECODE=1
export SGLANG_NPU_FUSEEP_DECODE_ONLY=1

MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot
EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3

python -m sglang.launch_server \
   --model-path $MODEL_PATH \
   --host 127.0.0.1 \
   --port 32000 \
   --tp-size 16 \
   --dp-size 2 \
   --enable-dp-attention \
   --prefill-delayer-max-delay-passes 100 \
   --enable-prefill-delayer \
   --mem-fraction-static 0.65 \
   --max-running-requests 72 \
   --chunked-prefill-size -1 --max-prefill-tokens 180000 \
   --cuda-graph-bs 8 16 24 32 40 \
   --moe-a2a-backend ascend_fuseep --deepep-mode auto --quantization modelslim \
   --speculative-algorithm EAGLE3 \
   --speculative-draft-model-path $EAGLE_MODEL_PATH \
   --speculative-num-steps 3 \
   --speculative-eagle-topk 1 \
   --speculative-num-draft-tokens 4 \
   --speculative-draft-model-quantization unquant \
   --dtype bfloat16 \
   --trust-remote-code \
   --tokenizer-worker-num 8

Benchmark

We benchmarked with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 32000 --random-input-len 65536 --random-output-len 1024 --num-prompts 144 --random-range-ratio 1 --max-concurrency 36

MiniMax-M2.5 128K-1K High Throughput on A3 8 Cards Mixed Mode

Model: MiniMax-M2.5 Hardware: Atlas 800I A3 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 128K+1K

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export TASK_QUEUE_ENABLE=1

export ASCEND_USE_FIA=1
export HCCL_BUFFSIZE=1600
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=640
export DEEPEP_NORMAL_LONG_SEQ_ROUND=64
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_NPU_DEEPEP_USE_FUSED_MOE_DECODE=1
export SGLANG_NPU_FUSEEP_DECODE_ONLY=1

MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot
EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3

python -m sglang.launch_server \
   --model-path $MODEL_PATH \
   --host 127.0.0.1 \
   --port 32000 \
   --tp-size 16 \
   --dp-size 2 \
   --enable-dp-attention \
   --prefill-delayer-max-delay-passes 100 \
   --enable-prefill-delayer \
   --mem-fraction-static 0.65 \
   --max-running-requests 36 \
   --chunked-prefill-size -1 --max-prefill-tokens 130000 \
   --cuda-graph-bs 8 16 24 \
   --moe-a2a-backend ascend_fuseep --deepep-mode auto --quantization modelslim \
   --speculative-algorithm EAGLE3 \
   --speculative-draft-model-path $EAGLE_MODEL_PATH \
   --speculative-num-steps 3 \
   --speculative-eagle-topk 1 \
   --speculative-num-draft-tokens 4 \
   --speculative-draft-model-quantization unquant \
   --dtype bfloat16 \
   --trust-remote-code \
   --tokenizer-worker-num 8

Benchmark

We benchmarked with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 32000 --random-input-len 131072 --random-output-len 1024 --num-prompts 128 --random-range-ratio 1 --max-concurrency 32

MiniMax-M2.5 64K-1K High Throughput on A3 4 Cards Mixed Mode

Model: MiniMax-M2.5 Hardware: Atlas 800I A3 4Card DeployMode: PD Mixed Dataset: random Input Output Length: 64K+1K

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export TASK_QUEUE_ENABLE=1

export ASCEND_USE_FIA=0
export HCCL_BUFFSIZE=1600
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=640
export DEEPEP_NORMAL_LONG_SEQ_ROUND=64
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
export SGLANG_NPU_FUSED_MOE_MODE=2
export SGLANG_NPU_DEEPEP_USE_FUSED_MOE_DECODE=1
export SGLANG_NPU_FUSEEP_DECODE_ONLY=1

MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot
EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3

python -m sglang.launch_server \
   --model-path $MODEL_PATH \
   --host 127.0.0.1 \
   --port 32000 \
   --tp-size 8 \
   --enable-dp-attention \
   --prefill-delayer-max-delay-passes 500 \
   --enable-prefill-delayer \
   --mem-fraction-static 0.65 \
   --max-running-requests 36 \
   --chunked-prefill-size -1 --max-prefill-tokens 150000 \
   --cuda-graph-bs 8 16 24 32 40 \
   --moe-a2a-backend ascend_fuseep --deepep-mode auto --quantization modelslim \
   --speculative-algorithm EAGLE3 \
   --speculative-draft-model-path $EAGLE_MODEL_PATH \
   --speculative-num-steps 3 \
   --speculative-eagle-topk 1 \
   --speculative-num-draft-tokens 4 \
   --speculative-draft-model-quantization unquant \
   --dtype bfloat16 \
   --trust-remote-code \
   --tokenizer-worker-num 8

Benchmark

We benchmarked with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 32000 --random-input-len 65536 --random-output-len 1024 --num-prompts 144 --random-range-ratio 1 --max-concurrency 36

MiniMax-M2.5 64K-1K High Throughput on A3 16 Cards Disaggregation Mode

Model: MiniMax-M2.5 Hardware: Atlas 800I A3 16Card DeployMode: PD Disaggregation Dataset: random Input Output Length: 64K+1K

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32

export ASCEND_MF_STORE_URL="tcp://your_prefill_ip:24667"

P_IP=('your_prefill_ip')
D_IP=('your_decode_ip')
D_MASTER="${D_IP[0]}:8001"
MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot

EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3

LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')

# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export HCCL_SOCKET_IFNAME=your_nic
        export GLOO_SOCKET_IFNAME=your_nic
        export ASCEND_USE_FIA=1
        export HCCL_BUFFSIZE=2500
        export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
        export TASK_QUEUE_ENABLE=2
        export DEEPEP_NORMAL_LONG_SEQ_ROUND=64
        export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048
        export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
        --port 32000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
        --tp-size 16 --mem-fraction-static 0.43 --attention-backend ascend --device npu --quantization modelslim \
        --disaggregation-transfer-backend ascend --max-running-requests 128 \
        --chunked-prefill-size -1 --max-prefill-tokens 58000 --moe-a2a-backend deepep --deepep-mode normal \
        --tokenizer-worker-num 16 \
        --dp-size 2 --enable-dp-attention --dtype bfloat16 --load-balance-method round_robin \
        --speculative-algorithm EAGLE3 \
        --speculative-draft-model-path $EAGLE_MODEL_PATH \
        --speculative-num-steps 3 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 4 \
        --speculative-draft-model-quantization unquant --skip-server-warmup
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export HCCL_BUFFSIZE=1600
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=640
        export HCCL_SOCKET_IFNAME=your_nic
        export GLOO_SOCKET_IFNAME=your_nic
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export SGLANG_NPU_FUSED_MOE_MODE=2
        export SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS=96

        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
        --cuda-graph-bs 8 16 24 32 40 \
        --port 33000 --trust-remote-code \
        --tp-size 16 --mem-fraction-static 0.76 --attention-backend ascend --device npu --quantization modelslim \
        --nnodes 1 --node-rank $i --dist-init-addr $D_MASTER \
        --disaggregation-transfer-backend ascend --max-running-requests 80 \
        --chunked-prefill-size -1 --moe-a2a-backend ascend_fuseep --deepep-mode low_latency \
        --tokenizer-worker-num 16 \
        --dp-size 2 --enable-dp-attention --dtype bfloat16 \
        --load-balance-method round_robin \
        --speculative-algorithm EAGLE3 \
        --speculative-draft-model-path $EAGLE_MODEL_PATH \
        --speculative-num-steps 3 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 4 \
        --speculative-draft-model-quantization unquant

        NODE_RANK=$i
        break
    fi
done
Command
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy round_robin \
    --prefill http://your_prefill_ip:32000 8998 \
    --decode http://your_decode_ip:33000 \
    --host 127.0.0.1 \
    --mini-lb \
    --port 6688
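In the deployment script above, each prefill node i listens on port 32000 and uses bootstrap port 8998 + i, and the router command passes the same pair after --prefill. A minimal sketch of deriving the router arguments from the P_IP array, so the two stay in sync, might look like this. The addresses are placeholders, prefill_router_args is a hypothetical helper, and repeating --prefill once per node is an assumption about the router CLI.

```shell
#!/usr/bin/env bash
# Sketch: derive router --prefill arguments from the prefill IP list.
# The addresses below are documentation placeholders, not real nodes.
P_IP=('192.0.2.10' '192.0.2.11')
PREFILL_PORT=32000      # --port used by the prefill servers
BOOTSTRAP_BASE=8998     # prefill node i uses bootstrap port BOOTSTRAP_BASE + i

# Print one "--prefill URL BOOTSTRAP_PORT" group per prefill node.
prefill_router_args() {
    local args=() i
    for i in "${!P_IP[@]}"; do
        args+=(--prefill "http://${P_IP[$i]}:${PREFILL_PORT}" "$((BOOTSTRAP_BASE + i))")
    done
    printf '%s\n' "${args[*]}"
}

prefill_router_args
```

With a single entry in P_IP the output reduces to the one --prefill pair shown in the router command above.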

Benchmark

We benchmarked with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --random-input-len 65536 --random-output-len 1024 --num-prompts 640 --random-range-ratio 1 --max-concurrency 160

MiniMax-M2.5 128K-1K High Throughput on A3 16 Cards Disaggregation Mode

Model: MiniMax-M2.5 Hardware: Atlas 800I A3 16Card DeployMode: PD Disaggregation Dataset: random Input Output Length: 128K+1K

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32

export ASCEND_MF_STORE_URL="tcp://your_prefill_ip:24667"

P_IP=('your_prefill_ip')
D_IP=('your_decode_ip')
D_MASTER="${D_IP[0]}:8001"
MODEL_PATH=/path/to/MiniMax-M2.5-w8a8-QuaRot

EAGLE_MODEL_PATH=/path/to/MiniMax-M2.5-eagle-model
export PYTHONPATH=${EAGLE_MODEL_PATH}:$PYTHONPATH
export SGLANG_EXTERNAL_MODEL_PACKAGE=custom_eagle3

LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')

# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export HCCL_SOCKET_IFNAME=your_nic
        export GLOO_SOCKET_IFNAME=your_nic
        export ASCEND_USE_FIA=1
        export HCCL_BUFFSIZE=2500
        export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
        export TASK_QUEUE_ENABLE=2
        export DEEPEP_NORMAL_LONG_SEQ_ROUND=64
        export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=2048
        export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1
        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
        --port 32000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
        --tp-size 16 --mem-fraction-static 0.43 --attention-backend ascend --device npu --quantization modelslim \
        --disaggregation-transfer-backend ascend --max-running-requests 128 \
        --chunked-prefill-size -1 --max-prefill-tokens 130000 --moe-a2a-backend deepep --deepep-mode normal \
        --tokenizer-worker-num 16 \
        --dp-size 2 --enable-dp-attention --dtype bfloat16 --load-balance-method round_robin \
        --speculative-algorithm EAGLE3 \
        --speculative-draft-model-path $EAGLE_MODEL_PATH \
        --speculative-num-steps 2 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 3 \
        --speculative-draft-model-quantization unquant --skip-server-warmup
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export HCCL_BUFFSIZE=1600
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=640
        export HCCL_SOCKET_IFNAME=your_nic
        export GLOO_SOCKET_IFNAME=your_nic
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export SGLANG_NPU_FUSED_MOE_MODE=2
        export SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS=96

        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
        --cuda-graph-bs 2 4 8 \
        --port 33000 --trust-remote-code \
        --tp-size 16 --mem-fraction-static 0.76 --attention-backend ascend --device npu --quantization modelslim \
        --nnodes 1 --node-rank $i --dist-init-addr $D_MASTER \
        --disaggregation-transfer-backend ascend --max-running-requests 80 \
        --chunked-prefill-size -1 --moe-a2a-backend ascend_fuseep --deepep-mode low_latency \
        --tokenizer-worker-num 8 \
        --dp-size 2 --enable-dp-attention --dtype bfloat16 \
        --load-balance-method round_robin \
        --speculative-algorithm EAGLE3 \
        --speculative-draft-model-path $EAGLE_MODEL_PATH \
        --speculative-num-steps 2 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 3 \
        --speculative-draft-model-quantization unquant

        NODE_RANK=$i
        break
    fi
done
Command
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy round_robin \
    --prefill http://your_prefill_ip:32000 8998 \
    --decode http://your_decode_ip:33000 \
    --host 127.0.0.1 \
    --mini-lb \
    --port 6688
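The deployment script above decides a node's role by comparing only the first two addresses reported by hostname -I with the entries of P_IP and D_IP, so a host with more than two addresses could be missed. The following is a minimal sketch that checks every local address instead. find_local_index is a hypothetical helper and the addresses are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: find which entry of an IP list belongs to this host.
# Usage: find_local_index "<space-separated local IPs>" ip0 ip1 ...
# Prints the index of the first matching entry; returns 1 when none match.
find_local_index() {
    local local_ips="$1"
    shift
    local candidates=("$@") i ip
    for i in "${!candidates[@]}"; do
        # Word splitting on $local_ips is intentional: one address per word.
        for ip in $local_ips; do
            if [[ "$ip" == "${candidates[$i]}" ]]; then
                printf '%s\n' "$i"
                return 0
            fi
        done
    done
    return 1
}

# Placeholder addresses; on a real node pass "$(hostname -I)" as the first argument.
if idx="$(find_local_index "10.0.0.5 192.0.2.11" 192.0.2.10 192.0.2.11)"; then
    echo "this host is entry ${idx}"
else
    echo "this host is not in the list"
fi
```

The returned index can take the place of the loop variable i when computing the bootstrap port or the node rank.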

Benchmark

We benchmarked with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --random-input-len 131072 --random-output-len 1024 --num-prompts 192 --random-range-ratio 1 --max-concurrency 48

Kimi K2.5 w4a8 3_5K-1_5K 20ms on A3 8 Cards Mixed Mode

Model: Kimi-K2.5-w4a8 Hardware: Atlas 800I A3 8Card DeployMode: PD Mixed Dataset: random Input Output Length: 3.5K+1.5K TPOT: 20ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export STREAMS_PER_DEVICE=32
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=48
export HCCL_BUFFSIZE=1200
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200

MODEL_PATH=xxx
DRAFT_PATH=xxx

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH --quantization modelslim --dtype bfloat16 \
    --model-loader-extra-config '{"enable_multithread_load": true}' \
    --host 0.0.0.0 --port 6699 \
    --trust-remote-code --device npu --attention-backend ascend \
    --tp-size 16 --base-gpu-id 0 --mem-fraction-static 0.78 --max-running-requests 64 \
    --chunked-prefill-size 32768 --context-length 8192 --max-prefill-tokens 16384 \
    --enable-multimodal --mm-attention-backend ascend_attn --sampling-backend ascend \
    --enable-dp-attention --dp-size 16 \
    --moe-a2a-backend deepep --deepep-mode auto \
    --cuda-graph-bs 1 2 3 4 --disable-radix-cache \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path $DRAFT_PATH \
    --speculative-num-steps 4 --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 5 \
    --speculative-draft-model-quantization unquant

Benchmark

We benchmarked with the random dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 64 --random-output-len 1500 --random-input-len 3500 --num-prompts 64
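
To compare a run against the 20ms TPOT target stated above, the benchmark output can be saved to a file and checked afterwards. The following is a minimal sketch, not part of SGLang: `check_tpot` is a hypothetical helper, and it assumes the saved log contains a line of the form `Mean TPOT (ms): <value>`. Adjust the pattern if your version of the benchmark prints the metric differently.

```shell
# Hypothetical helper: read the mean TPOT from a saved benchmark log and
# compare it with a target in milliseconds.
# Assumption: the log has a line such as "Mean TPOT (ms):   19.42".
check_tpot() {
    log_file=$1
    target_ms=$2
    # Take the text after the colon on the first matching line, without blanks.
    tpot=$(awk -F: '/Mean TPOT \(ms\)/ {gsub(/[ \t]/, "", $2); print $2; exit}' "$log_file")
    if [ -z "$tpot" ]; then
        echo "no TPOT line found in $log_file" >&2
        return 2
    fi
    # Shell arithmetic is integer-only, so awk does the floating-point comparison.
    if awk -v t="$tpot" -v m="$target_ms" 'BEGIN { exit !(t <= m) }'; then
        echo "PASS: mean TPOT ${tpot} ms <= ${target_ms} ms"
    else
        echo "FAIL: mean TPOT ${tpot} ms > ${target_ms} ms"
        return 1
    fi
}
```

One way to use it is to append `| tee bench.log` to the benchmark command and then run `check_tpot bench.log 20`.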

Kimi K2.5 w4a8 3_5K-1_5K 50ms on A3 8 Cards Mixed Mode

Model: Kimi-K2.5-w4a8
Hardware: Atlas 800I A3 8Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 50ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash

export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export STREAMS_PER_DEVICE=32
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=96
export HCCL_BUFFSIZE=1200
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200

MODEL_PATH=xxx
DRAFT_PATH=xxx

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH --quantization modelslim --dtype bfloat16 \
    --model-loader-extra-config '{"enable_multithread_load": true}' \
    --host 0.0.0.0 --port 6699 \
    --trust-remote-code --device npu --attention-backend ascend \
    --tp-size 16 --base-gpu-id 0 --mem-fraction-static 0.7 --max-running-requests 120 \
    --chunked-prefill-size 32768 --context-length 8192 --max-prefill-tokens 16384 \
    --enable-multimodal --mm-attention-backend ascend_attn --sampling-backend ascend \
    --enable-dp-attention --dp-size 16 \
    --moe-a2a-backend deepep --deepep-mode auto \
    --cuda-graph-bs 1 2 4 8 12 16 24 32 48 64 96 120 --disable-radix-cache \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path $DRAFT_PATH \
    --speculative-num-steps 4 --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 5 \
    --speculative-draft-model-quantization unquant

Benchmark

We benchmarked with the random dataset.
Command
python3 -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6699 --random-range-ratio 1 --max-concurrency 120 --random-output-len 1500 --random-input-len 3500 --num-prompts 120

GLM-5.1 3_5K-1_5K 41ms on A3 16 Cards Mixed Mode

Model: GLM-5.1
The model is quantized, with MTP layers excluded from quantization.
Hardware: Atlas 800I A3 16Card
DeployMode: PD Mixed
Dataset: random
Input Output Length: 3.5K+1.5K
TPOT: 41ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export PYTHONPATH=/path/to/sglang/python:$PYTHONPATH

export STREAMS_PER_DEVICE=32

export HCCL_SOCKET_IFNAME=your_nic
export GLOO_SOCKET_IFNAME=your_nic

MODEL_PATH=/path/to/GLM-5.1-w4a8

P_IP=('your ip1' 'your ip2')
P_MASTER="${P_IP[0]}:4567"
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`

echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1

for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export HCCL_BUFFSIZE=2500
        python -m sglang.launch_server \
        --model-path $MODEL_PATH \
        --attention-backend ascend \
        --device npu \
        --dist-init-addr ${P_IP[0]}:5000 \
        --tp-size 32 --nnodes 2 --node-rank $i \
        --dp-size 16 --enable-dp-attention \
        --chunked-prefill-size 131072 --max-prefill-tokens 280000 \
        --trust-remote-code \
        --host 127.0.0.1 \
        --mem-fraction-static 0.65 \
        --port 8001 \
        --served-model-name glm-5 \
        --cuda-graph-max-bs 8 \
        --max-running-requests 128 \
        --quantization modelslim \
        --speculative-draft-model-quantization unquant \
        --moe-a2a-backend deepep --deepep-mode auto \
        --load-balance-method round_robin \
        --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
        NODE_RANK=$i
        break
    fi
done
Quantization Configuration:
  • --quantization modelslim applies only to quantized models; omit it when the weights are not quantized.
  • --speculative-draft-model-quantization unquant depends on the model: set it when the MTP layers are not quantized.
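
The two rules above can be captured in a small helper that emits only the flags a given checkpoint needs. This is a sketch under stated assumptions: `quant_flags` is a hypothetical function, and its two yes/no arguments are supplied by the operator from the model card rather than detected from the checkpoint.

```shell
# Hypothetical helper: print the quantization flags for sglang.launch_server.
#   $1: "yes" if the main model weights are quantized
#   $2: "yes" if the MTP (draft) layers are quantized as well
quant_flags() {
    main_quantized=$1
    mtp_quantized=$2
    flags=""
    if [ "$main_quantized" = "yes" ]; then
        flags="--quantization modelslim"
        # Quantized main model with unquantized MTP layers: mark the draft as unquant.
        if [ "$mtp_quantized" != "yes" ]; then
            flags="$flags --speculative-draft-model-quantization unquant"
        fi
    fi
    echo "$flags"
}

quant_flags yes no
# prints: --quantization modelslim --speculative-draft-model-quantization unquant
```

In a launch script the result would be expanded unquoted so that it splits into separate arguments, for example `python -m sglang.launch_server --model-path "$MODEL_PATH" $(quant_flags yes no)` followed by the remaining options.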

Benchmark

We benchmarked with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 8001 --random-range-ratio 1 --random-output-len 1500 --random-input-len 3500 --num-prompts 320

GLM-5.1 16K-1K 23ms on A3 32 Cards Disaggregation Mode

Model: GLM-5.1
Hardware: Atlas 800I A3 32Card
DeployMode: PD Disaggregation
Dataset: random
Input Output Length: 16K+1K
TPOT: 23ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH}
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTHONPATH=/path/to/sglang/python:$PYTHONPATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32

P_IP=('your prefill ip1' 'your prefill ip2')
D_IP=('your decode ip1' 'your decode ip2')

# P_IP must be defined before it is expanded in ASCEND_MF_STORE_URL.
export ASCEND_MF_STORE_URL="tcp://${P_IP[0]}:24707"
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600

MODEL_PATH=/path/to/GLM-5.1-w4a8

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
        export TASK_QUEUE_ENABLE=2
        export ENABLE_PROFILING=0
        export HCCL_SOCKET_IFNAME=your_nic
        export GLOO_SOCKET_IFNAME=your_nic

        export HCCL_BUFFSIZE=8
        unset PYTORCH_NPU_ALLOC_CONF
        export SGLANG_ZBAL_LOCAL_MEM_SIZE=61184
        export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
        export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
        export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
        export SGLANG_ZBAL_BOOTSTRAP_URL="tcp://${P_IP[0]}:24672"

        python -m sglang.launch_server --model-path ${MODEL_PATH}  --disaggregation-mode prefill --host ${P_IP[$i]} \
        --port 8000 --disaggregation-bootstrap-port 8998 --dist-init-addr ${P_IP[0]}:5000 --trust-remote-code --nnodes 2 --node-rank $i \
        --tp-size 32 --mem-fraction-static 0.75 --attention-backend ascend --device npu --quantization modelslim \
        --disaggregation-transfer-backend ascend --max-running-requests 64 \
        --served-model-name glm-5 --chunked-prefill-size 524288 --max-prefill-tokens 180000 --moe-a2a-backend deepep --deepep-mode normal \
        --disable-shared-experts-fusion --disable-cuda-graph --dtype bfloat16 \
        --dp-size 4 --enable-dp-attention \
        --load-balance-method round_robin \
        --enable-nsa-prefill-context-parallel \
        --nsa-prefill-cp-mode in-seq-split \
        --attn-cp-size 8 \
        --enable-dp-lm-head --moe-dense-tp 1 \
        --speculative-draft-model-quantization unquant \
        --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export SGLANG_SPEC_ENABLE_OVERLAP_REFLOW=1
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export HCCL_BUFFSIZE=650
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
        export TASK_QUEUE_ENABLE=0
        export HCCL_SOCKET_IFNAME=your_nic
        export GLOO_SOCKET_IFNAME=your_nic

        export SGLANG_NPU_USE_MULTI_STREAM=1

        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
        --port 8003 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --ep-size 32 \
        --mem-fraction-static 0.87 --max-running-requests 128 --attention-backend ascend --device npu --quantization modelslim \
        --served-model-name glm-5 --moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency \
        --cuda-graph-bs 1 2 3 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 180000 \
        --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16  --load-balance-method round_robin \
        --speculative-draft-model-quantization unquant \
        --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
        NODE_RANK=$i
        break
    fi
done
Command
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy round_robin \
    --prefill http://your_prefill_ip1:8000 8998 \
    --decode http://your_decode_ip1:8003 \
    --host 127.0.0.1 \
    --port 6688

Benchmark

We benchmarked with the random dataset.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --random-range-ratio 1 --random-output-len 1000 --random-input-len 16000 --num-prompts 192

GLM-5.1 64K-1K-90%_cache_hit 45ms on A3 48 Cards Disaggregation Mode

Model: GLM-5.1
Hardware: Atlas 800I A3 48Card
DeployMode: PD Disaggregation
Dataset: random (90% cache hit)
Input Output Length: 64K+1K
TPOT: 45ms
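
The 48-card total can be cross-checked against the parallelism flags in the commands that follow. The sketch below assumes each Atlas 800I A3 card exposes two NPU devices; that figure is inferred from `--tp-size 16` on the 8-card configurations and is not stated in this section. `cards_for` is a hypothetical helper.

```shell
DEVICES_PER_CARD=2   # assumption: two NPU devices per Atlas 800I A3 card

# cards_for INSTANCES TP PP -> number of cards used by that group of servers
cards_for() {
    instances=$1
    tp=$2
    pp=$3
    echo $(( instances * tp * pp / DEVICES_PER_CARD ))
}

prefill_cards=$(cards_for 4 4 4)   # four single-node prefill servers, tp 4 x pp 4
decode_cards=$(cards_for 1 32 1)   # one two-node decode server, tp 32
echo "prefill=${prefill_cards} decode=${decode_cards} total=$(( prefill_cards + decode_cards ))"
# prints: prefill=32 decode=16 total=48
```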

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH}
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTHONPATH=/path/to/sglang/python:$PYTHONPATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32

P_IP=('your prefill ip1' 'your prefill ip2' 'your prefill ip3' 'your prefill ip4')
D_IP=('your decode ip1' 'your decode ip2')

# P_IP must be defined before it is expanded in ASCEND_MF_STORE_URL.
export ASCEND_MF_STORE_URL="tcp://${P_IP[0]}:24709"
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=1200
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=1200

MODEL_PATH=/path/to/GLM-5.1-w4a8

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
        export TASK_QUEUE_ENABLE=2
        export ENABLE_PROFILING=0
        export HCCL_SOCKET_IFNAME=your_nic
        export GLOO_SOCKET_IFNAME=your_nic

        export ZBAL_HCCL_OP="send,recv"
        export HCCL_BUFFSIZE=128
        unset PYTORCH_NPU_ALLOC_CONF
        export SGLANG_ZBAL_LOCAL_MEM_SIZE=61184
        export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
        export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
        export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
        export SGLANG_ZBAL_BOOTSTRAP_URL="tcp://${P_IP[$i]}:24691"

        python -m sglang.launch_server --model-path ${MODEL_PATH}  --disaggregation-mode prefill --host ${P_IP[$i]} \
        --port 8000 --disaggregation-bootstrap-port $((8998 + i)) --trust-remote-code --nnodes 1 --node-rank 0 \
        --tp-size 4 --mem-fraction-static 0.72 --attention-backend ascend --device npu --quantization modelslim \
        --disaggregation-transfer-backend ascend --max-running-requests 16 \
        --served-model-name glm-5 --chunked-prefill-size 16384 --max-prefill-tokens 180000 --moe-a2a-backend deepep --deepep-mode normal \
        --disable-shared-experts-fusion --disable-cuda-graph --dtype bfloat16 \
        --speculative-draft-model-quantization unquant \
        --enable-nsa-prefill-context-parallel \
        --nsa-prefill-cp-mode in-seq-split \
        --attn-cp-size 4 \
        --enable-dp-lm-head --moe-dense-tp 1 \
        --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
        --pp-size 4
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export SGLANG_SPEC_ENABLE_OVERLAP_REFLOW=1
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export HCCL_BUFFSIZE=300
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=40
        export TASK_QUEUE_ENABLE=0
        export HCCL_SOCKET_IFNAME=your_nic
        export GLOO_SOCKET_IFNAME=your_nic

        export SGLANG_NPU_USE_MULTI_STREAM=1
        export SGLANG_LM_HEAD_TP=4

        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
        --port 8003 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --enable-dp-attention --ep-size 32 \
        --mem-fraction-static 0.85 --max-running-requests 320 --attention-backend ascend --device npu --quantization modelslim \
        --served-model-name glm-5 --moe-a2a-backend deepep --deepep-mode low_latency \
        --cuda-graph-bs 1 2 3 4 5 6 7 8 9 10 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 180000 \
        --tokenizer-worker-num 4 --disable-shared-experts-fusion --dtype bfloat16  --load-balance-method round_robin \
        --speculative-draft-model-quantization unquant \
        --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
        --disaggregation-decode-enable-radix-cache
        NODE_RANK=$i
        break
    fi
done
Command
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy round_robin \
    --prefill http://your_prefill_ip1:8000 8998 \
    --prefill http://your_prefill_ip2:8000 8999 \
    --prefill http://your_prefill_ip3:8000 9000 \
    --prefill http://your_prefill_ip4:8000 9001 \
    --decode http://your_decode_ip1:8003 \
    --host 127.0.0.1 \
    --port 6688
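
The prefill loop above starts node i with bootstrap port 8998 + i, and the router must list the same port next to each prefill URL. A minimal sketch for generating those pairs from the IP list instead of typing them by hand follows; `prefill_args` is a hypothetical helper, and the addresses in the demo call are placeholder documentation IPs.

```shell
# Hypothetical helper: print one "--prefill URL BOOTSTRAP_PORT" pair per node.
#   $1: bootstrap port of the first prefill node
#   $2: HTTP port shared by all prefill servers
#   remaining arguments: prefill IPs, in the same order as P_IP
prefill_args() {
    base_port=$1
    http_port=$2
    shift 2
    i=0
    for ip in "$@"; do
        # "--" stops printf from parsing the format string as an option.
        printf -- '--prefill http://%s:%s %s\n' "$ip" "$http_port" "$(( base_port + i ))"
        i=$(( i + 1 ))
    done
}

prefill_args 8998 8000 192.0.2.11 192.0.2.12
# prints:
# --prefill http://192.0.2.11:8000 8998
# --prefill http://192.0.2.12:8000 8999
```

In the router command the output would be expanded unquoted, for example `$(prefill_args 8998 8000 "${P_IP[@]}")` in place of the four `--prefill` lines, which works as long as the IPs contain no spaces.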

Benchmark

We benchmarked with the random dataset at a 90% cache hit rate. The dataset is generated with this tool.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --random-range-ratio 1 --random-output-len 1000 --random-input-len 64000 --num-prompts 192

GLM-5.1 128K-1K-90%_cache_hit 32ms on A3 48 Cards Disaggregation Mode

Model: GLM-5.1
Hardware: Atlas 800I A3 48Card
DeployMode: PD Disaggregation
Dataset: random (90% cache hit)
Input Output Length: 128K+1K
TPOT: 32ms

Model Deployment

Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH}
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH

export PYTHONPATH=/path/to/sglang/python:$PYTHONPATH

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32

P_IP=('your prefill ip1' 'your prefill ip2')
P1_IP=('your prefill ip3' 'your prefill ip4')
D_IP=('your decode ip1' 'your decode ip2')

# P_IP must be defined before it is expanded in ASCEND_MF_STORE_URL.
export ASCEND_MF_STORE_URL="tcp://${P_IP[0]}:24709"
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=1200
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=1200

MODEL_PATH=/path/to/GLM-5.1-w4a8

LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"

# prefill group 1
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
        export TASK_QUEUE_ENABLE=2
        export ENABLE_PROFILING=0
        export HCCL_SOCKET_IFNAME=your_nic
        export GLOO_SOCKET_IFNAME=your_nic

        export ZBAL_HCCL_OP="send,recv"
        export HCCL_BUFFSIZE=128
        unset PYTORCH_NPU_ALLOC_CONF
        export SGLANG_ZBAL_LOCAL_MEM_SIZE=61184
        export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
        export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
        export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
        export SGLANG_ZBAL_BOOTSTRAP_URL="tcp://${P_IP[0]}:24691"

        python -m sglang.launch_server --model-path ${MODEL_PATH}  --disaggregation-mode prefill --host ${P_IP[$i]} \
        --port 8000 --disaggregation-bootstrap-port 8998 --trust-remote-code --nnodes 2 --node-rank $i --dist-init-addr ${P_IP[0]}:5000 \
        --tp-size 4 --mem-fraction-static 0.72 --attention-backend ascend --device npu --quantization modelslim \
        --disaggregation-transfer-backend ascend --max-running-requests 32 \
        --served-model-name glm-5 --chunked-prefill-size 16384 --max-prefill-tokens 180000 --moe-a2a-backend deepep --deepep-mode normal \
        --disable-shared-experts-fusion --disable-cuda-graph --dtype bfloat16 \
        --speculative-draft-model-quantization unquant \
        --enable-nsa-prefill-context-parallel \
        --nsa-prefill-cp-mode in-seq-split \
        --attn-cp-size 4 \
        --enable-dp-lm-head --moe-dense-tp 1 \
        --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
        --pp-size 8
        NODE_RANK=$i
        break
    fi
done

# prefill group 2
for i in "${!P1_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P1_IP[$i]}" || "$LOCAL_HOST2" == "${P1_IP[$i]}" ]];
    then
        echo "${P1_IP[$i]}"
        export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
        export TASK_QUEUE_ENABLE=2
        export ENABLE_PROFILING=0
        export HCCL_SOCKET_IFNAME=your_nic
        export GLOO_SOCKET_IFNAME=your_nic

        export ZBAL_HCCL_OP="send,recv"
        export HCCL_BUFFSIZE=128
        unset PYTORCH_NPU_ALLOC_CONF
        export SGLANG_ZBAL_LOCAL_MEM_SIZE=61184
        export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
        export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
        export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True
        export SGLANG_ZBAL_BOOTSTRAP_URL="tcp://${P1_IP[0]}:24691"

        python -m sglang.launch_server --model-path ${MODEL_PATH}  --disaggregation-mode prefill --host ${P1_IP[$i]} \
        --port 8000 --disaggregation-bootstrap-port 8999 --trust-remote-code --nnodes 2 --node-rank $i --dist-init-addr ${P1_IP[0]}:5000 \
        --tp-size 4 --mem-fraction-static 0.72 --attention-backend ascend --device npu --quantization modelslim \
        --disaggregation-transfer-backend ascend --max-running-requests 32 \
        --served-model-name glm-5 --chunked-prefill-size 16384 --max-prefill-tokens 180000 --moe-a2a-backend deepep --deepep-mode normal \
        --disable-shared-experts-fusion --disable-cuda-graph --dtype bfloat16 \
        --speculative-draft-model-quantization unquant \
        --enable-nsa-prefill-context-parallel \
        --nsa-prefill-cp-mode in-seq-split \
        --attn-cp-size 4 \
        --enable-dp-lm-head --moe-dense-tp 1 \
        --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
        --pp-size 8
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export SGLANG_SPEC_ENABLE_OVERLAP_REFLOW=1
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export HCCL_BUFFSIZE=200
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=24
        export TASK_QUEUE_ENABLE=0
        export HCCL_SOCKET_IFNAME=your_nic
        export GLOO_SOCKET_IFNAME=your_nic

        export SGLANG_NPU_USE_MULTI_STREAM=1

        python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
        --port 8003 --trust-remote-code --dist-init-addr ${D_IP[0]}:5000 --nnodes 2 --node-rank $i --tp-size 32 --dp-size 32 --enable-dp-attention --ep-size 32 \
        --mem-fraction-static 0.865 --max-running-requests 96 --attention-backend ascend --device npu --quantization modelslim \
        --served-model-name glm-5 --moe-a2a-backend deepep --deepep-mode low_latency \
        --cuda-graph-bs 1 2 3 4 5 6 --disaggregation-transfer-backend ascend --watchdog-timeout 9000  \
        --tokenizer-worker-num 32 --disable-shared-experts-fusion --dtype bfloat16  --load-balance-method round_robin \
        --speculative-draft-model-quantization unquant \
        --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
        --disaggregation-decode-enable-radix-cache
        NODE_RANK=$i
        break
    fi
done
Command
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy round_robin \
    --prefill http://your_prefill_ip1:8000 8998 \
    --prefill http://your_prefill_ip3:8000 8999 \
    --decode http://your_decode_ip1:8003 \
    --host 127.0.0.1 \
    --port 6688

Benchmark

We benchmarked with the random dataset at a 90% cache hit rate. The dataset is generated with this tool.
Command
python -m sglang.bench_serving --dataset-name random --backend sglang --host 127.0.0.1 --port 6688 --random-range-ratio 1 --random-output-len 1000 --random-input-len 131072 --num-prompts 192