Qwen3-32B - SGLang Documentation

This page covers the optimal configuration and benchmark results for Qwen3-32B on Ascend NPUs. For environment setup, model weight download, feature configuration, and deployment instructions, see the Qwen3-32B Model Tutorial. On A3, each card has 2 dies, so --tp-size is twice the card count; see Ascend NPU Reference for details.
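
The A3 sizing rule can be sketched in shell as follows. This is only an illustration: CARDS and DIES_PER_CARD are hypothetical variable names, not SGLang settings; the only real launch option involved is --tp-size.

```shell
# Illustrative sketch: derive --tp-size from the card count on A3.
# CARDS and DIES_PER_CARD are hypothetical names used only here.
CARDS=8            # number of Atlas 800I A3 cards in use
DIES_PER_CARD=2    # on A3, each card has 2 dies
TP_SIZE=$((CARDS * DIES_PER_CARD))
echo "--tp-size ${TP_SIZE}"
```

With CARDS=8 this gives 16, and with CARDS=2 it gives 4, which matches the --tp-size values used in the A3 configurations on this page.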

Low Latency

Model	Hardware	Cards	Deploy Mode	Dataset	TPOT	Quantization	Configuration
Qwen3-32B	Atlas 800I A3	8	PD Mixed	18k+4k	6ms	BF16	Optimal Configuration

High Throughput

Model	Hardware	Cards	Deploy Mode	Dataset	TPOT	Quantization	Configuration
Qwen3-32B	Atlas 800I A2	2	PD Mixed	3.5k+1.5k	50ms	W8A8 INT8	Optimal Configuration
Qwen3-32B	Atlas 800I A3	2	PD Mixed	3.5k+1.5k	50ms	W8A8 INT8	Optimal Configuration

Optimal Configuration

Qwen3-32B BF16 8P IN18K OUT4K 6ms

Model: Qwen3-32B | Hardware: Atlas 800I A3 | Cards: 8 | Deploy Mode: PD Mixed | Quantization: BF16 | Dataset: 18k+4k | TPOT: 6ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   DRAFT_MODEL_PATH: path to the draft model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=200
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --trust-remote-code \
    --nnodes 1 \
    --node-rank 0 \
    --attention-backend ascend \
    --device npu \
    --max-running-requests 1 \
    --disable-radix-cache \
    --speculative-draft-model-quantization unquant \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 65536 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path $DRAFT_MODEL_PATH \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 5 \
    --tp-size 16 \
    --mem-fraction-static 0.72 \
    --cuda-graph-bs 1 \
    --dtype bfloat16 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen
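
Loading the weights and capturing graphs can take a while, so it may help to wait until the server responds before starting the benchmark. The sketch below assumes the server answers on its /health endpoint once it is up; wait_for_server is a hypothetical helper name, not part of SGLang, and the 5-second poll interval is an arbitrary choice.

```shell
# Poll a URL until it returns a successful HTTP status or the timeout expires.
# wait_for_server is a hypothetical helper, not an SGLang command.
wait_for_server() {
    url="$1"
    timeout_s="${2:-600}"   # default timeout in seconds (assumed value)
    waited=0
    until curl -sf -o /dev/null "$url"; do
        if [ "$waited" -ge "$timeout_s" ]; then
            echo "server not ready after ${timeout_s}s" >&2
            return 1
        fi
        sleep 5
        waited=$((waited + 5))
    done
    echo "server ready after ~${waited}s"
}
```

For the command above, one possible call would be `wait_for_server http://127.0.0.1:6688/health 900`, followed by the benchmark once it returns successfully.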

Benchmark

The benchmark uses the random dataset.

Command

python3 -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 1 \
    --random-input-len 18000 \
    --random-output-len 4000 \
    --num-prompts 1 \
    --random-range-ratio 1

Qwen3-32B W8A8 2P IN3K5 OUT1K5 50ms A2

Model: Qwen3-32B | Hardware: Atlas 800I A2 | Cards: 2 | Deploy Mode: PD Mixed | Quantization: W8A8 INT8 | Dataset: 3.5k+1.5k | TPOT: 50ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   DRAFT_MODEL_PATH: path to the draft model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_NPU_USE_DEEPGEMM=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --trust-remote-code \
    --nnodes 1 \
    --node-rank 0 \
    --attention-backend ascend \
    --device npu \
    --quantization modelslim \
    --max-running-requests 101 \
    --disable-radix-cache \
    --speculative-draft-model-quantization unquant \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 35000 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path $DRAFT_MODEL_PATH \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --tp-size 4 \
    --mem-fraction-static 0.845 \
    --cuda-graph-bs 16 32 64 72 88 90 92 94 96 97 98 99 100 101 \
    --dtype bfloat16 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen

Benchmark

The benchmark uses the random dataset.

Command

python3 -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 100 \
    --random-input-len 3584 \
    --random-output-len 1536 \
    --num-prompts 400 \
    --random-range-ratio 1
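
As a sanity check on these parameters, the 3.5k+1.5k dataset label corresponds to 3584 input and 1536 output tokens (3.5 × 1024 and 1.5 × 1024). With --random-range-ratio 1, the lengths are expected to be fixed rather than sampled. The sketch below works out the resulting token volume; the variable names are illustrative only.

```shell
# Illustrative arithmetic for the high-throughput benchmark parameters.
INPUT_LEN=$((7 * 1024 / 2))     # 3.5k tokens -> 3584
OUTPUT_LEN=$((3 * 1024 / 2))    # 1.5k tokens -> 1536
NUM_PROMPTS=400
TOKENS_PER_REQUEST=$((INPUT_LEN + OUTPUT_LEN))
TOTAL_TOKENS=$((NUM_PROMPTS * TOKENS_PER_REQUEST))
echo "per request: ${TOKENS_PER_REQUEST}, total: ${TOTAL_TOKENS}"
```

Each request therefore carries 5120 tokens, and the full run of 400 prompts processes 2048000 tokens in total (input plus output).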

Qwen3-32B W8A8 2P IN3K5 OUT1K5 50ms

Model: Qwen3-32B | Hardware: Atlas 800I A3 | Cards: 2 | Deploy Mode: PD Mixed | Quantization: W8A8 INT8 | Dataset: 3.5k+1.5k | TPOT: 50ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   DRAFT_MODEL_PATH: path to the draft model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_NPU_USE_DEEPGEMM=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --trust-remote-code \
    --nnodes 1 \
    --node-rank 0 \
    --attention-backend ascend \
    --device npu \
    --quantization modelslim \
    --max-running-requests 101 \
    --disable-radix-cache \
    --speculative-draft-model-quantization unquant \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 35000 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path $DRAFT_MODEL_PATH \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --tp-size 4 \
    --mem-fraction-static 0.845 \
    --cuda-graph-bs 16 32 64 72 88 90 92 94 96 97 98 99 100 101 \
    --dtype bfloat16 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen

Benchmark

The benchmark uses the random dataset.

Command

python3 -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 100 \
    --random-input-len 3584 \
    --random-output-len 1536 \
    --num-prompts 400 \
    --random-range-ratio 1
