Qwen3.5-397B - SGLang Documentation

This page covers the optimal configurations and benchmark results for Qwen3.5-397B on the Ascend NPU. For environment setup, model weight download, feature configuration, and deployment instructions, see the Qwen3.5-397B Model Tutorial.

On A3, each card has 2 dies, so --tp-size is twice the card count; see the Ascend NPU Reference for details.
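
As a quick illustration of the die-count rule, the following sketch derives the value to pass to --tp-size from the number of cards. The variable names CARDS, DIES_PER_CARD, and TP_SIZE are placeholders chosen for this example; they are not settings read by SGLang.

```shell
# Hypothetical helper variables; only --tp-size is an actual launch flag.
CARDS=8            # number of Atlas 800I A3 cards in the deployment
DIES_PER_CARD=2    # each A3 card has 2 dies
TP_SIZE=$((CARDS * DIES_PER_CARD))
echo "--tp-size ${TP_SIZE}"
```

With 8 cards this gives --tp-size 16, the value used in the launch commands on this page.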

Low Latency

Model	Hardware	Cards	Deploy Mode	Dataset	TPOT	Quantization	Configuration
Qwen3.5-397B	Atlas 800I A3	8	PD Mixed	128k+1k	20ms	W4A8 INT8	Optimal Configuration
Qwen3.5-397B	Atlas 800I A3	8	PD Mixed	16k+1k	20ms	W4A8 INT8	Optimal Configuration
Qwen3.5-397B	Atlas 800I A3	8	PD Mixed	3.5k+1.5k	20ms	W4A8 INT8	Optimal Configuration
Qwen3.5-397B	Atlas 800I A3	8	PD Mixed	64k+1k	20ms	W4A8 INT8	Optimal Configuration

High Throughput

Model	Hardware	Cards	Deploy Mode	Dataset	TPOT	Quantization	Configuration
Qwen3.5-397B	Atlas 800I A3	8	PD Mixed	128k+1k	50ms	W4A8 INT8	Optimal Configuration
Qwen3.5-397B	Atlas 800I A3	8	PD Mixed	128k+1k (90% prefix cache hit rate)	50ms	W4A8 INT8	Optimal Configuration
Qwen3.5-397B	Atlas 800I A3	8	PD Mixed	16k+1k	50ms	W4A8 INT8	Optimal Configuration
Qwen3.5-397B	Atlas 800I A3	8	PD Mixed	3.5k+1.5k	50ms	W4A8 INT8	Optimal Configuration
Qwen3.5-397B	Atlas 800I A3	8	PD Mixed	64k+1k	50ms	W4A8 INT8	Optimal Configuration
Qwen3.5-397B	Atlas 800I A3	8	PD Mixed	64k+1k (90% prefix cache hit rate)	50ms	W4A8 INT8	Optimal Configuration

Optimal Configuration

Qwen3.5-397B W4A8 8P IN128K OUT1K 20ms

Model: Qwen3.5-397B Hardware: Atlas 800I A3 Cards: 8 Deploy Mode: PD Mixed Quantization: W4A8 INT8 Dataset: 128k+1k TPOT: 20ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=4096
export DEEPEP_NORMAL_LONG_SEQ_ROUND=32
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export GDN_ATTN_BACKEND_TRITON=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=0
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ZBAL_BOOTSTRAP_URL=tcp://127.0.0.1:24669
export SGLANG_ZBAL_LOCAL_MEM_SIZE=60672
export STREAMS_PER_DEVICE=32
export ZBAL_ENABLE_GRAPH=1
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --attention-backend ascend \
    --device npu \
    --tp-size 16 \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 131072 \
    --prefill-max-requests 1 \
    --disable-radix-cache \
    --trust-remote-code \
    --max-running-requests 16 \
    --mem-fraction-static 0.6 \
    --cuda-graph-bs 2 3 4 5 6 8 10 12 14 16 \
    --quantization modelslim \
    --enable-multimodal \
    --moe-a2a-backend deepep \
    --deepep-mode auto \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --speculative-draft-model-quantization unquant \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder

Benchmark

This configuration was benchmarked with the random dataset.
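
Before starting a benchmark run, it can help to confirm that the server is answering requests, since loading the weights may take a while. The sketch below is one way to poll for readiness. wait_for is a hypothetical helper written for this example, and probing the server's /health route is an assumption about the deployed SGLang version.

```shell
# wait_for RETRIES DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or RETRIES is exhausted.
wait_for() {
    retries=$1
    delay=$2
    shift 2
    attempt=0
    while [ "$attempt" -lt "$retries" ]; do
        # Success of the probe command means the server is ready.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1
}

# Example probe (assumes the host and port from the launch command):
#   wait_for 120 5 curl -sf http://127.0.0.1:6688/health
```

The example probe retries every 5 seconds, up to 120 times, and returns a non-zero status if the server never responds, so the benchmark command can be chained after it with &&.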

Command

python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 3 \
    --random-input-len 131072 \
    --random-output-len 1024 \
    --num-prompts 3 \
    --random-range-ratio 1 \
    --request-rate inf \
    --warmup-requests 2

Qwen3.5-397B W4A8 8P IN128K OUT1K 50ms

Model: Qwen3.5-397B Hardware: Atlas 800I A3 Cards: 8 Deploy Mode: PD Mixed Quantization: W4A8 INT8 Dataset: 128k+1k TPOT: 50ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=4096
export DEEPEP_NORMAL_LONG_SEQ_ROUND=32
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export GDN_ATTN_BACKEND_TRITON=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=0
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ZBAL_BOOTSTRAP_URL=tcp://127.0.0.1:24669
export SGLANG_ZBAL_LOCAL_MEM_SIZE=60672
export STREAMS_PER_DEVICE=32
export ZBAL_ENABLE_GRAPH=1
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --attention-backend ascend \
    --device npu \
    --tp-size 16 \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 131072 \
    --prefill-max-requests 1 \
    --disable-radix-cache \
    --trust-remote-code \
    --max-running-requests 16 \
    --mem-fraction-static 0.6 \
    --cuda-graph-bs 2 4 6 8 12 14 16 \
    --quantization modelslim \
    --enable-multimodal \
    --moe-a2a-backend deepep \
    --deepep-mode auto \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --speculative-draft-model-quantization unquant \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder

Benchmark

This configuration was benchmarked with the random dataset.

Command

python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 10 \
    --random-input-len 131072 \
    --random-output-len 1024 \
    --num-prompts 10 \
    --random-range-ratio 1 \
    --request-rate inf \
    --warmup-requests 8

Qwen3.5-397B W4A8 8P IN128K OUT1K PREFIX90 50ms

Model: Qwen3.5-397B Hardware: Atlas 800I A3 Cards: 8 Deploy Mode: PD Mixed Quantization: W4A8 INT8 Dataset: 128k+1k (90% prefix cache hit rate) TPOT: 50ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=4096
export DEEPEP_NORMAL_LONG_SEQ_ROUND=32
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export GDN_ATTN_BACKEND_TRITON=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=2200
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --attention-backend ascend \
    --device npu \
    --tp-size 16 \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 131072 \
    --max-mamba-cache-size 320 \
    --prefill-max-requests 10 \
    --mamba-radix-cache-strategy extra_buffer \
    --trust-remote-code \
    --max-running-requests 64 \
    --mem-fraction-static 0.6 \
    --quantization modelslim \
    --enable-multimodal \
    --moe-a2a-backend deepep \
    --deepep-mode auto \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --speculative-draft-model-quantization unquant \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder

Benchmark

This configuration was benchmarked with the generated-shared-prefix dataset at a 90% prefix cache hit rate (repeat_rate = 0.9):

--gsp-system-prompt-len 117964 = int(131072 * 0.9) is the shared prefix portion.
--gsp-question-len 13107 = int(131072 * (1 - 0.9)) is the unique per-request suffix.
--gsp-num-groups 1 keeps all requests in one prefix group for maximum cache reuse.
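
The two length values can be reproduced with integer arithmetic, as in the sketch below. TOTAL_INPUT_LEN, HIT_PERCENT, SYSTEM_PROMPT_LEN, and QUESTION_LEN are names made up for this example; only the --gsp-* flags belong to the benchmark command.

```shell
# Hypothetical variable names used only for this illustration.
TOTAL_INPUT_LEN=131072   # nominal 128k input length
HIT_PERCENT=90           # share of each prompt that is the common prefix

# Shell integer division truncates, which matches int(...) in the formulas.
SYSTEM_PROMPT_LEN=$((TOTAL_INPUT_LEN * HIT_PERCENT / 100))
QUESTION_LEN=$((TOTAL_INPUT_LEN * (100 - HIT_PERCENT) / 100))

echo "--gsp-system-prompt-len ${SYSTEM_PROMPT_LEN} --gsp-question-len ${QUESTION_LEN}"
```

Because both values are truncated, they sum to 131071, one token short of the nominal input length.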

Command

python -m sglang.bench_serving \
    --dataset-name generated-shared-prefix \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --gsp-num-groups 1 \
    --gsp-prompts-per-group 40 \
    --gsp-system-prompt-len 117964 \
    --gsp-question-len 13107 \
    --gsp-output-len 1024 \
    --max-concurrency 40 \
    --num-prompts 40 \
    --request-rate inf

Qwen3.5-397B W4A8 8P IN16K OUT1K 20ms

Model: Qwen3.5-397B Hardware: Atlas 800I A3 Cards: 8 Deploy Mode: PD Mixed Quantization: W4A8 INT8 Dataset: 16k+1k TPOT: 20ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=4096
export DEEPEP_NORMAL_LONG_SEQ_ROUND=20
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export GDN_ATTN_BACKEND_TRITON=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=0
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ZBAL_BOOTSTRAP_URL=tcp://127.0.0.1:24669
export SGLANG_ZBAL_LOCAL_MEM_SIZE=59648
export STREAMS_PER_DEVICE=32
export ZBAL_ENABLE_GRAPH=1
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --attention-backend ascend \
    --device npu \
    --tp-size 16 \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 50000 \
    --prefill-max-requests 4 \
    --disable-radix-cache \
    --trust-remote-code \
    --max-running-requests 48 \
    --mem-fraction-static 0.8 \
    --max-total-tokens 210000 \
    --cuda-graph-bs 2 4 6 8 10 12 \
    --quantization modelslim \
    --enable-multimodal \
    --moe-a2a-backend deepep \
    --deepep-mode auto \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --dp-size 4 \
    --enable-dp-attention \
    --enable-dp-lm-head \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --speculative-draft-model-quantization unquant \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder

Benchmark

This configuration was benchmarked with the random dataset.

Command

python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 40 \
    --random-input-len 16384 \
    --random-output-len 1024 \
    --num-prompts 40 \
    --random-range-ratio 1 \
    --request-rate inf \
    --warmup-requests 32

Qwen3.5-397B W4A8 8P IN16K OUT1K 50ms

Model: Qwen3.5-397B Hardware: Atlas 800I A3 Cards: 8 Deploy Mode: PD Mixed Quantization: W4A8 INT8 Dataset: 16k+1k TPOT: 50ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=4096
export DEEPEP_NORMAL_LONG_SEQ_ROUND=20
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export GDN_ATTN_BACKEND_TRITON=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=0
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ZBAL_BOOTSTRAP_URL=tcp://127.0.0.1:24669
export SGLANG_ZBAL_LOCAL_MEM_SIZE=58624
export STREAMS_PER_DEVICE=32
export ZBAL_ENABLE_GRAPH=1
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --attention-backend ascend \
    --device npu \
    --tp-size 16 \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 65536 \
    --prefill-max-requests 4 \
    --disable-radix-cache \
    --trust-remote-code \
    --max-running-requests 144 \
    --mem-fraction-static 0.8 \
    --max-total-tokens 635000 \
    --cuda-graph-bs 2 4 6 8 12 14 16 18 20 24 26 28 30 32 34 36 \
    --quantization modelslim \
    --enable-multimodal \
    --moe-a2a-backend deepep \
    --deepep-mode auto \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --dp-size 4 \
    --enable-dp-attention \
    --enable-dp-lm-head \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --speculative-draft-model-quantization unquant \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder

Benchmark

This configuration was benchmarked with the random dataset.

Command

python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 132 \
    --random-input-len 16384 \
    --random-output-len 1024 \
    --num-prompts 132 \
    --random-range-ratio 1 \
    --request-rate inf \
    --warmup-requests 8

Qwen3.5-397B W4A8 8P IN3K5 OUT1K5 20ms

Model: Qwen3.5-397B Hardware: Atlas 800I A3 Cards: 8 Deploy Mode: PD Mixed Quantization: W4A8 INT8 Dataset: 3.5k+1.5k TPOT: 20ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3584
export DEEPEP_NORMAL_LONG_SEQ_ROUND=6
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export GDN_ATTN_BACKEND_TRITON=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=0
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ZBAL_BOOTSTRAP_URL=tcp://127.0.0.1:24669
export SGLANG_ZBAL_LOCAL_MEM_SIZE=58624
export STREAMS_PER_DEVICE=32
export ZBAL_ENABLE_GRAPH=1
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --attention-backend ascend \
    --device npu \
    --tp-size 16 \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 35000 \
    --max-total-tokens 128000 \
    --disable-radix-cache \
    --trust-remote-code \
    --max-running-requests 160 \
    --mem-fraction-static 0.8 \
    --cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 \
    --quantization modelslim \
    --enable-multimodal \
    --moe-a2a-backend deepep \
    --deepep-mode auto \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --dp-size 8 \
    --enable-dp-attention \
    --enable-dp-lm-head \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --speculative-draft-model-quantization unquant \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder

Benchmark

This configuration was benchmarked with the random dataset.

Command

python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 160 \
    --random-input-len 3500 \
    --random-output-len 1500 \
    --num-prompts 160 \
    --random-range-ratio 1 \
    --request-rate inf \
    --warmup-requests 64

Qwen3.5-397B W4A8 8P IN3K5 OUT1K5 50ms

Model: Qwen3.5-397B Hardware: Atlas 800I A3 Cards: 8 Deploy Mode: PD Mixed Quantization: W4A8 INT8 Dataset: 3.5k+1.5k TPOT: 50ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3584
export DEEPEP_NORMAL_LONG_SEQ_ROUND=6
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export GDN_ATTN_BACKEND_TRITON=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=0
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ZBAL_BOOTSTRAP_URL=tcp://127.0.0.1:24669
export SGLANG_ZBAL_LOCAL_MEM_SIZE=59648
export STREAMS_PER_DEVICE=32
export ZBAL_ENABLE_GRAPH=1
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --attention-backend ascend \
    --device npu \
    --tp-size 16 \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 17500 \
    --max-total-tokens 280000 \
    --disable-radix-cache \
    --trust-remote-code \
    --max-running-requests 432 \
    --mem-fraction-static 0.8 \
    --cuda-graph-bs 2 4 6 8 12 16 20 24 28 32 36 40 44 48 50 52 54 \
    --quantization modelslim \
    --enable-multimodal \
    --moe-a2a-backend deepep \
    --deepep-mode auto \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --dp-size 8 \
    --enable-dp-attention \
    --enable-dp-lm-head \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --speculative-draft-model-quantization unquant \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder

Benchmark

This configuration was benchmarked with the random dataset.

Command

python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 432 \
    --random-input-len 3500 \
    --random-output-len 1500 \
    --num-prompts 432 \
    --random-range-ratio 1 \
    --request-rate inf \
    --warmup-requests 16

Qwen3.5-397B W4A8 8P IN64K OUT1K 20ms

Model: Qwen3.5-397B Hardware: Atlas 800I A3 Cards: 8 Deploy Mode: PD Mixed Quantization: W4A8 INT8 Dataset: 64k+1k TPOT: 20ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=4096
export DEEPEP_NORMAL_LONG_SEQ_ROUND=20
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export GDN_ATTN_BACKEND_TRITON=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=0
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ZBAL_BOOTSTRAP_URL=tcp://127.0.0.1:24669
export SGLANG_ZBAL_LOCAL_MEM_SIZE=58672
export STREAMS_PER_DEVICE=32
export ZBAL_ENABLE_GRAPH=1
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --attention-backend ascend \
    --device npu \
    --tp-size 16 \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 65536 \
    --prefill-max-requests 1 \
    --disable-radix-cache \
    --trust-remote-code \
    --max-running-requests 16 \
    --mem-fraction-static 0.6 \
    --max-total-tokens 1065000 \
    --cuda-graph-bs 2 4 6 8 10 12 14 16 \
    --quantization modelslim \
    --enable-multimodal \
    --moe-a2a-backend deepep \
    --deepep-mode auto \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --dp-size 2 \
    --enable-dp-attention \
    --enable-dp-lm-head \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --speculative-draft-model-quantization unquant \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder

Benchmark

This configuration was benchmarked with the random dataset.

Command

python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 6 \
    --random-input-len 65536 \
    --random-output-len 1024 \
    --num-prompts 6 \
    --random-range-ratio 1 \
    --request-rate inf \
    --warmup-requests 6

Qwen3.5-397B W4A8 8P IN64K OUT1K 50ms

Model: Qwen3.5-397B Hardware: Atlas 800I A3 Cards: 8 Deploy Mode: PD Mixed Quantization: W4A8 INT8 Dataset: 64k+1k TPOT: 50ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=4096
export DEEPEP_NORMAL_LONG_SEQ_ROUND=20
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export GDN_ATTN_BACKEND_TRITON=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=0
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ZBAL_BOOTSTRAP_URL=tcp://127.0.0.1:24669
export SGLANG_ZBAL_LOCAL_MEM_SIZE=58672
export STREAMS_PER_DEVICE=32
export ZBAL_ENABLE_GRAPH=1
export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --attention-backend ascend \
    --device npu \
    --tp-size 16 \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 65536 \
    --prefill-max-requests 1 \
    --disable-radix-cache \
    --trust-remote-code \
    --max-running-requests 32 \
    --mem-fraction-static 0.6 \
    --max-total-tokens 1065000 \
    --cuda-graph-bs 2 4 6 8 12 14 16 \
    --quantization modelslim \
    --enable-multimodal \
    --moe-a2a-backend deepep \
    --deepep-mode auto \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --dp-size 2 \
    --enable-dp-attention \
    --enable-dp-lm-head \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --speculative-draft-model-quantization unquant \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder

Benchmark

This configuration was benchmarked with the random dataset.

Command

python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 28 \
    --random-input-len 65536 \
    --random-output-len 1024 \
    --num-prompts 28 \
    --random-range-ratio 1 \
    --request-rate inf \
    --warmup-requests 8

Qwen3.5-397B W4A8 8P IN64K OUT1K PREFIX90 50ms

Model: Qwen3.5-397B Hardware: Atlas 800I A3 Cards: 8 Deploy Mode: PD Mixed Quantization: W4A8 INT8 Dataset: 64k+1k (90% prefix cache hit rate) TPOT: 50ms

Model Deployment

Command

# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=4096
export DEEPEP_NORMAL_LONG_SEQ_ROUND=20
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export GDN_ATTN_BACKEND_TRITON=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=2200
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --attention-backend ascend \
    --device npu \
    --tp-size 16 \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 65536 \
    --max-mamba-cache-size 640 \
    --mamba-radix-cache-strategy extra_buffer \
    --trust-remote-code \
    --max-running-requests 128 \
    --mem-fraction-static 0.6 \
    --max-total-tokens 1310720 \
    --quantization modelslim \
    --enable-multimodal \
    --moe-a2a-backend deepep \
    --deepep-mode auto \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --dp-size 2 \
    --enable-dp-attention \
    --enable-dp-lm-head \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --speculative-draft-model-quantization unquant \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder

Benchmark

This configuration was benchmarked with the generated-shared-prefix dataset at a 90% prefix cache hit rate (repeat_rate = 0.9):

--gsp-system-prompt-len 58982 = int(65536 * 0.9) is the shared prefix portion.
--gsp-question-len 6553 = int(65536 * (1 - 0.9)) is the unique per-request suffix.
--gsp-num-groups 1 keeps all requests in one prefix group for maximum cache reuse.
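
As a sanity check on these values, the sketch below works backwards from the two lengths to the resulting prompt length and the share of each prompt that is common prefix. PROMPT_LEN and SHARED_BP (the share in hundredths of a percent) are names invented for this example.

```shell
# Values taken from the benchmark flags; variable names are illustrative.
SYSTEM_PROMPT_LEN=58982   # --gsp-system-prompt-len
QUESTION_LEN=6553         # --gsp-question-len

PROMPT_LEN=$((SYSTEM_PROMPT_LEN + QUESTION_LEN))
# Shared share in basis points (1/100 of a percent), truncated.
SHARED_BP=$((SYSTEM_PROMPT_LEN * 10000 / PROMPT_LEN))

echo "prompt length: ${PROMPT_LEN} tokens"
printf 'shared prefix: %d.%02d%%\n' $((SHARED_BP / 100)) $((SHARED_BP % 100))
```

Each prompt comes to 65535 tokens, one short of the nominal 64k because of truncation, and the shared prefix works out to 90.00% of it.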

Command

python -m sglang.bench_serving \
    --dataset-name generated-shared-prefix \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --gsp-num-groups 1 \
    --gsp-prompts-per-group 112 \
    --gsp-system-prompt-len 58982 \
    --gsp-question-len 6553 \
    --gsp-output-len 1024 \
    --max-concurrency 112 \
    --num-prompts 112 \
    --request-rate inf
