This guide lists best-practice performance data and the matching deployment configurations for Qwen3.6-35B-A3B on Ascend NPUs.

Low Latency

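| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3.6-35B-A3B | Atlas 800I A3 | 1 | PD Mixed | 254K+1K | 16.1ms | W8A8 INT8 | Optimal Configuration |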

High Throughput

| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3.6-35B-A3B | Atlas 800I A3 | 1 | PD Mixed | 1024x1024 (30)+1024 | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.6-35B-A3B | Atlas 800I A3 | 1 | PD Mixed | 1080p_30+256 | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.6-35B-A3B | Atlas 800I A3 | 1 | PD Mixed | 128K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.6-35B-A3B | Atlas 800I A3 | 1 | PD Mixed | 128K+1K (90% prefix cache hit rate) | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.6-35B-A3B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.6-35B-A3B | Atlas 800I A3 | 1 | PD Mixed | 64K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.6-35B-A3B | Atlas 800I A3 | 1 | PD Mixed | 64K+1K (90% prefix cache hit rate) | 50ms | W8A8 INT8 | Optimal Configuration |
| Qwen3.6-35B-A3B | Atlas 800I A3 | 2 | PD Mixed | 984K+1K | 40.91ms | W8A8 INT8 | Optimal Configuration |
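
TPOT (time per output token) converts directly into a per-request decode rate: tokens per second is 1000 divided by the TPOT in milliseconds. The sketch below does that arithmetic for the three TPOT values listed in the tables. The helper name `tpot_to_tps` is ours, and the result ignores prefill time, so read it as a rough per-request decode speed rather than a measured throughput figure.

```shell
#!/usr/bin/env bash
# Convert TPOT (milliseconds per output token) into tokens per second
# for a single request. tpot_to_tps is a hypothetical helper name,
# not part of SGLang.
tpot_to_tps() {
    awk -v tpot="$1" 'BEGIN { printf "%.1f\n", 1000 / tpot }'
}

# TPOT values taken from the tables in this guide.
for tpot in 16.1 40.91 50; do
    echo "TPOT ${tpot}ms -> $(tpot_to_tps "$tpot") tokens/s per request"
done
```

A 50ms TPOT therefore corresponds to 20 output tokens per second for each running request.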

Optimal Configuration

Qwen3.6-35B-A3B 1P IN1024X1024 30 OUT1024 50ms

Model: Qwen3.6-35B-A3B
Hardware: Atlas 800I A3
Cards: 1
Deploy Mode: PD Mixed
Quantization: W8A8 INT8
Dataset: 1024x1024 (30)+1024 (format: resolution (input tokens) + output tokens)
TPOT: 50ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=30
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --tp-size 2 \
    --nnodes 1 \
    --attention-backend ascend \
    --device npu \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 16384 \
    --disable-radix-cache \
    --trust-remote-code \
    --enable-prefill-delayer \
    --max-running-requests 120 \
    --max-mamba-cache-size 240 \
    --mem-fraction-static 0.78 \
    --cuda-graph-bs 4 8 16 24 32 48 64 80 96 112 120 \
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
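
Before benchmarking, it can help to confirm that the server started by the command above is answering. The following is a minimal sketch, assuming the server's `/health` and native `/generate` endpoints are reachable at the host and port from the launch command. The helper names, the 600-second timeout, and the prompt are illustrative choices, not values from this guide.

```shell
#!/usr/bin/env bash
# Sketch: wait for the server, then send one small request.
# BASE_URL mirrors --host/--port in the launch command.
BASE_URL="${BASE_URL:-http://127.0.0.1:6688}"

# Build a JSON body for /generate. The prompt is inserted verbatim,
# so it must not contain double quotes or backslashes.
build_payload() {
    local prompt="$1"
    local max_new_tokens="$2"
    printf '{"text": "%s", "sampling_params": {"max_new_tokens": %d, "temperature": 0}}' \
        "$prompt" "$max_new_tokens"
}

# Poll /health every 5 seconds until it answers or the timeout expires.
wait_for_server() {
    local timeout="${1:-600}"
    local waited=0
    until curl -sf "${BASE_URL}/health" > /dev/null; do
        sleep 5
        waited=$((waited + 5))
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
    done
}

# Wait for readiness, then request 16 tokens as a quick sanity check.
smoke_test() {
    wait_for_server 600 || { echo "server not ready" >&2; return 1; }
    curl -s "${BASE_URL}/generate" \
        -H "Content-Type: application/json" \
        -d "$(build_payload "The capital of France is" 16)"
}
```

Run `smoke_test` in a second terminal once the launch command is running; model loading and graph capture can take several minutes, which is why the sketch polls instead of failing on the first refused connection.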

Benchmark

The benchmark uses the random dataset.
Command
python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 120 \
    --random-input-len 30 \
    --random-output-len 1024 \
    --num-prompts 480 \
    --random-range-ratio 1 \
    --request-rate inf
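
In this benchmark `--num-prompts` (480) is four times `--max-concurrency` (120), so each concurrency slot is refilled about four times during the run. A minimal sketch of that sizing, assuming you want to keep the same ratio when trying a different concurrency; `num_prompts_for` and the default of 4 waves are our naming and not an SGLang convention.

```shell
#!/usr/bin/env bash
# num_prompts_for CONCURRENCY [WAVES]
# Returns CONCURRENCY * WAVES. WAVES defaults to 4, the ratio seen in
# the benchmark command above (hypothetical helper).
num_prompts_for() {
    local concurrency="$1"
    local waves="${2:-4}"
    echo $((concurrency * waves))
}

echo "--max-concurrency 120 --num-prompts $(num_prompts_for 120)"
```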

Qwen3.6-35B-A3B 1P IN1080P 30 OUT256 50ms

Model: Qwen3.6-35B-A3B
Hardware: Atlas 800I A3
Cards: 1
Deploy Mode: PD Mixed
Quantization: W8A8 INT8
Dataset: 1080p_30+256
TPOT: 50ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=10
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --tp-size 2 \
    --nnodes 1 \
    --attention-backend ascend \
    --device npu \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 16384 \
    --disable-radix-cache \
    --trust-remote-code \
    --enable-prefill-delayer \
    --max-running-requests 50 \
    --max-mamba-cache-size 55 \
    --mem-fraction-static 0.8 \
    --cuda-graph-bs 2 4 8 12 16 20 24 28 32 36 40 44 48 50 \
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4

Benchmark

The benchmark uses the random dataset.
Command
python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 50 \
    --random-input-len 30 \
    --random-output-len 256 \
    --num-prompts 200 \
    --random-range-ratio 1 \
    --request-rate inf
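
The launch scripts in this guide begin by setting three kernel parameters with `sysctl -w`. The sketch below reads them back so a silent failure (for example, running without root) is noticed before a benchmark run. `check_sysctl`, `check_host_tuning`, and the `PROC_SYS_ROOT` override are hypothetical names introduced here; on some kernels `kernel.sched_migration_cost_ns` may not be exposed under `/proc/sys`, in which case the check reports it as missing.

```shell
#!/usr/bin/env bash
# Sketch: verify that the host tuning from the launch script took effect.
# PROC_SYS_ROOT exists only so the check can be pointed at a test directory.
PROC_SYS_ROOT="${PROC_SYS_ROOT:-/proc/sys}"

# check_sysctl KEY EXPECTED
# Reads KEY (dots mapped to slashes) and compares it with EXPECTED.
check_sysctl() {
    local key="$1"
    local expected="$2"
    local path
    local actual
    path="${PROC_SYS_ROOT}/$(echo "$key" | tr . /)"
    actual="$(cat "$path" 2>/dev/null)" || { echo "MISSING $key"; return 1; }
    if [ "$actual" = "$expected" ]; then
        echo "OK $key=$actual"
    else
        echo "MISMATCH $key=$actual (expected $expected)"
        return 1
    fi
}

# Check the three values written by the launch scripts in this guide.
check_host_tuning() {
    local rc=0
    check_sysctl vm.swappiness 0 || rc=1
    check_sysctl kernel.numa_balancing 0 || rc=1
    check_sysctl kernel.sched_migration_cost_ns 50000 || rc=1
    return "$rc"
}
```

Call `check_host_tuning` after the `sysctl -w` lines; a non-zero return code means at least one value differs from what the script tried to set.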

Qwen3.6-35B-A3B 1P IN128K OUT1K 50ms

Model: Qwen3.6-35B-A3B
Hardware: Atlas 800I A3
Cards: 1
Deploy Mode: PD Mixed
Quantization: W8A8 INT8
Dataset: 128K+1K
TPOT: 50ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export GDN_ATTN_BACKEND_TRITON=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=1600
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=20
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --tp-size 2 \
    --nnodes 1 \
    --attention-backend ascend \
    --device npu \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 128000 \
    --disable-radix-cache \
    --trust-remote-code \
    --enable-prefill-delayer \
    --max-running-requests 3 \
    --max-mamba-cache-size 10 \
    --mem-fraction-static 0.63 \
    --cuda-graph-bs 1 2 3 \
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4

Benchmark

The benchmark uses the random dataset.
Command
python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 3 \
    --random-input-len 128000 \
    --random-output-len 1000 \
    --num-prompts 12 \
    --random-range-ratio 1
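
In the launch commands in this guide, the largest `--cuda-graph-bs` value is at least as large as `--max-running-requests` (3 and 3 in the configuration above). The sketch below checks that relation before launching. It encodes a pattern observed in these configurations, not a documented SGLang requirement, and `graph_bs_covers` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# graph_bs_covers MAX_RUNNING BS...
# Succeeds when the largest graph batch size is >= MAX_RUNNING.
graph_bs_covers() {
    local max_running="$1"
    shift
    local largest=0
    local bs
    for bs in "$@"; do
        if [ "$bs" -gt "$largest" ]; then
            largest="$bs"
        fi
    done
    [ "$largest" -ge "$max_running" ]
}

# Values from the 128K+1K configuration above.
if graph_bs_covers 3 1 2 3; then
    echo "graph batch sizes cover --max-running-requests 3"
fi
```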

Qwen3.6-35B-A3B 1P IN128K OUT1K PREFIX90 50ms

Model: Qwen3.6-35B-A3B
Hardware: Atlas 800I A3
Cards: 1
Deploy Mode: PD Mixed
Quantization: W8A8 INT8
Dataset: 128K+1K (90% prefix cache hit rate)
TPOT: 50ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=30
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --tp-size 2 \
    --nnodes 1 \
    --attention-backend ascend \
    --device npu \
    --chunked-prefill-size 16384 \
    --max-prefill-tokens 65536 \
    --trust-remote-code \
    --enable-prefill-delayer \
    --mamba-scheduler-strategy extra_buffer \
    --max-running-requests 103 \
    --max-mamba-cache-size 85 \
    --mem-fraction-static 0.85 \
    --cuda-graph-bs 2 4 8 16 32 48 64 80 96 103 \
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4

Benchmark

The benchmark uses the generated-shared-prefix dataset with a 90% prefix cache hit rate (repeat_rate = 0.9):
--gsp-system-prompt-len 57600 = int(64000 * 0.9) is the shared prefix portion.
--gsp-question-len 6399 = int(64000 * (1 - 0.9)) is the unique per-request suffix.
--gsp-num-groups 1 keeps all requests in one prefix group for maximum cache reuse.
Command
python -m sglang.bench_serving \
    --dataset-name generated-shared-prefix \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --gsp-num-groups 1 \
    --gsp-prompts-per-group 412 \
    --gsp-system-prompt-len 57600 \
    --gsp-question-len 6399 \
    --gsp-output-len 1000 \
    --max-concurrency 103 \
    --num-prompts 412 \
    --request-rate inf
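
The two `--gsp-*-len` values in this benchmark come from truncating a floating-point product. The sketch below reproduces that split with `awk`, which uses double-precision arithmetic like the `int(...)` expressions in the benchmark note. `gsp_lengths` is a hypothetical helper name. Note that `64000 * (1 - 0.9)` evaluates to slightly less than 6400 in floating point, which is why truncation gives 6399.

```shell
#!/usr/bin/env bash
# gsp_lengths TOTAL_INPUT_LEN REPEAT_RATE
# Prints "<system_prompt_len> <question_len>", each truncated toward zero.
gsp_lengths() {
    awk -v total="$1" -v rate="$2" 'BEGIN {
        printf "%d %d\n", int(total * rate), int(total * (1 - rate))
    }'
}

# Split used by the benchmark command above.
read -r system_len question_len <<EOF
$(gsp_lengths 64000 0.9)
EOF
echo "--gsp-system-prompt-len ${system_len} --gsp-question-len ${question_len}"
```

To target a different hit rate, change the second argument and pass the two printed values to `--gsp-system-prompt-len` and `--gsp-question-len`.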

Qwen3.6-35B-A3B 1P IN254K OUT1K

Model: Qwen3.6-35B-A3B
Hardware: Atlas 800I A3
Cards: 1
Deploy Mode: PD Mixed
Quantization: W8A8 INT8
Dataset: 254K+1K
TPOT: 16.1ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --tp-size 2 \
    --nnodes 1 \
    --attention-backend ascend \
    --device npu \
    --chunked-prefill-size 131072 \
    --max-prefill-tokens 254000 \
    --disable-radix-cache \
    --trust-remote-code \
    --max-running-requests 1 \
    --max-mamba-cache-size 6 \
    --mem-fraction-static 0.65 \
    --cuda-graph-bs 1 \
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4

Benchmark

The benchmark uses the random dataset.
Command
python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 1 \
    --random-input-len 254000 \
    --random-output-len 1000 \
    --num-prompts 1 \
    --random-range-ratio 1
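
Each launch script leaves `HCCL_SOCKET_IFNAME` and `GLOO_SOCKET_IFNAME` as `<network-interface>` placeholders. One way to find a candidate value is to take the interface of the default route. The sketch below parses `ip route show default` output for that. It assumes the iproute2 `ip` tool is installed and that the default-route interface is the one you want for HCCL and Gloo traffic, which may not hold on hosts with a dedicated interconnect. `default_iface` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# default_iface: read `ip route show default` output on stdin and print
# the device name that follows "dev" on the first default route.
default_iface() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++) {
            if ($i == "dev") { print $(i + 1); exit }
        }
    }'
}

# On the target host:
#   IFACE="$(ip route show default | default_iface)"
#   export HCCL_SOCKET_IFNAME="$IFACE" GLOO_SOCKET_IFNAME="$IFACE"

# Demonstration on a sample line (192.0.2.1 is a documentation address).
echo "default via 192.0.2.1 dev eth0 proto dhcp metric 100" | default_iface
```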

Qwen3.6-35B-A3B 1P IN3K5 OUT1K5 50ms

Model: Qwen3.6-35B-A3B
Hardware: Atlas 800I A3
Cards: 1
Deploy Mode: PD Mixed
Quantization: W8A8 INT8
Dataset: 3.5K+1.5K
TPOT: 50ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=1
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --tp-size 2 \
    --nnodes 1 \
    --attention-backend ascend \
    --device npu \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 43400 \
    --disable-radix-cache \
    --trust-remote-code \
    --enable-prefill-delayer \
    --prefill-delayer-max-delay-passes 50 \
    --max-running-requests 124 \
    --max-mamba-cache-size 124 \
    --mem-fraction-static 0.8 \
    --cuda-graph-bs 4 16 32 64 96 112 116 120 124 \
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4

Benchmark

The benchmark uses the random dataset.
Command
python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 124 \
    --random-input-len 3500 \
    --random-output-len 1500 \
    --num-prompts 496 \
    --random-range-ratio 1

Qwen3.6-35B-A3B 1P IN64K OUT1K 50ms

Model: Qwen3.6-35B-A3B
Hardware: Atlas 800I A3
Cards: 1
Deploy Mode: PD Mixed
Quantization: W8A8 INT8
Dataset: 64K+1K
TPOT: 50ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=1
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --tp-size 2 \
    --nnodes 1 \
    --attention-backend ascend \
    --device npu \
    --chunked-prefill-size -1 \
    --max-total-tokens 600000 \
    --max-prefill-tokens 65536 \
    --disable-radix-cache \
    --trust-remote-code \
    --enable-prefill-delayer \
    --max-running-requests 10 \
    --max-mamba-cache-size 20 \
    --mem-fraction-static 0.65 \
    --cuda-graph-bs 2 4 8 12 14 16 \
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4

Benchmark

The benchmark uses the random dataset.
Command
python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 10 \
    --random-input-len 64000 \
    --random-output-len 1000 \
    --num-prompts 40 \
    --random-range-ratio 1
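
Every launch command in this guide pairs `--speculative-num-steps 3` and `--speculative-eagle-topk 1` with `--speculative-num-draft-tokens 4`, that is, one draft token per step plus one. The sketch below encodes that pattern as a pre-launch check. That the draft-token count should equal steps + 1 when top-k is 1 is our reading of these configurations, not something this guide states, and `check_spec_args` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# check_spec_args STEPS TOPK DRAFT_TOKENS
# With TOPK = 1 the draft is a single chain, so this check expects
# DRAFT_TOKENS = STEPS + 1 (assumption drawn from the configs above).
check_spec_args() {
    local steps="$1"
    local topk="$2"
    local draft_tokens="$3"
    if [ "$topk" -eq 1 ] && [ "$draft_tokens" -ne $((steps + 1)) ]; then
        echo "expected --speculative-num-draft-tokens $((steps + 1)), got $draft_tokens"
        return 1
    fi
    echo "ok"
}

# Values used throughout this guide.
check_spec_args 3 1 4
```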

Qwen3.6-35B-A3B 1P IN64K OUT1K PREFIX90 50ms

Model: Qwen3.6-35B-A3B
Hardware: Atlas 800I A3
Cards: 1
Deploy Mode: PD Mixed
Quantization: W8A8 INT8
Dataset: 64K+1K (90% prefix cache hit rate)
TPOT: 50ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export GDN_ATTN_BACKEND_TRITON=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=300
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --tp-size 2 \
    --nnodes 1 \
    --attention-backend ascend \
    --device npu \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 65536 \
    --trust-remote-code \
    --enable-prefill-delayer \
    --mamba-scheduler-strategy extra_buffer \
    --max-running-requests 42 \
    --max-mamba-cache-size 210 \
    --mem-fraction-static 0.71 \
    --cuda-graph-bs 2 8 16 24 32 36 40 42 \
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4

Benchmark

The benchmark uses the generated-shared-prefix dataset with a 90% prefix cache hit rate (repeat_rate = 0.9):
--gsp-system-prompt-len 58982 = int(65536 * 0.9) is the shared prefix portion.
--gsp-question-len 6553 = int(65536 * (1 - 0.9)) is the unique per-request suffix.
--gsp-num-groups 1 keeps all requests in one prefix group for maximum cache reuse.
Command
python -m sglang.bench_serving \
    --dataset-name generated-shared-prefix \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --gsp-num-groups 1 \
    --gsp-prompts-per-group 42 \
    --gsp-system-prompt-len 58982 \
    --gsp-question-len 6553 \
    --gsp-output-len 1024 \
    --max-concurrency 42 \
    --num-prompts 42 \
    --request-rate inf

Qwen3.6-35B-A3B 2P IN984K OUT1K

Model: Qwen3.6-35B-A3B
Hardware: Atlas 800I A3
Cards: 2
Deploy Mode: PD Mixed
Quantization: W8A8 INT8
Dataset: 984K+1K
TPOT: 40.91ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_USE_FIA=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --tp-size 4 \
    --nnodes 1 \
    --attention-backend ascend \
    --device npu \
    --chunked-prefill-size 131072 \
    --max-prefill-tokens 984000 \
    --disable-radix-cache \
    --trust-remote-code \
    --max-running-requests 1 \
    --max-mamba-cache-size 6 \
    --mem-fraction-static 0.68 \
    --cuda-graph-bs 1 \
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 \
    --mamba-ssm-dtype bfloat16 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --context-length 1010000

Benchmark

The benchmark uses the random dataset.
Command
python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 1 \
    --random-input-len 984000 \
    --random-output-len 1000 \
    --num-prompts 1 \
    --random-range-ratio 1
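
The launch command for this configuration sets `--context-length 1010000`, which leaves room for the 984000-token input plus the 1000-token output used in the benchmark. A minimal sketch of that budget check follows. `fits_context` is a hypothetical helper, and it ignores any special or template tokens a real request may add.

```shell
#!/usr/bin/env bash
# fits_context INPUT_LEN OUTPUT_LEN CONTEXT_LEN
# Succeeds when input plus output tokens fit in the context window.
fits_context() {
    local input_len="$1"
    local output_len="$2"
    local context_len="$3"
    [ $((input_len + output_len)) -le "$context_len" ]
}

# Values from the launch and benchmark commands above.
if fits_context 984000 1000 1010000; then
    echo "headroom: $((1010000 - 984000 - 1000)) tokens"
fi
```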