Qwen3 Examples

If you need to download model weights, check the model size at ModelScope to reserve enough space.

Running Qwen3

Running Qwen3-32B on 1 x Atlas 800I A3.

Model weights could be found here

Launch Server

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1536
export HCCL_OP_EXPANSION_MODE=AIV

python -m sglang.launch_server \
   --device npu \
   --attention-backend ascend \
   --trust-remote-code \
   --tp-size 4 \
   --model-path Qwen/Qwen3-32B \
   --mem-fraction-static 0.8

Running Qwen3-32B on 1 x Atlas 800I A3 with Qwen3-32B-Eagle3.

Model weights could be found here Speculative model weights could be found here

Launch Server with Eagle3

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server \
   --device npu \
   --attention-backend ascend \
   --trust-remote-code \
   --tp-size 4 \
   --model-path Qwen/Qwen3-32B \
   --mem-fraction-static 0.8 \
   --speculative-algorithm EAGLE3 \
   --speculative-draft-model-path Qwen/Qwen3-32B-Eagle3 \
   --speculative-num-steps 1 \
   --speculative-eagle-topk 1 \
   --speculative-num-draft-tokens 2

Running Qwen3-30B-A3B MOE on 1 x Atlas 800I A3.

Model weights could be found here

Launch Server

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1536
export HCCL_OP_EXPANSION_MODE=AIV
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32

python -m sglang.launch_server \
   --device npu \
   --attention-backend ascend \
   --trust-remote-code \
   --tp-size 4 \
   --model-path Qwen/Qwen3-30B-A3B \
   --mem-fraction-static 0.8

Running Qwen3-235B-A22B-Instruct-2507 MOE on 1 x Atlas 800I A3.

Model weights could be found here

Launch Server

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1536
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32

python -m sglang.launch_server \
   --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \
   --tp-size 16 \
   --trust-remote-code \
   --attention-backend ascend \
   --device npu \
   --watchdog-timeout 9000 \
   --mem-fraction-static 0.8

Running Qwen3-235B-A22B-Instruct-2507 with 256K long sequence on 2 x Atlas 800I A3 without CP

This example uses PD disaggregation for long-sequence inference and keeps context parallel disabled. Set the shared environment variables on both nodes first:

Command

export ASCEND_USE_FIA=1
export SGLANG_SET_CPU_AFFINITY=1
export ASCEND_MF_STORE_URL="tcp://<PREFILL_HOST_IP>:12345"
export HCCL_SOCKET_IFNAME=<NETWORK_IFACE>
export GLOO_SOCKET_IFNAME=<NETWORK_IFACE>

MODEL_PATH=/root/.cache/modelscope/hub/models/zcgy26/Qwen3-235B-A22B-Instruct-2507-w8a8

Prefill node:

Command

export ASCEND_LAUNCH_BLOCKING=1
export HCCL_BUFFSIZE=1500
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024
export DEEPEP_NORMAL_LONG_SEQ_ROUND=128
export DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1

python3 -m sglang.launch_server \
   --model-path ${MODEL_PATH} \
   --disaggregation-mode prefill \
   --disaggregation-transfer-backend ascend \
   --disaggregation-bootstrap-port 8995 \
   --attention-backend ascend \
   --disable-radix-cache \
   --chunked-prefill-size -1 \
   --skip-server-warmup \
   --device npu \
   --tp-size 16 \
   --mem-fraction-static 0.45 \
   --max-running-requests 1 \
   --host <PREFILL_HOST_IP> \
   --port 8000 \
   --dist-init-addr <PREFILL_HOST_IP>:5000 \
   --nnodes 1 \
   --node-rank 0 \
   --moe-a2a-backend deepep \
   --deepep-mode normal

Decode node:

Command

export HCCL_BUFFSIZE=4000
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=4096
export DEEPEP_NORMAL_LONG_SEQ_ROUND=16

python3 -m sglang.launch_server \
   --model-path ${MODEL_PATH} \
   --disaggregation-mode decode \
   --disaggregation-transfer-backend ascend \
   --attention-backend ascend \
   --mem-fraction-static 0.8 \
   --disable-cuda-graph \
   --device npu \
   --disable-radix-cache \
   --chunked-prefill-size 8192 \
   --skip-server-warmup \
   --tp-size 16 \
   --max-running-requests 1 \
   --host <DECODE_HOST_IP> \
   --port 8232 \
   --moe-a2a-backend deepep \
   --deepep-mode low_latency \
   --disable-overlap-schedule

Router:

Command

python3 -m sglang_router.launch_router \
   --pd-disaggregation \
   --policy cache_aware \
   --prefill http://<PREFILL_HOST_IP>:8000 8995 \
   --decode http://<DECODE_HOST_IP>:8232 \
   --host <ROUTER_HOST_IP> \
   --port 6689 \
   --prometheus-port 29010

Running Qwen3-235B-A22B-Instruct-2507-W8A8 with Prefill Context Parallel (CP) on 2 x Atlas 800I A3

This example enables Prefill Context Parallel (--enable-prefill-context-parallel) to split the context across CP ranks during prefill, reducing per-device memory pressure and improving TTFT for long sequences. PD disaggregation is required.

Constraints

Prefill side must set --max-running-requests 1 (PCP only supports batch_size=1)

--attn-cp-size must evenly divide --tp-size; each CP rank occupies tp_size / cp_size NPUs

Prefill node <PREFILL_HOST_IP>:

Launch Server

export SGLANG_SET_CPU_AFFINITY=1
export ASCEND_MF_STORE_URL="tcp://<PREFILL_HOST_IP>:23456"
export ASCEND_USE_FIA=True

python3 -m sglang.launch_server \
  --model-path /mnt/share/weights/Qwen3-235B-A22B-Instruct-2507-W8A8 \
  --trust-remote-code \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend ascend \
  --disaggregation-bootstrap-port 8995 \
  --quantization modelslim \
  --attention-backend ascend \
  --skip-server-warmup \
  --mem-fraction-static 0.7 \
  --chunked-prefill-size 32768 \
  --device npu \
  --base-gpu-id 0 \
  --tp-size 16 \
  --enable-prefill-context-parallel \
  --attn-cp-size 2 \
  --moe-dp-size 2 \
  --max-running-requests 1 \
  --host <PREFILL_HOST_IP> \
  --port 8000 \
  --nnodes 1 \
  --node-rank 0 \
  --dist-init-addr <PREFILL_HOST_IP>:6688

Key parameters for PCP:

Parameter	Value	Description
`--enable-prefill-context-parallel`	flag	Enable PCP feature
`--attn-cp-size`	2	Split context across 2 CP ranks (each rank handles half the sequence)
`--moe-dp-size`	2	MoE DP size, should match `--attn-cp-size`
`--max-running-requests`	1	Required by PCP (batch_size=1 constraint)

Decode node (<DECODE_HOST_IP>):

Launch Server

export ASCEND_MF_STORE_URL="tcp://141.61.39.231:23456"
export ASCEND_USE_FIA=True

python3 -m sglang.launch_server \
  --model-path /mnt/share/weights/Qwen3-235B-A22B-Instruct-2507-W8A8 \
  --trust-remote-code \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend ascend \
  --quantization modelslim \
  --attention-backend ascend \
  --disable-radix-cache \
  --disable-cuda-graph \
  --mem-fraction-static 0.7 \
  --chunked-prefill-size 32768 \
  --skip-server-warmup \
  --device npu \
  --base-gpu-id 0 \
  --tp-size 8 \
  --max-running-requests 32 \
  --host <DECODE_HOST_IP> \
  --port 8001 \
  --nnodes 1 \
  --node-rank 0 \
  --dist-init-addr <DECODE_HOST_IP>:6688

Note: ASCEND_MF_STORE_URL on both nodes must point to the same KV store (typically the Prefill node IP). ASCEND_USE_FIA=True enables fast interconnect aggregation for KV transfer. PCP is a Prefill-only feature; the Decode side needs no CP-related flags.

Running Qwen3-VL-8B-Instruct on 1 x Atlas 800I A3.

Model weights could be found here

Launch Server

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1536
export HCCL_OP_EXPANSION_MODE=AIV

python -m sglang.launch_server \
   --enable-multimodal \
   --attention-backend ascend \
   --mm-attention-backend ascend_attn \
   --trust-remote-code \
   --tp-size 4 \
   --model-path Qwen/Qwen3-VL-8B-Instruct \
   --mem-fraction-static 0.8

Testing the Service

Once the server prints The server is fired up and ready to roll! in the logs, it is ready to accept requests. For testing examples (Health Check, Generate, Chat Completions, and port usage guidance), see Testing the Service.

​Qwen3 examples

​Running Qwen3

​Running Qwen3-32B on 1 x Atlas 800I A3.

​Running Qwen3-32B on 1 x Atlas 800I A3 with Qwen3-32B-Eagle3.

​Running Qwen3-30B-A3B MOE on 1 x Atlas 800I A3.

​Running Qwen3-235B-A22B-Instruct-2507 MOE on 1 x Atlas 800I A3.

​Running Qwen3-235B-A22B-Instruct-2507 with 256K long sequence on 2 x Atlas 800I A3 without CP

​Running Qwen3-235B-A22B-Instruct-2507-W8A8 with Prefill Context Parallel (CP) on 2 x Atlas 800I A3

​Running Qwen3-VL-8B-Instruct on 1 x Atlas 800I A3.

​Testing the Service

Qwen3 examples

Running Qwen3

Running Qwen3-32B on 1 x Atlas 800I A3.

Running Qwen3-32B on 1 x Atlas 800I A3 with Qwen3-32B-Eagle3.

Running Qwen3-30B-A3B MOE on 1 x Atlas 800I A3.

Running Qwen3-235B-A22B-Instruct-2507 MOE on 1 x Atlas 800I A3.

Running Qwen3-235B-A22B-Instruct-2507 with 256K long sequence on 2 x Atlas 800I A3 without CP

Running Qwen3-235B-A22B-Instruct-2507-W8A8 with Prefill Context Parallel (CP) on 2 x Atlas 800I A3

Running Qwen3-VL-8B-Instruct on 1 x Atlas 800I A3.

Testing the Service