# Qwen3-8B

This guide collects best-practice deployment configurations for Qwen3-8B on Ascend NPUs, together with the time per output token (TPOT) reported for each.

### Low Latency

| Model    | Hardware      | Cards | Deploy Mode | Dataset   | TPOT    | Quantization | Configuration                                               |
| -------- | ------------- | ----- | ----------- | --------- | ------- | ------------ | ----------------------------------------------------------- |
| Qwen3-8B | Atlas 800I A3 | 1     | PD Mixed    | 3.5K+1.5K | 5ms     | W8A8 INT8    | [Optimal Configuration](#qwen3-8b-w8a8-1p-in3k5-out1k5-5ms) |
| Qwen3-8B | Atlas 800I A3 | 1     | PD Mixed    | 6K+1.5K   | 11.79ms | W8A8 INT8    | [Optimal Configuration](#qwen3-8b-w8a8-1p-in6k-out1k5-bs16) |

### High Throughput

| Model    | Hardware      | Cards | Deploy Mode | Dataset   | TPOT | Quantization | Configuration                                                |
| -------- | ------------- | ----- | ----------- | --------- | ---- | ------------ | ------------------------------------------------------------ |
| Qwen3-8B | Atlas 800I A3 | 1     | PD Mixed    | 3.5K+1.5K | 37ms | W8A8 INT8    | [Optimal Configuration](#qwen3-8b-w8a8-1p-in3k5-out1k5-37ms) |
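
The TPOT figures in the tables above can be converted into a rough decode-throughput estimate. The sketch below is only an approximation: it assumes every concurrent request decodes at the listed TPOT and ignores prefill time and scheduling gaps, so the result is closer to an upper bound than to a measured value. `estimate_decode_tps` is a hypothetical helper name, and the concurrency values are the `--max-concurrency` settings used by the benchmark commands on this page.

```bash
#!/usr/bin/env bash
# Rough decode-throughput estimate from a TPOT figure (assumption: all
# concurrent requests decode at the same TPOT, prefill is ignored).
#   aggregate tokens/s = concurrency * 1000 / TPOT_ms
estimate_decode_tps() {
    local tpot_ms="$1" concurrency="$2"
    awk -v t="$tpot_ms" -v c="$concurrency" \
        'BEGIN { printf "%.1f\n", c * 1000 / t }'
}

estimate_decode_tps 37 64      # high-throughput row, concurrency 64
estimate_decode_tps 5 1        # low-latency row, concurrency 1
estimate_decode_tps 11.79 16   # low-latency row, concurrency 16
```

Comparing the three results shows the trade-off the two tables describe: the low-latency rows give each request a faster token stream, while the high-throughput row serves more tokens per second in aggregate.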

## Optimal Configuration

### Qwen3-8B W8A8 1P IN3K5 OUT1K5 37ms

**Model**: Qwen3-8B

**Hardware**: Atlas 800I A3

**Cards**: 1

**Deploy Mode**: PD Mixed

**Quantization**: W8A8 INT8

**Dataset**: 3.5K+1.5K

**TPOT**: 37ms

#### Model Deployment

```bash Command theme={null}
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   DRAFT_MODEL_PATH: path to the draft model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=50
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --trust-remote-code \
    --nnodes 1 \
    --node-rank 0 \
    --attention-backend ascend \
    --device npu \
    --quantization modelslim \
    --max-running-requests 70 \
    --max-prefill-tokens 16384 \
    --disable-radix-cache \
    --chunked-prefill-size 16384 \
    --tp-size 1 \
    --mem-fraction-static 0.85 \
    --cuda-graph-bs 8 12 24 36 48 51 55 60 63 64 66 68 70 \
    --dtype bfloat16 \
    --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path $DRAFT_MODEL_PATH \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
```
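
Model loading and graph capture can take a while, so it may help to wait until the server answers before starting a benchmark. The following is a minimal sketch, assuming `curl` is installed and that the server exposes a `/health` route on the host and port passed to `sglang.launch_server`; `wait_for_server` and its default retry values are illustrative choices, not part of SGLang.

```bash
#!/usr/bin/env bash
# Poll the health route until it responds or the attempts run out.
# Arguments (all optional): host, port, attempts, delay in seconds.
wait_for_server() {
    local host="${1:-127.0.0.1}" port="${2:-6688}"
    local attempts="${3:-60}" delay="${4:-5}"
    local i
    for ((i = 1; i <= attempts; i++)); do
        # -f makes curl fail on HTTP errors; --max-time bounds each probe.
        if curl -sf --max-time 5 "http://${host}:${port}/health" > /dev/null 2>&1; then
            echo "server ready after ${i} attempt(s)"
            return 0
        fi
        # Sleep between probes, but not after the final one.
        if (( i < attempts )); then
            sleep "$delay"
        fi
    done
    echo "server not ready after ${attempts} attempt(s)" >&2
    return 1
}

# Example (not executed here): wait up to about 5 minutes.
# wait_for_server 127.0.0.1 6688 60 5
```

Chaining it as `wait_for_server && python3 -m sglang.bench_serving ...` keeps the benchmark from starting against a server that is still initializing.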

#### Benchmark

We benchmarked this configuration with the `random` dataset of `sglang.bench_serving`.

```shell Command theme={null}
python3 -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 64 \
    --random-input-len 3500 \
    --random-output-len 1500 \
    --num-prompts 256 \
    --random-range-ratio 1
```

### Qwen3-8B W8A8 1P IN3K5 OUT1K5 5ms

**Model**: Qwen3-8B

**Hardware**: Atlas 800I A3

**Cards**: 1

**Deploy Mode**: PD Mixed

**Quantization**: W8A8 INT8

**Dataset**: 3.5K+1.5K

**TPOT**: 5ms

#### Model Deployment

```bash Command theme={null}
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   DRAFT_MODEL_PATH: path to the draft model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --trust-remote-code \
    --nnodes 1 \
    --node-rank 0 \
    --attention-backend ascend \
    --device npu \
    --quantization modelslim \
    --max-running-requests 1 \
    --max-prefill-tokens 16384 \
    --disable-radix-cache \
    --chunked-prefill-size -1 \
    --tp-size 2 \
    --mem-fraction-static 0.894 \
    --cuda-graph-bs 1 \
    --dtype bfloat16 \
    --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path $DRAFT_MODEL_PATH \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 5
```
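
Every launch command on this page sets `--speculative-eagle-topk 1` and a `--speculative-num-draft-tokens` value equal to `--speculative-num-steps` plus one. When adapting these values, a small pre-flight check can catch a mismatch before the server starts. The sketch below encodes only that observed relation; `check_spec_args` is a hypothetical helper, and treating the relation as required is an assumption rather than a documented rule.

```bash
#!/usr/bin/env bash
# Verify that a top-k = 1 draft uses num_draft_tokens = num_steps + 1,
# the pattern followed by the launch commands on this page.
# Other top-k values are not checked and are reported as ok.
check_spec_args() {
    local steps="$1" topk="$2" draft_tokens="$3"
    if [ "$topk" -eq 1 ] && [ "$draft_tokens" -ne $((steps + 1)) ]; then
        echo "mismatch: expected $((steps + 1)) draft tokens for ${steps} steps, got ${draft_tokens}" >&2
        return 1
    fi
    echo "ok: steps=${steps} topk=${topk} draft_tokens=${draft_tokens}"
}

check_spec_args 4 1 5   # values from the command above
```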

#### Benchmark

We benchmarked this configuration with the `random` dataset of `sglang.bench_serving`.

```shell Command theme={null}
python3 -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 1 \
    --random-input-len 3500 \
    --random-output-len 1500 \
    --num-prompts 4 \
    --random-range-ratio 1
```
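
To get a feel for how long each request in this run spends decoding, the TPOT can be multiplied by the number of generated tokens. The sketch below assumes TPOT is averaged over every output token after the first, so the decode phase covers `output_len - 1` tokens; prefill and time to first token are not included. `decode_seconds` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Estimated decode time per request, in seconds.
# Assumption: TPOT covers all output tokens except the first one.
#   decode_seconds = (output_len - 1) * TPOT_ms / 1000
decode_seconds() {
    local tpot_ms="$1" output_len="$2"
    awk -v t="$tpot_ms" -v n="$output_len" \
        'BEGIN { printf "%.3f\n", (n - 1) * t / 1000 }'
}

decode_seconds 5 1500   # TPOT and --random-output-len from this section
```

With `--max-concurrency 1`, the four prompts run one after another, so the total decode time is roughly four times this per-request figure.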

### Qwen3-8B W8A8 1P IN6K OUT1K5 BS16

**Model**: Qwen3-8B

**Hardware**: Atlas 800I A3

**Cards**: 1

**Deploy Mode**: PD Mixed

**Quantization**: W8A8 INT8

**Dataset**: 6K+1.5K

**TPOT**: 11.79ms

#### Model Deployment

```bash Command theme={null}
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   DRAFT_MODEL_PATH: path to the draft model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --trust-remote-code \
    --nnodes 1 \
    --node-rank 0 \
    --attention-backend ascend \
    --device npu \
    --quantization modelslim \
    --max-running-requests 16 \
    --max-prefill-tokens 16384 \
    --disable-radix-cache \
    --chunked-prefill-size -1 \
    --tp-size 2 \
    --mem-fraction-static 0.894 \
    --cuda-graph-bs 1 5 15 16 \
    --dtype bfloat16 \
    --speculative-draft-model-quantization unquant \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path $DRAFT_MODEL_PATH \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 5
```
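
In each launch command on this page, the largest value passed to `--cuda-graph-bs` equals `--max-running-requests`. A plausible reading, stated here as an assumption, is that the captured graph batch sizes are meant to cover the largest batch the scheduler can form. The sketch below checks that condition for a candidate set of values; `graph_bs_covers` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Succeed when the largest graph batch size is at least max_running.
# Usage: graph_bs_covers <max_running_requests> <bs> [<bs> ...]
graph_bs_covers() {
    local max_running="$1"
    shift
    local largest=0 bs
    for bs in "$@"; do
        if (( bs > largest )); then
            largest="$bs"
        fi
    done
    if (( largest >= max_running )); then
        echo "ok: largest graph batch size ${largest} covers ${max_running} running requests"
    else
        echo "gap: largest graph batch size ${largest} < ${max_running} running requests" >&2
        return 1
    fi
}

graph_bs_covers 16 1 5 15 16   # values from the command above
```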

#### Benchmark

We benchmarked this configuration with the `random` dataset of `sglang.bench_serving`.

```shell Command theme={null}
python3 -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 16 \
    --random-input-len 6144 \
    --random-output-len 1500 \
    --num-prompts 16 \
    --random-range-ratio 1
```
