# Quantization on Ascend

To load an already quantized model, point SGLang at the checkpoint; the weights and config are loaded as usual. If the model was quantized offline, you do not need to pass the `--quantization` argument when starting the engine: the quantization method is parsed automatically from the checkpoint's `quant_model_description.json` or `config.json`.
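
To check which of the two files a local checkpoint would be detected from, a small helper like the one below can be used. It is a minimal sketch, not SGLang's actual detection code: the function name `find_quant_config` and the lookup order are assumptions, and the `quantization_config` key follows the common Hugging Face convention for `config.json`.

```python
import json
import tempfile
from pathlib import Path


def find_quant_config(checkpoint_dir):
    """Return (file_name, parsed_config) for the first quantization config found.

    Hypothetical helper: the lookup order is an assumption, not SGLang's code.
    """
    root = Path(checkpoint_dir)

    # ModelSlim checkpoints ship a dedicated description file.
    description = root / "quant_model_description.json"
    if description.is_file():
        return description.name, json.loads(description.read_text())

    # Other toolchains usually embed a `quantization_config` block in config.json.
    config = root / "config.json"
    if config.is_file():
        data = json.loads(config.read_text())
        if "quantization_config" in data:
            return config.name, data["quantization_config"]

    # Nothing found: treat the checkpoint as unquantized.
    return None, None


# Demo on a throwaway directory that mimics an AWQ-style checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text(
        json.dumps({"quantization_config": {"quant_method": "awq", "bits": 4}})
    )
    source, quant_config = find_quant_config(tmp)

print(source, quant_config)
```

If the helper returns `(None, None)`, the checkpoint carries no quantization metadata in either file.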

SGLang supports **mix-bits** quantization: each layer is defined and loaded independently, according to the quantization type specified for it in `quant_model_description.json`. [Advanced mix-bits for MoE](https://github.com/sgl-project/sglang/pull/17361) is in progress and will add independent quantization selection for the w13 (up-gate) and w2 (down) layers.
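
The per-layer lookup can be illustrated with a short sketch. It assumes the description file is a flat mapping from weight name to a quantization type string; the layer names, the `FLOAT` label for unquantized weights, and the helper names are illustrative assumptions rather than SGLang's API.

```python
from collections import Counter

# Hypothetical excerpt of quant_model_description.json: one entry per weight tensor.
description = {
    "model.layers.0.self_attn.q_proj.weight": "W8A8",
    "model.layers.0.mlp.down_proj.weight": "W8A8_DYNAMIC",
    "model.layers.1.mlp.down_proj.weight": "W4A4_DYNAMIC",
    "lm_head.weight": "FLOAT",  # assumed label for an unquantized layer
}


def quant_type_for(layer_prefix, description, default="FLOAT"):
    """Look up the quantization type of one layer by its weight name."""
    return description.get(f"{layer_prefix}.weight", default)


def summarize(description):
    """Count how many weights use each quantization type."""
    return Counter(description.values())


print(quant_type_for("model.layers.1.mlp.down_proj", description))
print(dict(summarize(description)))
```

Because every layer resolves its own type, a single checkpoint can mix several schemes without any extra launch flags.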

[ModelSlim on Ascend support](https://github.com/sgl-project/sglang/pull/14504):

<table>
  <thead>
    <tr>
      <th>Quantization scheme</th>
      <th>Layer type</th>
      <th>A2 Supported</th>
      <th>A3 Supported</th>
      <th>A5 Supported</th>
      <th>Diffusion models</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>W4A4 dynamic</td>
      <td>Linear</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
    </tr>

    <tr>
      <td>W8A8 static</td>
      <td>Linear</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
    </tr>

    <tr>
      <td>W8A8 dynamic</td>
      <td>Linear</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
    </tr>

    <tr>
      <td><a href="https://github.com/sgl-project/sglang/pull/20922">MXFP8</a></td>
      <td>Linear</td>
      <td><strong style={{color: 'red'}}>x</strong></td>
      <td><strong style={{color: 'red'}}>x</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
    </tr>

    <tr>
      <td><a href="https://github.com/sgl-project/sglang/pull/22338">MXFP4</a></td>
      <td>Linear</td>
      <td><strong style={{color: 'red'}}>x</strong></td>
      <td><strong style={{color: 'red'}}>x</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
    </tr>

    <tr>
      <td>W4A4 dynamic</td>
      <td>MoE</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
      <td><strong style={{color: 'red'}}>x</strong></td>
    </tr>

    <tr>
      <td>W4A8 dynamic</td>
      <td>MoE</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
      <td><strong style={{color: 'red'}}>x</strong></td>
    </tr>

    <tr>
      <td>W8A8 dynamic</td>
      <td>MoE</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
      <td><strong style={{color: 'red'}}>x</strong></td>
    </tr>

    <tr>
      <td><a href="https://github.com/sgl-project/sglang/pull/20922">MXFP8</a></td>
      <td>MoE</td>
      <td><strong style={{color: 'red'}}>x</strong></td>
      <td><strong style={{color: 'red'}}>x</strong></td>
      <td><strong style={{color: 'blue'}}>WIP</strong></td>
      <td><strong style={{color: 'red'}}>x</strong></td>
    </tr>
  </tbody>
</table>

[AWQ on Ascend support](https://github.com/sgl-project/sglang/pull/10158):

<table>
  <thead>
    <tr>
      <th>Quantization scheme</th>
      <th>Layer type</th>
      <th>A2 Supported</th>
      <th>A3 Supported</th>
      <th>A5 Supported</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>W4A16</td>
      <td>Linear</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>

    <tr>
      <td>W8A16</td>
      <td>Linear</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>

    <tr>
      <td>W4A16</td>
      <td>MoE</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>
  </tbody>
</table>

GPTQ on Ascend support:

<table>
  <thead>
    <tr>
      <th>Quantization scheme</th>
      <th>Layer type</th>
      <th>A2 Supported</th>
      <th>A3 Supported</th>
      <th>A5 Supported</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td><a href="https://github.com/sgl-project/sglang/pull/15203">W4A16</a></td>
      <td>Linear</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>

    <tr>
      <td><a href="https://github.com/sgl-project/sglang/pull/15203">W8A16</a></td>
      <td>Linear</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>

    <tr>
      <td><a href="https://github.com/sgl-project/sglang/pull/16364">W4A16</a></td>
      <td>MoE</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>

    <tr>
      <td><a href="https://github.com/sgl-project/sglang/pull/16364">W8A16</a></td>
      <td>MoE</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>
  </tbody>
</table>

[Auto-round on Ascend support](https://github.com/sgl-project/sglang/pull/16699):

<table>
  <thead>
    <tr>
      <th>Quantization scheme</th>
      <th>Layer type</th>
      <th>A2 Supported</th>
      <th>A3 Supported</th>
      <th>A5 Supported</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>W4A16</td>
      <td>Linear</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>

    <tr>
      <td>W8A16</td>
      <td>Linear</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>

    <tr>
      <td>W4A16</td>
      <td>MoE</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>

    <tr>
      <td>W8A16</td>
      <td>MoE</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>
  </tbody>
</table>

Compressed-tensors (LLM Compressor) on Ascend support:

<table>
  <thead>
    <tr>
      <th>Quantization scheme</th>
      <th>Layer type</th>
      <th>A2 Supported</th>
      <th>A3 Supported</th>
      <th>A5 Supported</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td><a href="https://github.com/sgl-project/sglang/pull/14504">W8A8 dynamic</a></td>
      <td>Linear</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>

    <tr>
      <td><a href="https://github.com/sgl-project/sglang/pull/14736">W4A8 dynamic with/without activation clip</a></td>
      <td>MoE</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>

    <tr>
      <td><a href="https://github.com/sgl-project/sglang/pull/12759">W4A16</a></td>
      <td>MoE</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>

    <tr>
      <td><a href="https://github.com/sgl-project/sglang/pull/14504">W8A8 dynamic</a></td>
      <td>MoE</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>
  </tbody>
</table>

[GGUF on Ascend support](https://github.com/sgl-project/sglang/pull/17883):

<table>
  <thead>
    <tr>
      <th>Quantization type</th>
      <th>Layer type</th>
      <th>A2 Supported</th>
      <th>A3 Supported</th>
      <th>A5 Supported</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>All GGUF types (standard, K-quant)</td>
      <td>Linear</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>

    <tr>
      <td>All GGUF types (standard, K-quant)</td>
      <td>MoE</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
    </tr>
  </tbody>
</table>

**Usage Examples:**

* Dense model (e.g. Qwen3-14B-Q4\_K\_M.gguf):

```bash
python3 -m sglang.launch_server \
    --model-path Qwen3-14B-Q4_K_M.gguf \
    --device npu --attention-backend ascend \
    --host 0.0.0.0 --port 30000 \
    --mem-fraction-static 0.7 --tp-size 2
```

* MoE model (e.g. Qwen3-30B-A3B-Q4\_K\_M.gguf):

```bash
python3 -m sglang.launch_server \
    --model-path Qwen3-30B-A3B-Q4_K_M.gguf \
    --device npu --attention-backend ascend \
    --host 0.0.0.0 --port 30000 \
    --mem-fraction-static 0.8 --tp-size 2
```

> **Implementation Notes:**
>
> * GGUF weights are pre-dequantized to FP16/BF16 during model loading on CPU, then transferred to NPU for inference. This trades higher memory usage for faster runtime performance (no per-forward-pass dequantization overhead).
> * MoE layers use `npu_grouped_matmul` and `npu_moe_init_routing` / `npu_moe_finalize_routing` for high-performance expert computation.
> * TP (tensor parallelism) sharding is supported for both dense and MoE GGUF models.
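
To get a feel for the memory trade-off described in the notes, the following back-of-the-envelope sketch compares the size of the quantized file with the per-device size after dequantization to 16-bit. The parameter count (a round 14e9) and the 4.5 bits per weight for the GGUF file are illustrative assumptions, and the estimate ignores the KV cache, activations, and any layers that are replicated rather than sharded.

```python
GIB = 2**30


def weight_gib(num_params, bits_per_weight, tp_size=1):
    """Approximate per-device weight memory in GiB, assuming even TP sharding."""
    return num_params * bits_per_weight / 8 / tp_size / GIB


params = 14e9  # assumed round figure for a 14B-class dense model

# Size of the quantized weights as stored (assumed ~4.5 bits per weight).
on_disk = weight_gib(params, bits_per_weight=4.5)

# Size per NPU after pre-dequantization to FP16/BF16 with --tp-size 2.
on_npu = weight_gib(params, bits_per_weight=16, tp_size=2)

print(f"quantized file: ~{on_disk:.1f} GiB, per-NPU after dequantization: ~{on_npu:.1f} GiB")
```

Under these assumptions, each device holds more weight data than the entire quantized file, which is worth keeping in mind when choosing `--mem-fraction-static` and `--tp-size`.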

## Diffusion Model Quantization on Ascend NPU

SGLang-Diffusion supports MXFP8 online and offline quantization for diffusion models (such as Wan2.2) on Ascend NPUs. MXFP8 requires A5; the ModelSlim W8A8/W4A4 schemes work on A2/A3.

**Requirements for MXFP8:** CANN ≥ 8.0.RC3, Ascend A5

<table>
  <thead>
    <tr>
      <th>Quantization method</th>
      <th><code>quant\_type</code> in JSON</th>
      <th>Scheme class</th>
      <th>Mode</th>
      <th>A2/A3 Supported</th>
      <th>A5 Supported</th>
      <th>Trigger</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>MXFP8 (W8A8)</td>
      <td>—</td>
      <td><code>MXFP8Config</code></td>
      <td>Online</td>
      <td><strong style={{color: 'red'}}>x</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><code>--quantization mxfp8</code></td>
    </tr>

    <tr>
      <td>MXFP8 (W8A8)</td>
      <td><code>W8A8\_MXFP8</code></td>
      <td><code>ModelSlimMXFP8Scheme</code></td>
      <td>Offline</td>
      <td><strong style={{color: 'red'}}>x</strong></td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td>auto-detected from <code>quant\_model\_description.json</code></td>
    </tr>

    <tr>
      <td>W8A8 static</td>
      <td><code>W8A8</code></td>
      <td><code>ModelSlimW8A8Int8</code></td>
      <td>Offline</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
      <td>auto-detected from <code>quant\_model\_description.json</code></td>
    </tr>

    <tr>
      <td>W8A8 dynamic</td>
      <td><code>W8A8\_DYNAMIC</code></td>
      <td><code>ModelSlimW8A8Int8</code></td>
      <td>Offline</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
      <td>auto-detected from <code>quant\_model\_description.json</code></td>
    </tr>

    <tr>
      <td>W4A4 dynamic</td>
      <td><code>W4A4\_DYNAMIC</code></td>
      <td><code>ModelSlimW4A4Int4</code></td>
      <td>Offline</td>
      <td><strong style={{color: 'green'}}>√</strong></td>
      <td><strong style={{color: 'orange'}}>TBD</strong></td>
      <td>auto-detected from <code>quant\_model\_description.json</code></td>
    </tr>
  </tbody>
</table>

### Online MXFP8 Quantization

Online quantization converts FP16/BF16 weights to MXFP8 at load time using the `npu_dynamic_mx_quant` and `npu_quant_matmul` CANN kernels. Pass `--quantization mxfp8` to enable it; the explicit flag overrides auto-detection.

```bash
# Start the diffusion server with online MXFP8 quantization
sglang serve \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --quantization mxfp8 \
  --num-gpus 4
```

```bash
# One-shot generation
sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --quantization mxfp8 \
  --prompt "a beautiful sunset over the mountains" \
  --save-output
```

### Offline MXFP8 Quantization (ModelSlim)

For offline quantization, pre-quantize the model with msModelSlim and load the resulting checkpoint. The quantization scheme is auto-detected from `quant_model_description.json`, so no extra `--quantization` flag is needed.

**Step 1: Quantize with msModelSlim**

```bash
msmodelslim quant \
  --model_path /path/to/wan2_2_float_weights \
  --save_path /path/to/wan2_2_mxfp8_weights \
  --device npu \
  --model_type Wan2_2 \
  --quant_type mxfp8 \
  --trust_remote_code True
```

> Note: SGLang does not support quantized embeddings, so disable embedding quantization when running msModelSlim.

**Step 2: Convert to Diffusers format**

msModelSlim saves quantized Wan2.2 weights in the original Wan format. Convert to Diffusers format using the provided repack script:

```bash
python python/sglang/multimodal_gen/tools/wan_repack.py \
  --input-path /path/to/wan2_2_mxfp8_weights \
  --output-path /path/to/wan2_2_mxfp8_diffusers
```

Then copy all files from the original Diffusers checkpoint (except the `transformer`/`transformer_2` folders) into the output directory.
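
The copy step can be scripted. The sketch below is one possible way to do it, not a tool shipped with SGLang: `copy_pipeline_files` is a hypothetical helper, and the demo directory layout (`vae`, `model_index.json`) is only an illustration of a typical Diffusers checkpoint.

```python
import shutil
import tempfile
from pathlib import Path

SKIP = ("transformer", "transformer_2")


def copy_pipeline_files(original_dir, output_dir, skip=SKIP):
    """Copy every top-level entry except the skipped folders; return copied names."""
    src, dst = Path(original_dir), Path(output_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(src.iterdir()):
        if entry.name in skip:
            continue  # keep the repacked, quantized transformer weights untouched
        target = dst / entry.name
        if entry.is_dir():
            shutil.copytree(entry, target, dirs_exist_ok=True)  # Python 3.8+
        else:
            shutil.copy2(entry, target)
        copied.append(entry.name)
    return copied


# Demo on throwaway directories; replace the paths with real checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original"
    output = Path(tmp) / "mxfp8_diffusers"
    for folder in ("transformer", "transformer_2", "vae"):
        (original / folder).mkdir(parents=True)
        (original / folder / "config.json").write_text("{}")
    (original / "model_index.json").write_text("{}")
    (output / "transformer").mkdir(parents=True)
    (output / "transformer" / "quantized.marker").write_text("")

    copied = copy_pipeline_files(original, output)
    transformer_files = sorted(p.name for p in (output / "transformer").iterdir())

print(copied, transformer_files)
```

Skipping the two transformer folders ensures the float weights from the original checkpoint never overwrite the repacked MXFP8 weights.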

**Step 3: Run inference**

```bash
sglang generate \
  --model-path /path/to/wan2_2_mxfp8_diffusers \
  --prompt "a beautiful sunset over the mountains" \
  --save-output
```

For pre-quantized checkpoints available on ModelScope, see [modelscope/Eco-Tech](https://modelscope.cn/models/Eco-Tech).
