`--quantization` argument when starting the engine. The quantization method is parsed automatically from the downloaded `quant_model_description.json` or `config.json`.
SGLang supports mixed-bit quantization: each layer is defined and loaded independently, according to the quantization type specified for it in `quant_model_description.json`. Advanced mixed-bit support for MoE is in progress and will add independent quantization selection for the w13 (up-gate) and w2 (down) layers.
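To illustrate per-layer selection, the sketch below groups weights by the quantization type recorded for each one. It assumes `quant_model_description.json` is a flat JSON object mapping weight names to type strings; the sample keys, the `FLOAT` marker, and the `group_layers_by_quant_type` helper are hypothetical and are not SGLang's loader.

```python
import json
from collections import defaultdict

# Hypothetical excerpt of quant_model_description.json (the layout is an assumption).
SAMPLE = """
{
  "model.layers.0.self_attn.q_proj.weight": "W8A8_DYNAMIC",
  "model.layers.0.mlp.down_proj.weight": "W4A4_DYNAMIC",
  "model.layers.0.input_layernorm.weight": "FLOAT"
}
"""


def group_layers_by_quant_type(description: dict) -> dict:
    """Map each quantization type to the list of weight names that use it."""
    groups = defaultdict(list)
    for name, qtype in description.items():
        if isinstance(qtype, str):  # ignore any non-string metadata entries
            groups[qtype].append(name)
    return dict(groups)


groups = group_layers_by_quant_type(json.loads(SAMPLE))
for qtype, names in sorted(groups.items()):
    print(qtype, len(names))
```

A loader built this way can pick a different scheme per layer, which is the core idea of mixed-bit loading.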
ModelSlim on Ascend support
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models |
|---|---|---|---|---|---|
| W4A4 dynamic | Linear | √ | √ | TBD | √ |
| W8A8 static | Linear | √ | √ | TBD | √ |
| W8A8 dynamic | Linear | √ | √ | TBD | √ |
| MXFP8 | Linear | x | x | √ | √ |
| MXFP4 | Linear | x | x | √ | √ |
| W4A4 dynamic | MoE | √ | √ | TBD | x |
| W4A8 dynamic | MoE | √ | √ | TBD | x |
| W8A8 dynamic | MoE | √ | √ | TBD | x |
| MXFP8 | MoE | x | x | WIP | x |

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | √ | √ | TBD |
| W8A16 | Linear | √ | √ | TBD |
| W4A16 | MoE | √ | √ | TBD |

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | √ | √ | TBD |
| W8A16 | Linear | √ | √ | TBD |
| W4A16 MOE | MoE | √ | √ | TBD |
| W8A16 MOE | MoE | √ | √ | TBD |

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | √ | √ | TBD |
| W8A16 | Linear | √ | √ | TBD |
| W4A16 | MoE | √ | √ | TBD |
| W8A16 | MoE | √ | √ | TBD |

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W8A8 dynamic | Linear | √ | √ | TBD |
| W4A8 dynamic with/without activation clip | MoE | √ | √ | TBD |
| W4A16 MOE | MoE | √ | √ | TBD |
| W8A8 dynamic | MoE | √ | √ | TBD |

| Quantization type | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| All GGUF types (standard, K-quant) | Linear | √ | √ | TBD |
| All GGUF types (standard, K-quant) | MoE | √ | √ | TBD |
- Dense model (e.g. Qwen3-14B-Q4_K_M.gguf):
Command
- MoE model (e.g. Qwen3-30B-A3B-Q4_K_M.gguf):
Command
Implementation Notes:
- GGUF weights are pre-dequantized to FP16/BF16 during model loading on CPU, then transferred to NPU for inference. This trades higher memory usage for faster runtime performance (no per-forward-pass dequantization overhead).
- MoE layers use `npu_grouped_matmul` and `npu_moe_init_routing`/`npu_moe_finalize_routing` for high-performance expert computation.
- TP (tensor parallelism) sharding is supported for both dense and MoE GGUF models.
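The pre-dequantization trade-off noted above can be sketched with the simplest GGUF type, Q8_0, where each block stores one FP16 scale followed by 32 signed 8-bit values. This is an illustration only: the `dequantize_q8_0` helper is hypothetical, it handles a single type, and it returns Python floats rather than NPU tensors.

```python
import struct

QK8_0 = 32                 # values per Q8_0 block
BLOCK_BYTES = 2 + QK8_0    # one FP16 scale + 32 int8 values


def dequantize_q8_0(raw: bytes) -> list:
    """Expand Q8_0 blocks to floats: value = scale * int8."""
    if len(raw) % BLOCK_BYTES:
        raise ValueError("buffer is not a whole number of Q8_0 blocks")
    out = []
    for offset in range(0, len(raw), BLOCK_BYTES):
        # "<e32b": little-endian half-precision scale, then 32 signed bytes
        scale, *quants = struct.unpack_from("<e32b", raw, offset)
        out.extend(scale * q for q in quants)
    return out


# One synthetic block: scale 0.5, quantized values 0..31.
raw = struct.pack("<e32b", 0.5, *range(32))
weights = dequantize_q8_0(raw)
print(weights[:4])
print(len(raw), "bytes quantized ->", 2 * len(weights), "bytes as FP16")
```

Running the expansion once at load time removes it from every forward pass, at the cost of holding the larger FP16/BF16 copy in memory.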
Diffusion Model Quantization on Ascend NPU
SGLang-Diffusion supports MXFP8 online and offline quantization for diffusion models (such as Wan2.2) on Ascend NPUs. MXFP8 requires A5; the ModelSlim W8A8/W4A4 schemes work on A2/A3.

Requirements for MXFP8: CANN ≥ 8.0.RC3, Ascend A5

| Quantization method | quant_type in JSON | Scheme class | Mode | A2/A3 Supported | A5 Supported | Trigger |
|---|---|---|---|---|---|---|
| MXFP8 (W8A8) | - | MXFP8Config | Online | x | √ | --quantization mxfp8 |
| MXFP8 (W8A8) | W8A8_MXFP8 | ModelSlimMXFP8Scheme | Offline | x | √ | auto-detected from quant_model_description.json |
| W8A8 static | W8A8 | ModelSlimW8A8Int8 | Offline | √ | TBD | auto-detected from quant_model_description.json |
| W8A8 dynamic | W8A8_DYNAMIC | ModelSlimW8A8Int8 | Offline | √ | TBD | auto-detected from quant_model_description.json |
| W4A4 dynamic | W4A4_DYNAMIC | ModelSlimW4A4Int4 | Offline | √ | TBD | auto-detected from quant_model_description.json |
Online MXFP8 Quantization
Online quantization dynamically quantizes FP16/BF16 weights to MXFP8 at load time using the `npu_dynamic_mx_quant` and `npu_quant_matmul` CANN kernels. Pass `--quantization mxfp8` to override auto-detection.
Command
Command
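As a rough numerical picture of what online MXFP8 quantization does to a weight block, the sketch below follows the OCP Microscaling layout: 32 elements share one power-of-two scale, and each element is rounded to FP8. It assumes E4M3 elements and is written in plain Python; the helper names are hypothetical, and it does not call or reproduce the exact behavior of the CANN kernels.

```python
import math

E4M3_MAX = 448.0   # largest finite FP8 E4M3 magnitude
E4M3_EMAX = 8      # exponent of the largest E4M3 binade (448 = 1.75 * 2**8)


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest FP8 E4M3 value, saturating at +/-448."""
    if v == 0.0:
        return 0.0
    a = min(abs(v), E4M3_MAX)
    exp = max(math.frexp(a)[1] - 1, -6)   # floor(log2(a)), floored at the subnormal range
    step = 2.0 ** (exp - 3)               # 3 mantissa bits: 8 steps per binade
    return math.copysign(min(round(a / step) * step, E4M3_MAX), v)


def quantize_block(values):
    """Return (shared_exponent, fp8_elements) for one block of up to 32 values."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    shared_exp = (math.frexp(amax)[1] - 1) - E4M3_EMAX
    shared_exp = max(-127, min(127, shared_exp))   # E8M0 scale range
    scale = 2.0 ** shared_exp
    # Elements that still exceed 448 after scaling saturate in round_to_e4m3.
    return shared_exp, [round_to_e4m3(v / scale) for v in values]


def dequantize_block(shared_exp, elements):
    return [e * 2.0 ** shared_exp for e in elements]


block = [3.0, 0.1, -1.5] + [0.0] * 29
shared_exp, elements = quantize_block(block)
restored = dequantize_block(shared_exp, elements)
print(shared_exp, restored[:3])
```

Because the scale is shared per block, small values next to a large one lose relative precision, which is the usual accuracy cost of block-scaled formats.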
Offline MXFP8 Quantization (ModelSlim)
For offline quantization, pre-quantize the model with msModelSlim and load the resulting checkpoint. The quantization scheme is auto-detected from `quant_model_description.json`, so no extra `--quantization` flag is needed.
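One way to picture the auto-detection step is a lookup from the `quant_type` strings in the description file to the scheme classes listed in the table in this section. The sketch assumes the file sits at the top of the checkpoint directory and maps weight names to type strings; `detect_offline_schemes` is a hypothetical helper, not SGLang's implementation.

```python
import json
import tempfile
from pathlib import Path

# quant_type -> scheme class name, copied from the table in this section.
SCHEME_BY_QUANT_TYPE = {
    "W8A8_MXFP8": "ModelSlimMXFP8Scheme",
    "W8A8": "ModelSlimW8A8Int8",
    "W8A8_DYNAMIC": "ModelSlimW8A8Int8",
    "W4A4_DYNAMIC": "ModelSlimW4A4Int4",
}


def detect_offline_schemes(checkpoint_dir) -> set:
    """Return the scheme class names implied by the description file, if any."""
    path = Path(checkpoint_dir) / "quant_model_description.json"
    if not path.is_file():
        return set()  # no description file: nothing to auto-detect
    description = json.loads(path.read_text())
    return {
        SCHEME_BY_QUANT_TYPE[value]
        for value in description.values()
        if isinstance(value, str) and value in SCHEME_BY_QUANT_TYPE
    }


with tempfile.TemporaryDirectory() as tmp:
    missing = detect_offline_schemes(tmp)
    # Hypothetical weight name, used only to exercise the lookup.
    sample = {"blocks.0.attn.q.weight": "W8A8_MXFP8"}
    (Path(tmp) / "quant_model_description.json").write_text(json.dumps(sample))
    found = detect_offline_schemes(tmp)

print(missing, found)
```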
Step 1: Quantize with msModelSlim
Command
Note: SGLang does not support quantized embeddings; disable embedding quantization when using msModelSlim.

Step 2: Convert to Diffusers format

msModelSlim saves quantized Wan2.2 weights in the original Wan format. Convert them to Diffusers format using the provided repack script:
Command
transformer/transformer_2 folders) into the output directory.
Step 3: Run inference
Command
