To load an already-quantized model, simply load its weights and config. Again, if the model has been quantized offline, there is no need to add the --quantization argument when starting the engine: the quantization method is parsed automatically from the downloaded quant_model_description.json or config.json.
SGLang supports mixed-bit quantization: each layer is defined and loaded independently, according to the quantization type specified for it in quant_model_description.json. Advanced mixed-bit support for MoE is in progress and will add independent quantization selection for the w13 (up-gate) and w2 (down) layers.
ModelSlim on Ascend support
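Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models
W4A4 dynamic | Linear |  |  | TBD |
W8A8 static | Linear |  |  | TBD |
W8A8 dynamic | Linear |  |  | TBD |
MXFP8 | Linear | x | x |  |
MXFP4 | Linear | x | x |  |
W4A4 dynamic | MoE |  |  | TBD | x
W4A8 dynamic | MoE |  |  | TBD | x
W8A8 dynamic | MoE |  |  | TBD | x
MXFP8 | MoE | x | x | WIP | x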
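As a rough illustration of the mixed-bit loading described above, the sketch below tallies the per-layer quantization types found in a ModelSlim description file. It assumes that quant_model_description.json is a flat JSON object mapping tensor names to quantization-type strings (such as W8A8_DYNAMIC or FLOAT). That layout, the example tensor names, and both helper names are assumptions made for illustration; this is not SGLang's loader.

```python
import json
from collections import Counter
from pathlib import Path


def load_quant_description(checkpoint_dir):
    """Read quant_model_description.json from a checkpoint directory."""
    path = Path(checkpoint_dir) / "quant_model_description.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def summarize_quant_types(description):
    """Count weight tensors per quantization type.

    Assumes a flat {tensor_name: quant_type} mapping; non-string values
    and non-weight entries are ignored.
    """
    return Counter(
        qtype
        for name, qtype in description.items()
        if name.endswith(".weight") and isinstance(qtype, str)
    )


# Hypothetical mixed-bit description: names and values are illustrative only.
example = {
    "model.layers.0.self_attn.q_proj.weight": "W8A8_DYNAMIC",
    "model.layers.0.mlp.down_proj.weight": "W4A4_DYNAMIC",
    "model.layers.1.self_attn.q_proj.weight": "W8A8_DYNAMIC",
    "lm_head.weight": "FLOAT",
}

for qtype, count in sorted(summarize_quant_types(example).items()):
    print(f"{qtype}: {count} tensor(s)")
```

Running a summary like this on a downloaded checkpoint is a quick way to confirm which layers were left in floating point before starting the engine.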
AWQ on Ascend support:
Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported
W4A16 | Linear |  |  | TBD
W8A16 | Linear |  |  | TBD
W4A16 | MoE |  |  | TBD
GPTQ on Ascend support
Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported
W4A16 | Linear |  |  | TBD
W8A16 | Linear |  |  | TBD
W4A16 MOE | MoE |  |  | TBD
W8A16 MOE | MoE |  |  | TBD
Auto-round on Ascend support
Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported
W4A16 | Linear |  |  | TBD
W8A16 | Linear |  |  | TBD
W4A16 | MoE |  |  | TBD
W8A16 | MoE |  |  | TBD
Compressed-tensors (LLM Compressor) on Ascend support:
Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported
W8A8 dynamic | Linear |  |  | TBD
W4A8 dynamic with/without activation clip | MoE |  |  | TBD
W4A16 MOE | MoE |  |  | TBD
W8A8 dynamic | MoE |  |  | TBD
GGUF on Ascend support
Quantization type | Layer type | A2 Supported | A3 Supported | A5 Supported
All GGUF types (standard, K-quant) | Linear |  |  | TBD
All GGUF types (standard, K-quant) | MoE |  |  | TBD
Usage Examples:
  • Dense model (e.g. Qwen3-14B-Q4_K_M.gguf):
python3 -m sglang.launch_server \
    --model-path Qwen3-14B-Q4_K_M.gguf \
    --device npu --attention-backend ascend \
    --host 0.0.0.0 --port 30000 \
    --mem-fraction-static 0.7 --tp-size 2
  • MoE model (e.g. Qwen3-30B-A3B-Q4_K_M.gguf):
python3 -m sglang.launch_server \
    --model-path Qwen3-30B-A3B-Q4_K_M.gguf \
    --device npu --attention-backend ascend \
    --host 0.0.0.0 --port 30000 \
    --mem-fraction-static 0.8 --tp-size 2
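Once either server above is running, it can be queried over HTTP. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable on 127.0.0.1:30000 and uses SGLang's native /generate endpoint; the helper names and the sampling values are illustrative choices, not part of the commands above.

```python
import json
import urllib.request


def build_generate_request(prompt, host="127.0.0.1", port=30000, max_new_tokens=32):
    """Build a POST request for the /generate endpoint (nothing is sent yet)."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0.0},
    }
    return urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text from the JSON response."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["text"]


# Example call (requires a running server):
# print(generate("The capital of France is"))
```

Building the request separately from sending it makes the payload easy to inspect when a launch option such as the port is changed.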
Implementation Notes:
  • GGUF weights are pre-dequantized to FP16/BF16 during model loading on CPU, then transferred to NPU for inference. This trades higher memory usage for faster runtime performance (no per-forward-pass dequantization overhead).
  • MoE layers use npu_grouped_matmul and npu_moe_init_routing / npu_moe_finalize_routing for high-performance expert computation.
  • TP (tensor parallelism) sharding is supported for both dense and MoE GGUF models.
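The memory trade-off in the first note can be estimated with simple arithmetic. The sketch below compares the weight footprint of a quantized file with the same weights after dequantization to 16-bit values, split across tensor-parallel ranks. The figures of 14 billion parameters and 4.5 bits per weight are round-number assumptions for illustration, not measurements of any specific GGUF file, and the estimate ignores KV cache and activations.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_memory_gib(num_params, bits_per_weight, tp_size=1):
    """Approximate per-device weight memory in GiB, assuming even TP sharding."""
    return num_params * bits_per_weight / 8 / tp_size / GIB


NUM_PARAMS = 14e9          # assumed parameter count (illustrative)
QUANT_BITS = 4.5           # assumed average bits per weight on disk (illustrative)
DEQUANT_BITS = 16          # FP16/BF16 after pre-dequantization
TP_SIZE = 2                # matches --tp-size 2 in the usage examples

on_disk = weight_memory_gib(NUM_PARAMS, QUANT_BITS)
per_device = weight_memory_gib(NUM_PARAMS, DEQUANT_BITS, TP_SIZE)

print(f"quantized file (total):      {on_disk:.1f} GiB")
print(f"dequantized, per NPU (TP=2): {per_device:.1f} GiB")
```

Under these assumptions the dequantized weights occupy several times the size of the quantized file, which is the cost paid for avoiding dequantization on every forward pass.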

Diffusion Model Quantization on Ascend NPU

SGLang-Diffusion supports MXFP8 online and offline quantization for diffusion models (such as Wan2.2) on Ascend NPUs. MXFP8 requires A5; the ModelSlim W8A8/W4A4 schemes work on A2/A3.
Requirements for MXFP8: CANN ≥ 8.0.RC3, Ascend A5
Quantization method | quant_type in JSON | Scheme class | Mode | A2/A3 Supported | A5 Supported | Trigger
MXFP8 (W8A8) |  | MXFP8Config | Online | x |  | --quantization mxfp8
MXFP8 (W8A8) | W8A8_MXFP8 | ModelSlimMXFP8Scheme | Offline | x |  | auto-detected from quant_model_description.json
W8A8 static | W8A8 | ModelSlimW8A8Int8 | Offline |  | TBD | auto-detected from quant_model_description.json
W8A8 dynamic | W8A8_DYNAMIC | ModelSlimW8A8Int8 | Offline |  | TBD | auto-detected from quant_model_description.json
W4A4 dynamic | W4A4_DYNAMIC | ModelSlimW4A4Int4 | Offline |  | TBD | auto-detected from quant_model_description.json

Online MXFP8 Quantization

Online quantization dynamically quantizes FP16/BF16 weights to MXFP8 at load time using npu_dynamic_mx_quant + npu_quant_matmul CANN kernels. Pass --quantization mxfp8 to override auto-detection.
# Start the diffusion server with online MXFP8 quantization
sglang serve \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --quantization mxfp8 \
  --num-gpus 4
# One-shot generation
sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --quantization mxfp8 \
  --prompt "a beautiful sunset over the mountains" \
  --save-output
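To give a feel for what block-scaled MXFP8 quantization involves, here is a simplified pure-Python sketch of the scaling step. It assumes the OCP Microscaling (MX) convention of 32-element blocks that share one power-of-two scale, with FP8 E4M3 elements. It does not call or reproduce the npu_dynamic_mx_quant kernel, and it stops short of rounding the scaled values to FP8; the function names are hypothetical.

```python
import math

BLOCK_SIZE = 32     # elements sharing one scale (assumed MX block size)
E4M3_EMAX = 8       # largest exponent of an FP8 E4M3 element
E4M3_MAX = 448.0    # largest finite FP8 E4M3 magnitude


def mx_block_scale_exponent(block):
    """Shared power-of-two exponent for one block of values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0  # all-zero block: any scale works, pick 2**0
    return math.floor(math.log2(amax)) - E4M3_EMAX


def mx_scale_blocks(values):
    """Return (exponent, scaled_elements) per block.

    Scaled elements are clamped to the E4M3 range; rounding them to
    actual FP8 values is intentionally omitted in this sketch.
    """
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        exponent = mx_block_scale_exponent(block)
        scale = 2.0 ** exponent
        scaled = [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in block]
        blocks.append((exponent, scaled))
    return blocks


weights = [0.5] * 32 + [3.0] * 32
for exponent, scaled in mx_scale_blocks(weights):
    print(f"scale = 2**{exponent}, first scaled element = {scaled[0]}")
```

Because each block carries its own scale, a block of small weights and a block of large weights both land in the representable FP8 range, which is what makes quantizing at load time workable without calibration data.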

Offline MXFP8 Quantization (ModelSlim)

For offline quantization, pre-quantize the model with msModelSlim and load the resulting checkpoint. The quantization scheme is auto-detected from quant_model_description.json, so no extra --quantization flag is needed.
Step 1: Quantize with msModelSlim
msmodelslim quant \
  --model_path /path/to/wan2_2_float_weights \
  --save_path /path/to/wan2_2_mxfp8_weights \
  --device npu \
  --model_type Wan2_2 \
  --quant_type mxfp8 \
  --trust_remote_code True
Note: SGLang does not support quantized embeddings; disable embedding quantization when using msmodelslim.
Step 2: Convert to Diffusers format
msModelSlim saves quantized Wan2.2 weights in the original Wan format. Convert them to Diffusers format using the provided repack script:
python python/sglang/multimodal_gen/tools/wan_repack.py \
  --input-path /path/to/wan2_2_mxfp8_weights \
  --output-path /path/to/wan2_2_mxfp8_diffusers
Then copy all files from the original Diffusers checkpoint (except the transformer/transformer_2 folders) into the output directory.
Step 3: Run inference
sglang generate \
  --model-path /path/to/wan2_2_mxfp8_diffusers \
  --prompt "a beautiful sunset over the mountains" \
  --save-output
For pre-quantized checkpoints available on ModelScope, see modelscope/Eco-Tech.