# Quantization on Ascend

To load an already quantized model, simply load the model weights and config. If the model was quantized offline, there is no need to pass the `--quantization` argument when starting the engine: the quantization method is parsed automatically from the downloaded `quant_model_description.json` or `config.json`.

SGLang supports **mix-bits** quantization: each layer is defined and loaded independently, according to the quantization type specified for it in `quant_model_description.json`. [Advanced mix-bits for MoE](https://github.com/sgl-project/sglang/pull/17361) is in progress and will add independent quantization determination for the w13 (up-gate) and w2 (down) layers.

[ModelSlim on Ascend support](https://github.com/sgl-project/sglang/pull/14504):

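| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models |
| ------------------- | ---------- | ------------ | ------------ | ------------ | ---------------- |
| W4A4 dynamic        | Linear     | √            | √            | TBD          | √                |
| W8A8 static         | Linear     | √            | √            | TBD          | √                |
| W8A8 dynamic        | Linear     | √            | √            | TBD          | √                |
| MXFP8               | Linear     | x            | x            | √            | √                |
| MXFP4               | Linear     | x            | x            | √            | √                |
| W4A4 dynamic        | MoE        | √            | √            | TBD          | x                |
| W4A8 dynamic        | MoE        | √            | √            | TBD          | x                |
| W8A8 dynamic        | MoE        | √            | √            | TBD          | x                |
| MXFP8               | MoE        | x            | x            | WIP          | x                |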
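To see which schemes a mix-bits ModelSlim checkpoint actually uses, it can help to tally the per-layer entries in `quant_model_description.json`. The sketch below is only illustrative: it assumes the file is a flat JSON object mapping each weight name to a quantization type string, and the sample entries (including the `FLOAT` label for unquantized weights) are invented for the example rather than taken from a real checkpoint.

```python
import json
from collections import Counter

# Hypothetical excerpt of a quant_model_description.json file.
# The weight names and type labels below are assumptions for illustration.
SAMPLE_DESCRIPTION = """
{
  "model.layers.0.self_attn.q_proj.weight": "W8A8_DYNAMIC",
  "model.layers.0.mlp.down_proj.weight": "W4A4_DYNAMIC",
  "model.layers.0.input_layernorm.weight": "FLOAT",
  "model.layers.1.self_attn.q_proj.weight": "W8A8_DYNAMIC"
}
"""


def summarize_quant_types(description_text):
    """Count how many entries use each quantization type string."""
    description = json.loads(description_text)
    # Skip non-string values in case the file carries extra metadata.
    return Counter(v for v in description.values() if isinstance(v, str))


summary = summarize_quant_types(SAMPLE_DESCRIPTION)
for quant_type, count in sorted(summary.items()):
    print(f"{quant_type}: {count}")
```

For a real checkpoint, the same function could be given the text of the file, for example `summarize_quant_types(open("quant_model_description.json").read())`.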

[AWQ on Ascend support](https://github.com/sgl-project/sglang/pull/10158):

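| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
| ------------------- | ---------- | ------------ | ------------ | ------------ |
| W4A16               | Linear     | √            | √            | TBD          |
| W8A16               | Linear     | √            | √            | TBD          |
| W4A16               | MoE        | √            | √            | TBD          |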

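GPTQ on Ascend support:

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
| ------------------- | ---------- | ------------ | ------------ | ------------ |
| W4A16               | Linear     | √            | √            | TBD          |
| W8A16               | Linear     | √            | √            | TBD          |
| W4A16 MOE           | MoE        | √            | √            | TBD          |
| W8A16 MOE           | MoE        | √            | √            | TBD          |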

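[Auto-round on Ascend support](https://github.com/sgl-project/sglang/pull/16699):

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
| ------------------- | ---------- | ------------ | ------------ | ------------ |
| W4A16               | Linear     | √            | √            | TBD          |
| W8A16               | Linear     | √            | √            | TBD          |
| W4A16               | MoE        | √            | √            | TBD          |
| W8A16               | MoE        | √            | √            | TBD          |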

Compressed-tensors (LLM Compressor) on Ascend support:

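| Quantization scheme                       | Layer type | A2 Supported | A3 Supported | A5 Supported |
| ----------------------------------------- | ---------- | ------------ | ------------ | ------------ |
| W8A8 dynamic                              | Linear     | √            | √            | TBD          |
| W4A8 dynamic with/without activation clip | MoE        | √            | √            | TBD          |
| W4A16 MOE                                 | MoE        | √            | √            | TBD          |
| W8A8 dynamic                              | MoE        | √            | √            | TBD          |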

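[GGUF on Ascend support](https://github.com/sgl-project/sglang/pull/17883):

| Quantization type                  | Layer type | A2 Supported | A3 Supported | A5 Supported |
| ---------------------------------- | ---------- | ------------ | ------------ | ------------ |
| All GGUF types (standard, K-quant) | Linear     | √            | √            | TBD          |
| All GGUF types (standard, K-quant) | MoE        | √            | √            | TBD          |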

**Usage Examples:**

* Dense model (e.g. `Qwen3-14B-Q4_K_M.gguf`):

  ```bash
  python3 -m sglang.launch_server \
      --model-path Qwen3-14B-Q4_K_M.gguf \
      --device npu --attention-backend ascend \
      --host 0.0.0.0 --port 30000 \
      --mem-fraction-static 0.7 --tp-size 2
  ```

* MoE model (e.g. `Qwen3-30B-A3B-Q4_K_M.gguf`):

  ```bash
  python3 -m sglang.launch_server \
      --model-path Qwen3-30B-A3B-Q4_K_M.gguf \
      --device npu --attention-backend ascend \
      --host 0.0.0.0 --port 30000 \
      --mem-fraction-static 0.8 --tp-size 2
  ```

> **Implementation Notes:**
>
> * GGUF weights are pre-dequantized to FP16/BF16 during model loading on CPU, then transferred to NPU for inference. This trades higher memory usage for faster runtime performance (no per-forward-pass dequantization overhead).
> * MoE layers use `npu_grouped_matmul` and `npu_moe_init_routing` / `npu_moe_finalize_routing` for high-performance expert computation.
> * TP (tensor parallelism) sharding is supported for both dense and MoE GGUF models.

## Diffusion Model Quantization on Ascend NPU

SGLang-Diffusion supports MXFP8 online and offline quantization for diffusion models (such as Wan2.2) on Ascend NPUs. MXFP8 requires A5; the ModelSlim W8A8/W4A4 schemes work on A2/A3.

**Requirements for MXFP8:** CANN ≥ 8.0.RC3, Ascend A5

| Quantization method | `quant_type` in JSON | Scheme class           | Mode    | A2/A3 Supported | A5 Supported | Trigger                                           |
| ------------------- | -------------------- | ---------------------- | ------- | --------------- | ------------ | ------------------------------------------------- |
| MXFP8 (W8A8)        | n/a                  | `MXFP8Config`          | Online  | x               | √            | `--quantization mxfp8`                            |
| MXFP8 (W8A8)        | `W8A8_MXFP8`         | `ModelSlimMXFP8Scheme` | Offline | x               | √            | auto-detected from `quant_model_description.json` |
| W8A8 static         | `W8A8`               | `ModelSlimW8A8Int8`    | Offline | √               | TBD          | auto-detected from `quant_model_description.json` |
| W8A8 dynamic        | `W8A8_DYNAMIC`       | `ModelSlimW8A8Int8`    | Offline | √               | TBD          | auto-detected from `quant_model_description.json` |
| W4A4 dynamic        | `W4A4_DYNAMIC`       | `ModelSlimW4A4Int4`    | Offline | √               | TBD          | auto-detected from `quant_model_description.json` |
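The offline rows of the table can be read as a lookup from the `quant_type` string to a scheme class and its hardware status. The sketch below simply transcribes those rows into a dictionary; `SchemeInfo`, `OFFLINE_SCHEMES`, and `lookup_scheme` are hypothetical names for illustration and do not reflect how SGLang dispatches schemes internally.

```python
from typing import NamedTuple, Optional


class SchemeInfo(NamedTuple):
    scheme_class: str
    a2_a3: Optional[bool]  # True = supported, False = unsupported, None = TBD
    a5: Optional[bool]


# Offline rows transcribed from the table above (illustrative only).
OFFLINE_SCHEMES = {
    "W8A8_MXFP8": SchemeInfo("ModelSlimMXFP8Scheme", a2_a3=False, a5=True),
    "W8A8": SchemeInfo("ModelSlimW8A8Int8", a2_a3=True, a5=None),
    "W8A8_DYNAMIC": SchemeInfo("ModelSlimW8A8Int8", a2_a3=True, a5=None),
    "W4A4_DYNAMIC": SchemeInfo("ModelSlimW4A4Int4", a2_a3=True, a5=None),
}


def lookup_scheme(quant_type: str) -> Optional[SchemeInfo]:
    """Return the table row for a quant_type string, or None if unlisted."""
    return OFFLINE_SCHEMES.get(quant_type)


info = lookup_scheme("W8A8_DYNAMIC")
print(info.scheme_class, "| A2/A3:", info.a2_a3, "| A5:", info.a5)
```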

### Online MXFP8 Quantization

Online quantization dynamically quantizes FP16/BF16 weights to MXFP8 at load time using the `npu_dynamic_mx_quant` + `npu_quant_matmul` CANN kernels. Pass `--quantization mxfp8` to override auto-detection.

```bash
# Start the diffusion server with online MXFP8 quantization
sglang serve \
    --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --quantization mxfp8 \
    --num-gpus 4
```

```bash
# One-shot generation
sglang generate \
    --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --quantization mxfp8 \
    --prompt "a beautiful sunset over the mountains" \
    --save-output
```

### Offline MXFP8 Quantization (ModelSlim)

For offline quantization, pre-quantize the model with msModelSlim and load the resulting checkpoint. The quantization scheme is auto-detected from `quant_model_description.json`, so no extra `--quantization` flag is needed.

**Step 1: Quantize with msModelSlim**

```bash
msmodelslim quant \
    --model_path /path/to/wan2_2_float_weights \
    --save_path /path/to/wan2_2_mxfp8_weights \
    --device npu \
    --model_type Wan2_2 \
    --quant_type mxfp8 \
    --trust_remote_code True
```

> Note: SGLang does not support quantized embeddings; disable embedding quantization when using msmodelslim.

**Step 2: Convert to Diffusers format**

msModelSlim saves quantized Wan2.2 weights in the original Wan format. Convert them to Diffusers format using the provided repack script:

```bash
python python/sglang/multimodal_gen/tools/wan_repack.py \
    --input-path /path/to/wan2_2_mxfp8_weights \
    --output-path /path/to/wan2_2_mxfp8_diffusers
```

Then copy all files from the original Diffusers checkpoint (except the `transformer`/`transformer_2` folders) into the output directory.
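The copy at the end of Step 2 can be scripted. The following is a minimal sketch using only the Python standard library; the helper name `copy_non_transformer_files` and the example paths are hypothetical, and it assumes the repacked `transformer`/`transformer_2` folders already exist in the output directory and must not be overwritten.

```python
import shutil
from pathlib import Path

# Folders that the repack script already produced in quantized form.
SKIP_DIRS = {"transformer", "transformer_2"}


def copy_non_transformer_files(original_ckpt, output_dir):
    """Copy every top-level entry except the transformer folders.

    Returns the sorted list of entry names that were copied.
    """
    src, dst = Path(original_ckpt), Path(output_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(src.iterdir()):
        if entry.name in SKIP_DIRS:
            continue  # keep the quantized weights already in output_dir
        target = dst / entry.name
        if entry.is_dir():
            # dirs_exist_ok requires Python 3.8 or newer.
            shutil.copytree(entry, target, dirs_exist_ok=True)
        else:
            shutil.copy2(entry, target)
        copied.append(entry.name)
    return copied


# Example call (paths are placeholders):
# copy_non_transformer_files("/path/to/original_diffusers_checkpoint",
#                            "/path/to/wan2_2_mxfp8_diffusers")
```

Skipping by top-level folder name keeps the script independent of which other components (scheduler, text encoder, VAE, and so on) a particular checkpoint ships with.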
**Step 3: Run inference**

```bash
sglang generate \
    --model-path /path/to/wan2_2_mxfp8_diffusers \
    --prompt "a beautiful sunset over the mountains" \
    --save-output
```

For pre-quantized checkpoints available on ModelScope, see [modelscope/Eco-Tech](https://modelscope.cn/models/Eco-Tech).