Quantization on Ascend - SGLang Documentation

To load an already quantized model, simply load its weights and config. If the model was quantized offline, there is no need to pass the --quantization argument when starting the engine; the quantization method is parsed automatically from the downloaded quant_model_description.json or config.json.

SGLang supports mixed-bit quantization: each layer is defined and loaded independently, according to the quantization type specified for it in quant_model_description.json. Advanced mixed-bit support for MoE is in progress and will add independent quantization selection for the w13 (up-gate) and w2 (down) layers.

ModelSlim on Ascend support

Quantization scheme	Layer type	A2 Supported	A3 Supported	A5 Supported	Diffusion models
W4A4 dynamic	Linear	√	√	TBD	√
W8A8 static	Linear	√	√	TBD	√
W8A8 dynamic	Linear	√	√	TBD	√
MXFP8 (Diffusion, LLM dense)	Linear	x	x	√	√
MXFP4	Linear	x	x	√	√
MXFP4 W4A8	Linear	x	x	√	x
MXFP4 W4A4	Linear	x	x	WIP	x
W4A4 dynamic	MoE	√	√	TBD	x
W4A8 dynamic	MoE	√	√	TBD	x
W8A8 dynamic	MoE	√	√	TBD	x
MXFP8	MoE	x	x	WIP	x
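
Because the scheme is read per layer from quant_model_description.json, it can help to inspect that file before launching the server. The following is a minimal sketch, not SGLang's own loader: it assumes the file is a flat JSON object mapping weight names to quantization-type strings (for example "W8A8_DYNAMIC" or "FLOAT"), and the helper name summarize_quant_description is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_quant_description(path):
    """Count how many entries use each quantization type.

    Assumption: the file is a flat JSON object whose string values are
    quantization types. Non-string values are skipped.
    """
    description = json.loads(Path(path).read_text(encoding="utf-8"))
    return Counter(v for v in description.values() if isinstance(v, str))


# Usage (the path is illustrative):
# for quant_type, n in summarize_quant_description(
#         "/path/to/model/quant_model_description.json").most_common():
#     print(f"{quant_type}: {n} entries")
```

A checkpoint that reports more than one quantization type in this summary is a mixed-bit checkpoint in the sense described above.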

AWQ on Ascend support:

Quantization scheme	Layer type	A2 Supported	A3 Supported	A5 Supported
W4A16	Linear	√	√	TBD
W8A16	Linear	√	√	TBD
W4A16	MoE	√	√	TBD

GPTQ on Ascend support

Quantization scheme	Layer type	A2 Supported	A3 Supported	A5 Supported
W4A16	Linear	√	√	TBD
W8A16	Linear	√	√	TBD
W4A16 MOE	MoE	√	√	TBD
W8A16 MOE	MoE	√	√	TBD

Auto-round on Ascend support

Quantization scheme	Layer type	A2 Supported	A3 Supported	A5 Supported
W4A16	Linear	√	√	TBD
W8A16	Linear	√	√	TBD
W4A16	MoE	√	√	TBD
W8A16	MoE	√	√	TBD

Compressed-tensors (LLM Compressor) on Ascend support:

Quantization scheme	Layer type	A2 Supported	A3 Supported	A5 Supported
W8A8 dynamic	Linear	√	√	TBD
W4A8 dynamic with/without activation clip	MoE	√	√	TBD
W4A16 MOE	MoE	√	√	TBD
W8A8 dynamic	MoE	√	√	TBD

GGUF on Ascend support

Quantization type	Layer type	A2 Supported	A3 Supported	A5 Supported
All GGUF types (standard, K-quant)	Linear	√	√	TBD
All GGUF types (standard, K-quant)	MoE	√	√	TBD

Usage Examples:

Dense model (e.g., Qwen3-14B-Q4_K_M.gguf):

Command

python3 -m sglang.launch_server \
    --model-path Qwen3-14B-Q4_K_M.gguf \
    --device npu --attention-backend ascend \
    --host 0.0.0.0 --port 30000 \
    --mem-fraction-static 0.7 --tp-size 2

MoE model (e.g., Qwen3-30B-A3B-Q4_K_M.gguf):

Command

python3 -m sglang.launch_server \
    --model-path Qwen3-30B-A3B-Q4_K_M.gguf \
    --device npu --attention-backend ascend \
    --host 0.0.0.0 --port 30000 \
    --mem-fraction-static 0.8 --tp-size 2
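
Once either server is running, a quick request confirms that the GGUF model loaded and generates text. The sketch below uses only the Python standard library and targets SGLang's native /generate endpoint on the port used in the commands above; the helper names, the prompt, and the sampling values are illustrative assumptions.

```python
import json
import urllib.request


def build_generate_request(prompt, host="127.0.0.1", port=30000, max_new_tokens=32):
    """Build a POST request for the /generate endpoint (nothing is sent here)."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0},
    }
    return urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request to a running server and return the decoded JSON reply."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


# Usage, with the server from the commands above already running:
# print(generate("The capital of France is")["text"])
```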

Implementation Notes:

GGUF weights are pre-dequantized to FP16/BF16 during model loading on CPU, then transferred to NPU for inference. This trades higher memory usage for faster runtime performance (no per-forward-pass dequantization overhead).

MoE layers use npu_grouped_matmul and npu_moe_init_routing / npu_moe_finalize_routing for high-performance expert computation.

TP (tensor parallelism) sharding is supported for both dense and MoE GGUF models.

MXFP8 for LLM dense models (e.g., Qwen3 / Qwen3.5): LLM dense W8A8 MXFP8 Linear support on Ascend was added in PR #22352. Requires Ascend A5 series or newer (npu_dynamic_mx_quant is not available on A2 / A3).

Online MXFP8 quantization (BF16/FP16 weights → MXFP8 at load time):

Command

python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --quantization mxfp8 \
    --device npu --attention-backend ascend \
    --host 0.0.0.0 --port 30000 \
    --mem-fraction-static 0.8 --tp-size 1

Offline MXFP8 quantization (msmodelslim pre-quantized weights, W8A8_MXFP8 scheme; no --quantization flag needed — auto-detected from quant_model_description.json):

Command

python3 -m sglang.launch_server \
    --model-path /path/to/Qwen3-8B-W8A8-MXFP8 \
    --device npu --attention-backend ascend \
    --host 0.0.0.0 --port 30000 \
    --mem-fraction-static 0.8 --tp-size 1

Implementation Notes:

Online path: Fp8Config.get_quant_method() dispatches to NPUMXFP8LinearMethod. Weights are quantized once at load via npu_dynamic_mx_quant(weight, dst_type=torch_npu.float8_e4m3fn) and pre-transposed to [in, out]; activations are per-token quantized at inference and matmul runs via npu_quant_matmul(..., group_sizes=[1, 1, 32]) (block_size = 32).

Offline path: ModelSlimMXFP8Scheme loads float8_e4m3fn weights + float8_e8m0fnu block scales pre-exported by msmodelslim. Transpose is kept as a non-contiguous view (.data assignment) — calling .contiguous() would physically reorder the pre-quantized layout and break the block-scale mapping.

MoE MXFP8 (FusedMoE/TP) for LLMs is tracked separately and is not covered by PR #22352.

MXFP4 W4A8 for LLM dense models (e.g., Qwen3 / Qwen3.5): LLM dense W4A8 (MXFP4 4-bit weights + MXFP8 8-bit activations) Linear support was added in PR #23650. Requires Ascend A5 series or newer.

Online W4A8 quantization (BF16/FP16 weights → MXFP4 at load time):

Command

python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --quantization mxfp_w4a8 \
    --device npu --attention-backend ascend \
    --host 0.0.0.0 --port 30000 \
    --mem-fraction-static 0.8 --tp-size 1

Offline W4A8 quantization (msmodelslim pre-quantized weights, W4A8_MXFP scheme; no --quantization flag needed — auto-detected from quant_model_description.json).

Implementation Notes:

Weights are packed FP4 (float4_e2m1fn_x2, two nibbles per byte) with a UE8M0 per-block shared exponent (block_size = 32); activations are per-token MXFP8. Matmul runs via npu_quant_matmul(..., x2_dtype=torch_npu.float4_e2m1fn_x2, group_sizes=[0, 0, 32]).

The packed-FP4 dtype passed to the NPU ops (dst_type / x2_dtype / input_dtype) must be resolved from torch_npu.float4_e2m1fn_x2 (an int enum), not the torch.float4_e2m1fn_x2 dtype object, which recent op-plugin builds reject.

Online and offline share the same kernel path and layout; they differ only in the weight source (RTN at load vs msmodelslim calibration).
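
The weight layout described above (32-element blocks, one shared power-of-two exponent, two FP4 codes per byte) can be sketched in pure Python. This is a reference illustration following the general OCP MX recipe, not the NPU kernel. The tie-breaking rule, the low-nibble-first packing order, and returning the unbiased exponent rather than an encoded UE8M0 byte are simplifying assumptions, and all function names are hypothetical.

```python
import math

BLOCK_SIZE = 32
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes of codes 0..7
E2M1_EMAX = 2  # largest E2M1 exponent: 6.0 = 1.5 * 2**2


def quantize_block(values):
    """Quantize one block to (shared_exponent, list of 4-bit codes)."""
    amax = max(abs(v) for v in values)
    # Shared power-of-two scale chosen from the block maximum.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX if amax > 0 else 0
    scale = 2.0 ** shared_exp
    codes = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at 6.0
        # Nearest grid point; ties go to the smaller magnitude (simplification).
        code = min(range(8), key=lambda i: abs(E2M1_GRID[i] - mag))
        codes.append(code | (8 if v < 0 else 0))  # bit 3 carries the sign
    return shared_exp, codes


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first (assumed order)."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def dequantize_block(shared_exp, packed):
    """Invert pack_nibbles and quantize_block, up to rounding error."""
    scale = 2.0 ** shared_exp
    out = []
    for byte in packed:
        for code in (byte & 0x0F, byte >> 4):
            mag = E2M1_GRID[code & 7] * scale
            out.append(-mag if code & 8 else mag)
    return out


# Usage: a 32-element block occupies 16 bytes plus one shared exponent.
# exp, codes = quantize_block([0.0, 1.0, -3.0, 12.0] + [0.5] * 28)
# packed = pack_nibbles(codes)
```

In this example the largest value, 12.0, sets the shared scale to 2, so the 0.5 entries fall between grid points and round to zero. This is the kind of per-block range loss that a single power-of-two scale can cause.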

MXFP4 W4A4 for LLM dense models (e.g. Qwen3 / Qwen3.5): LLM dense W4A4 (MXFP4 4-bit weights + 4-bit activations) Linear support was added in PR #23795. Requires Ascend A5 series (Ascend 950) or newer — the dual-level online path uses the DualLevelQuantBatchMatmul op, which A2/A3 lack. On the Ascend NPU backend --quantization mxfp4 selects this W4A4 path (on GPU the same flag selects the upstream OCP MXFP4 MoE config instead).

Online W4A4 quantization (BF16/FP16 weights → dual-level MXFP4 at load time):

Command

python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --quantization mxfp4 \
    --device npu --attention-backend ascend \
    --host 0.0.0.0 --port 30000 \
    --mem-fraction-static 0.8 --tp-size 1

Offline W4A4 quantization (msmodelslim pre-quantized weights, W4A4_MXFP4 scheme; no --quantization flag needed — auto-detected from quant_model_description.json).

Implementation Notes:

Online (NPUDualLevelMXFP4LinearMethod) uses dual-level MXFP4: both weights and activations are quantized with a fine FP8 (E4M3) L0 block scale plus a coarser L1 scale via npu_dynamic_dual_level_mx_quant, and the matmul runs via npu_dual_level_quant_matmul (weight in FRACTAL_NZ). Dual-level captures per-block dynamic range far better than a single UE8M0 (power-of-2) scale, which is what made an earlier single-level RTN online path degenerate (greedy decoding could loop without emitting EOS).

Offline (ModelSlimMXFP4Scheme → NPUSingleLevelMXFP4OfflineLinearMethod) is single-level: msmodelslim’s W4A4_MXFP4 checkpoint ships single-level UE8M0 block scales (block_size = 32), so the matmul runs via npu_quant_matmul(..., x1_dtype=x2_dtype=torch_npu.float4_e2m1fn_x2, group_sizes=[1, 1, 32]). The online and offline paths therefore use different matmul kernels, unlike W4A8, where both share one kernel path.

As with W4A8, the packed-FP4 dtype passed to the NPU ops (dst_type / x2_dtype) must be resolved from torch_npu.float4_e2m1fn_x2 (an int enum), not the torch.float4_e2m1fn_x2 dtype object, which recent op-plugin builds reject.

Validated end-to-end on Ascend A5 hardware.

Diffusion Model Quantization on Ascend NPU

SGLang-Diffusion supports MXFP8 online and offline quantization for diffusion models (such as Wan2.2) on Ascend NPUs. MXFP8 requires A5; the ModelSlim W8A8/W4A4 schemes work on A2/A3.

Requirements for MXFP8: CANN ≥ 8.0.RC3, Ascend A5

Quantization method	`quant_type` in JSON	Scheme class	Mode	A2/A3 Supported	A5 Supported	Trigger
MXFP8 (W8A8)	—	`MXFP8Config`	Online	x	√	`--quantization mxfp8`
MXFP8 (W8A8)	`W8A8_MXFP8`	`ModelSlimMXFP8Scheme`	Offline	x	√	auto-detected from `quant_model_description.json`
W8A8 static	`W8A8`	`ModelSlimW8A8Int8`	Offline	√	TBD	auto-detected from `quant_model_description.json`
W8A8 dynamic	`W8A8_DYNAMIC`	`ModelSlimW8A8Int8`	Offline	√	TBD	auto-detected from `quant_model_description.json`
W4A4 dynamic	`W4A4_DYNAMIC`	`ModelSlimW4A4Int4`	Offline	√	TBD	auto-detected from `quant_model_description.json`

Online MXFP8 Quantization

Online quantization dynamically quantizes FP16/BF16 weights to MXFP8 at load time using npu_dynamic_mx_quant + npu_quant_matmul CANN kernels. Pass --quantization mxfp8 to override auto-detection.

Command

# Start the diffusion server with online MXFP8 quantization
sglang serve \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --quantization mxfp8 \
  --num-gpus 4

Command

# One-shot generation
sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --quantization mxfp8 \
  --prompt "a beautiful sunset over the mountains" \
  --save-output

Offline MXFP8 Quantization (ModelSlim)

For offline quantization, pre-quantize the model with msModelSlim and load the resulting checkpoint. The quantization scheme is auto-detected from quant_model_description.json, so no extra --quantization flag is needed.

Step 1: Quantize with msModelSlim

Command

msmodelslim quant \
  --model_path /path/to/wan2_2_float_weights \
  --save_path /path/to/wan2_2_mxfp8_weights \
  --device npu \
  --model_type Wan2_2 \
  --quant_type mxfp8 \
  --trust_remote_code True

Note: SGLang does not support quantized embeddings; disable embedding quantization when using msmodelslim.

Step 2: Convert to Diffusers format

msModelSlim saves quantized Wan2.2 weights in the original Wan format. Convert them to Diffusers format using the provided repack script:

Command

python python/sglang/multimodal_gen/tools/wan_repack.py \
  --input-path /path/to/wan2_2_mxfp8_weights \
  --output-path /path/to/wan2_2_mxfp8_diffusers

Then copy all files from the original Diffusers checkpoint (except the transformer/transformer_2 folders) into the output directory.

Step 3: Run inference

Command

sglang generate \
  --model-path /path/to/wan2_2_mxfp8_diffusers \
  --prompt "a beautiful sunset over the mountains" \
  --save-output

For pre-quantized checkpoints available on ModelScope, see modelscope/Eco-Tech.
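
The copy in Step 2 (everything from the original Diffusers checkpoint except the transformer and transformer_2 folders) can be scripted. The following is a minimal sketch using only the Python standard library; the function name and the example paths are hypothetical, and it assumes the two transformer folders sit at the top level of the checkpoint directory.

```python
import shutil
from pathlib import Path

# Folders that the repacked MXFP8 weights already provide.
SKIP_DIRS = {"transformer", "transformer_2"}


def copy_pipeline_files(original_dir, output_dir):
    """Copy every top-level entry except the transformer folders.

    Returns the sorted names of the entries that were copied.
    """
    src, dst = Path(original_dir), Path(output_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(src.iterdir()):
        if entry.name in SKIP_DIRS:
            continue  # keep the repacked quantized weights untouched
        target = dst / entry.name
        if entry.is_dir():
            shutil.copytree(entry, target, dirs_exist_ok=True)  # Python 3.8+
        else:
            shutil.copy2(entry, target)
        copied.append(entry.name)
    return copied


# Usage (paths are illustrative):
# copy_pipeline_files("/path/to/Wan2.2-T2V-A14B-Diffusers",
#                     "/path/to/wan2_2_mxfp8_diffusers")
```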