Quantization on Ascend


To load an already quantized model, simply load the model weights and config. If the model was quantized offline, there is no need to pass the --quantization argument when starting the engine: the quantization method is parsed automatically from the downloaded quant_model_description.json or config.json.
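The detection order described above can be sketched as follows. This is an illustrative helper, not SGLang's actual loader code; the `quantization_config` key in config.json follows the common Hugging Face checkpoint convention, and the return format is an assumption:

```python
import json
import os

def detect_quant_method(model_dir):
    """Hypothetical sketch of offline-quantization auto-detection.

    Prefers quant_model_description.json (ModelSlim-style checkpoints),
    then falls back to the quantization_config section of config.json.
    Returns None for an unquantized model.
    """
    desc_path = os.path.join(model_dir, "quant_model_description.json")
    if os.path.exists(desc_path):
        with open(desc_path) as f:
            return {"source": "quant_model_description.json",
                    "description": json.load(f)}

    cfg_path = os.path.join(model_dir, "config.json")
    if os.path.exists(cfg_path):
        with open(cfg_path) as f:
            quant_cfg = json.load(f).get("quantization_config")
        if quant_cfg is not None:
            return {"source": "config.json", "description": quant_cfg}

    # No quantization metadata found: treat as an unquantized model.
    return None
```

Because the metadata ships with the checkpoint, the same model directory loads identically whether or not the user remembers which scheme it was quantized with.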

SGLang supports mixed-bits quantization: each layer is defined and loaded independently, according to the quantization type specified for it in quant_model_description.json. Advanced mixed-bits support for MoE is in progress; it will add independent quantization selection for the w13 (up-gate) and w2 (down) layers.
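Per-layer selection can be sketched like this. The description format (layer name mapped to a dtype tag, with "FLOAT" meaning the layer keeps its original precision) mirrors ModelSlim-style description files, but the tag values and the helper itself are illustrative assumptions, not SGLang internals:

```python
def select_layer_quant(layer_name, description, default="unquantized"):
    """Pick a quantization scheme for one layer from a mixed-bits
    description dict (assumed format: layer name -> dtype tag)."""
    tag = description.get(layer_name, default)
    # "FLOAT" marks a layer that was deliberately left unquantized;
    # other tags (e.g. "W8A8_DYNAMIC") select the matching kernel.
    if tag == "FLOAT":
        return "unquantized"
    return tag.lower()

# Mixed-bits: each layer may carry a different scheme.
description = {
    "model.layers.0.self_attn.qkv_proj.weight": "W8A8_DYNAMIC",
    "model.layers.0.mlp.gate_up_proj.weight": "W4A8_DYNAMIC",
    "model.layers.0.mlp.down_proj.weight": "FLOAT",
}
```

Resolving each layer independently is what allows one checkpoint to mix, say, W8A8 attention projections with W4A8 MLP layers and full-precision sensitive layers.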

ModelSlim on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models |
|---------------------|------------|--------------|--------------|--------------|------------------|
| W4A4 dynamic        | Linear     | ✓            | ✓            | TBD          | ✓                |
| W8A8 static         | Linear     | ✓            | ✓            | TBD          | ✓                |
| W8A8 dynamic        | Linear     | ✓            | ✓            | TBD          | ✓                |
| MXFP8               | Linear     | x            | x            | WIP          | WIP              |
| W4A4 dynamic        | MoE        | ✓            | ✓            | TBD          | x                |
| W4A8 dynamic        | MoE        | ✓            | ✓            | TBD          | x                |
| W8A8 dynamic        | MoE        | ✓            | ✓            | TBD          | x                |
| MXFP8               | MoE        | x            | x            | WIP          | x                |

AWQ on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---------------------|------------|--------------|--------------|--------------|
| W4A16               | Linear     | ✓            | ✓            | TBD          |
| W8A16               | Linear     | ✓            | ✓            | TBD          |
| W4A16               | MoE        | ✓            | ✓            | TBD          |

GPTQ on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---------------------|------------|--------------|--------------|--------------|
| W4A16               | Linear     | ✓            | ✓            | TBD          |
| W8A16               | Linear     | ✓            | ✓            | TBD          |
| W4A16 MOE           | MoE        | ✓            | ✓            | TBD          |
| W8A16 MOE           | MoE        | ✓            | ✓            | TBD          |

Auto-round on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---------------------|------------|--------------|--------------|--------------|
| W4A16               | Linear     | ✓            | ✓            | TBD          |
| W8A16               | Linear     | ✓            | ✓            | TBD          |
| W4A16               | MoE        | ✓            | ✓            | TBD          |
| W8A16               | MoE        | ✓            | ✓            | TBD          |

Compressed-tensors (LLM Compressor) on Ascend support

| Quantization scheme                       | Layer type | A2 Supported | A3 Supported | A5 Supported |
|-------------------------------------------|------------|--------------|--------------|--------------|
| W8A8 dynamic                              | Linear     | ✓            | ✓            | TBD          |
| W4A8 dynamic with/without activation clip | MoE        | ✓            | ✓            | TBD          |
| W4A16 MOE                                 | MoE        | ✓            | ✓            | TBD          |
| W8A8 dynamic                              | MoE        | ✓            | ✓            | TBD          |

GGUF on Ascend support

In progress.