Quantization on Ascend


To load an already quantized model, simply load the model weights and config. If the model was quantized offline, there is no need to pass the --quantization argument when starting the engine: the quantization method is parsed automatically from the downloaded quant_model_description.json or config.json.
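The detection order described above can be sketched as follows. This is an illustrative helper, not SGLang's actual loader code; the `quantization_config` key in config.json follows the common Hugging Face checkpoint convention, and the return format is an assumption:

```python
import json
import os

def detect_quant_method(model_dir):
    """Hypothetical sketch of offline-quantization auto-detection.

    Prefers quant_model_description.json (ModelSlim-style checkpoints),
    then falls back to the quantization_config section of config.json.
    Returns None for an unquantized model.
    """
    desc_path = os.path.join(model_dir, "quant_model_description.json")
    if os.path.exists(desc_path):
        with open(desc_path) as f:
            return {"source": "quant_model_description.json",
                    "description": json.load(f)}

    cfg_path = os.path.join(model_dir, "config.json")
    if os.path.exists(cfg_path):
        with open(cfg_path) as f:
            quant_cfg = json.load(f).get("quantization_config")
        if quant_cfg is not None:
            return {"source": "config.json", "description": quant_cfg}

    # No quantization metadata found: treat as an unquantized model.
    return None
```

Because the metadata ships with the checkpoint, the same model directory loads identically whether or not the user remembers which scheme it was quantized with.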

SGLang supports mixed-bits quantization: each layer is defined and loaded independently, according to the quantization type specified for it in quant_model_description.json. Advanced mixed-bits support for MoE is in progress; it will add independent quantization selection for the w13 (up-gate) and w2 (down) layers.
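Per-layer selection can be sketched like this. The description format (layer name mapped to a dtype tag, with "FLOAT" meaning the layer keeps its original precision) mirrors ModelSlim-style description files, but the tag values and the helper itself are illustrative assumptions, not SGLang internals:

```python
def select_layer_quant(layer_name, description, default="unquantized"):
    """Pick a quantization scheme for one layer from a mixed-bits
    description dict (assumed format: layer name -> dtype tag)."""
    tag = description.get(layer_name, default)
    # "FLOAT" marks a layer that was deliberately left unquantized;
    # other tags (e.g. "W8A8_DYNAMIC") select the matching kernel.
    if tag == "FLOAT":
        return "unquantized"
    return tag.lower()

# Mixed-bits: each layer may carry a different scheme.
description = {
    "model.layers.0.self_attn.qkv_proj.weight": "W8A8_DYNAMIC",
    "model.layers.0.mlp.gate_up_proj.weight": "W4A8_DYNAMIC",
    "model.layers.0.mlp.down_proj.weight": "FLOAT",
}
```

Resolving each layer independently is what allows one checkpoint to mix, say, W8A8 attention projections with W4A8 MLP layers and full-precision sensitive layers.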

ModelSlim on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models |
|---------------------|------------|--------------|--------------|--------------|------------------|
| W4A4 dynamic        | Linear     | ✓            | ✓            | TBD          | ✓                |
| W8A8 static         | Linear     | ✓            | ✓            | TBD          | ✓                |
| W8A8 dynamic        | Linear     | ✓            | ✓            | TBD          | ✓                |
| MXFP8               | Linear     | x            | x            | WIP          | WIP              |
| W4A4 dynamic        | MoE        | ✓            | ✓            | TBD          | x                |
| W4A8 dynamic        | MoE        | ✓            | ✓            | TBD          | x                |
| W8A8 dynamic        | MoE        | ✓            | ✓            | TBD          | x                |
| MXFP8               | MoE        | x            | x            | WIP          | x                |

AWQ on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---------------------|------------|--------------|--------------|--------------|
| W4A16               | Linear     | ✓            | ✓            | TBD          |
| W8A16               | Linear     | ✓            | ✓            | TBD          |
| W4A16               | MoE        | ✓            | ✓            | TBD          |

GPTQ on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---------------------|------------|--------------|--------------|--------------|
| W4A16               | Linear     | ✓            | ✓            | TBD          |
| W8A16               | Linear     | ✓            | ✓            | TBD          |
| W4A16 MOE           | MoE        | ✓            | ✓            | TBD          |
| W8A16 MOE           | MoE        | ✓            | ✓            | TBD          |

Auto-round on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---------------------|------------|--------------|--------------|--------------|
| W4A16               | Linear     | ✓            | ✓            | TBD          |
| W8A16               | Linear     | ✓            | ✓            | TBD          |
| W4A16               | MoE        | ✓            | ✓            | TBD          |
| W8A16               | MoE        | ✓            | ✓            | TBD          |

Compressed-tensors (LLM Compressor) on Ascend support

| Quantization scheme                       | Layer type | A2 Supported | A3 Supported | A5 Supported |
|-------------------------------------------|------------|--------------|--------------|--------------|
| W8A8 dynamic                              | Linear     | ✓            | ✓            | TBD          |
| W4A8 dynamic with/without activation clip | MoE        | ✓            | ✓            | TBD          |
| W4A16 MOE                                 | MoE        | ✓            | ✓            | TBD          |
| W8A8 dynamic                              | MoE        | ✓            | ✓            | TBD          |

GGUF on Ascend support

In progress.