# Quantization on Ascend
To load an already quantized model, simply load the model weights and config. If the model has been quantized offline, there is no need to pass the `--quantization` argument when starting the engine; the quantization method is parsed automatically from the downloaded `quant_model_description.json` or `config.json`.
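For example, an offline-quantized checkpoint can be served directly; the model path below is a placeholder:

```shell
# No --quantization flag is needed: the method is read from
# quant_model_description.json / config.json shipped with the checkpoint.
python -m sglang.launch_server \
    --model-path /path/to/quantized-model
```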
SGLang supports mixed-bits quantization: each layer is defined and loaded independently, according to the quantization type specified for it in `quant_model_description.json`. Advanced mixed-bits support for MoE is in progress and will add independent quantization determination for the w13 (up-gate) and w2 (down) layers.
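The per-layer dispatch can be sketched as follows. Note that the exact schema of `quant_model_description.json` is an assumption here (a flat mapping from weight names to scheme strings), as is the `scheme_for_layer` helper; this is an illustration, not SGLang's actual loader code.

```python
import json

def scheme_for_layer(description: dict, layer_name: str, default: str = "float") -> str:
    """Return the quantization scheme recorded for a layer, else the default.

    `description` is the parsed quant_model_description.json (schema assumed).
    """
    return description.get(layer_name, default)

# Minimal example description: different layers carry different bit widths.
description = json.loads(
    '{"model.layers.0.self_attn.qkv_proj.weight": "W8A8",'
    ' "model.layers.0.mlp.gate_up_proj.weight": "W4A4"}'
)

print(scheme_for_layer(description, "model.layers.0.mlp.gate_up_proj.weight"))  # W4A4
print(scheme_for_layer(description, "model.layers.0.lm_head.weight"))           # float
```

Because the lookup is per weight name, any mix of schemes across layers is possible, which is what "mixed-bits" refers to above.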
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models |
|---|---|---|---|---|---|
| W4A4 dynamic | Linear | √ | √ | TBD | √ |
| W8A8 static | Linear | √ | √ | TBD | √ |
| W8A8 dynamic | Linear | √ | √ | TBD | √ |
| | Linear | x | x | WIP | WIP |
| W4A4 dynamic | MoE | √ | √ | TBD | x |
| W4A8 dynamic | MoE | √ | √ | TBD | x |
| W8A8 dynamic | MoE | √ | √ | TBD | x |
| | MoE | x | x | WIP | x |
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | √ | √ | TBD |
| W8A16 | Linear | √ | √ | TBD |
| W4A16 | MoE | √ | √ | TBD |
GPTQ on Ascend support:
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| | Linear | √ | √ | TBD |
| | Linear | √ | √ | TBD |
| | MoE | √ | √ | TBD |
| | MoE | √ | √ | TBD |
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | √ | √ | TBD |
| W8A16 | Linear | √ | √ | TBD |
| W4A16 | MoE | √ | √ | TBD |
| W8A16 | MoE | √ | √ | TBD |
Compressed-tensors (LLM Compressor) on Ascend support:
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| | Linear | √ | √ | TBD |
| | MoE | √ | √ | TBD |
| | MoE | √ | √ | TBD |
| | MoE | √ | √ | TBD |
Support for further schemes is in progress.