To load a model that has already been quantized, simply load its weights and config. Again, if the model has been quantized offline, there is no need to add the --quantization argument when starting the engine. The quantization method is parsed automatically from the downloaded quant_model_description.json or config.json.

SGLang supports mixed-bit quantization: each layer is defined and loaded independently, according to the quantization type specified for it in quant_model_description.json. Advanced mixed-bit support for MoE is in progress; it will add independent quantization determination for the w13 (up-gate) and w2 (down) layers.

ModelSlim on Ascend support:

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models |
|---|---|---|---|---|---|
| W4A4 dynamic | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> | <span style="color: green;">√</span> |
| W8A8 static | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> | <span style="color: green;">√</span> |
| W8A8 dynamic | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> | <span style="color: green;">√</span> |
| MXFP8 | Linear | <span style="color: red;">x</span> | <span style="color: red;">x</span> | <span style="color: blue;">WIP</span> | <span style="color: blue;">WIP</span> |
| W4A4 dynamic | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> | <span style="color: red;">x</span> |
| W4A8 dynamic | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> | <span style="color: red;">x</span> |
| W8A8 dynamic | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> | <span style="color: red;">x</span> |
| MXFP8 | MoE | <span style="color: red;">x</span> | <span style="color: red;">x</span> | <span style="color: blue;">WIP</span> | <span style="color: red;">x</span> |

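
The per-layer behaviour of mixed-bit quantization can be pictured with a small sketch. It assumes, purely for illustration, that quant_model_description.json maps each weight name to a quantization type string; the weight names, the type strings such as `W8A8_DYNAMIC`, and the helpers `scheme_for` and `group_by_scheme` are hypothetical and are not SGLang or ModelSlim APIs.

```python
import json

# Assumed layout of a mixed-bit description file: weight name -> quantization type.
# The keys and type strings here are illustrative, not a confirmed schema.
description = json.loads("""
{
  "model.layers.0.self_attn.q_proj.weight": "W8A8_DYNAMIC",
  "model.layers.0.mlp.down_proj.weight": "W4A4_DYNAMIC",
  "model.layers.0.input_layernorm.weight": "FLOAT"
}
""")


def scheme_for(desc: dict, weight_name: str, default: str = "FLOAT") -> str:
    """Return the quantization type recorded for one weight, or `default` if absent."""
    return desc.get(weight_name, default)


def group_by_scheme(desc: dict) -> dict:
    """Group weight names by quantization type, preserving file order within a group."""
    groups: dict = {}
    for name, scheme in desc.items():
        groups.setdefault(scheme, []).append(name)
    return groups


if __name__ == "__main__":
    for scheme, names in group_by_scheme(description).items():
        print(scheme, names)
```

Because every layer is resolved on its own, two linear layers in the same block can end up with different schemes, which is the property the mixed-bit description relies on.
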
AWQ on Ascend support:

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W4A16 | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |

GPTQ on Ascend support:

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W4A16 MoE | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W8A16 MoE | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |

Auto-round on Ascend support:

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W4A16 | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |

Compressed-tensors (LLM Compressor) on Ascend support:

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W8A8 dynamic | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W4A8 dynamic with/without activation clip | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W4A16 MoE | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W8A8 dynamic | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |

GGUF support on Ascend is in progress.
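
For the formats that ship their settings in the checkpoint, the automatic detection mentioned at the top of this section can be sketched as follows. This is a minimal illustration, not SGLang's loader: the helper `detect_quant_method` and the returned label `"modelslim"` are hypothetical, and the only convention relied on is the Hugging Face one of storing a `quant_method` entry under `quantization_config` in config.json.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def detect_quant_method(model_dir: str) -> Optional[str]:
    """Hypothetical helper: guess the quantization method from a checkpoint directory.

    A quant_model_description.json file is taken to mean a ModelSlim checkpoint
    (the label "modelslim" is an illustrative choice). Otherwise the method is
    read from config.json; None means no quantization metadata was found.
    """
    root = Path(model_dir)
    if (root / "quant_model_description.json").is_file():
        return "modelslim"
    config_path = root / "config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        quant_cfg = config.get("quantization_config") or {}
        return quant_cfg.get("quant_method")
    return None


def demo() -> Optional[str]:
    """Write a tiny GPTQ-style config.json into a temp directory and detect it."""
    with tempfile.TemporaryDirectory() as tmp:
        cfg = {"quantization_config": {"quant_method": "gptq", "bits": 4}}
        (Path(tmp) / "config.json").write_text(json.dumps(cfg), encoding="utf-8")
        return detect_quant_method(tmp)


if __name__ == "__main__":
    print(demo())
```

Checking for the description file first mirrors the order in which the two files are named earlier in this section; whether the real loader uses the same precedence is an assumption.
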