To load an already quantized model, simply load the model weights and config. Again, if the model has been quantized offline, there is no need to add the --quantization argument when starting the engine. The quantization method is parsed automatically from the downloaded quant_model_description.json or config.json.

SGLang supports mixed-bit quantization: each layer is defined and loaded independently, according to the quantization type specified for it in quant_model_description.json. Advanced mixed-bit support for MoE is in progress and will add independent quantization determination for the w13 (up-gate) and w2 (down) layers.

ModelSlim on Ascend support:
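Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models
W4A4 dynamic | Linear | | | TBD |
W8A8 static | Linear | | | TBD |
W8A8 dynamic | Linear | | | TBD |
MXFP8 | Linear | x | x | |
W4A4 dynamic | MoE | | | TBD | x
W4A8 dynamic | MoE | | | TBD | x
W8A8 dynamic | MoE | | | TBD | x
MXFP8 | MoE | x | x | WIP | x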
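The per-layer lookup behind mixed-bit loading can be pictured with a short sketch. It assumes that quant_model_description.json is a flat JSON object mapping weight names to quantization type strings. The weight names and type strings below are made up for illustration, and `group_layers_by_scheme` is a hypothetical helper, not an SGLang API.

```python
import json
from collections import defaultdict

# Hypothetical excerpt of a quant_model_description.json file (assumed layout:
# a flat mapping from weight name to quantization type string).
EXAMPLE_DESCRIPTION = json.loads("""
{
  "model.layers.0.self_attn.q_proj.weight": "W8A8_DYNAMIC",
  "model.layers.0.mlp.down_proj.weight": "W4A4_DYNAMIC",
  "model.layers.1.self_attn.q_proj.weight": "W8A8_DYNAMIC",
  "lm_head.weight": "FLOAT"
}
""")


def group_layers_by_scheme(description):
    """Return {scheme: [weight names]} so each layer's scheme is easy to inspect."""
    groups = defaultdict(list)
    for weight_name, scheme in description.items():
        groups[scheme].append(weight_name)
    return {scheme: sorted(names) for scheme, names in groups.items()}


groups = group_layers_by_scheme(EXAMPLE_DESCRIPTION)
for scheme, names in sorted(groups.items()):
    print(f"{scheme}: {len(names)} weight(s)")
```

Running the same grouping over a real checkpoint's description file is a quick way to confirm which schemes a mixed-bit model actually contains before launching the engine.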
AWQ on Ascend support:
Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported
W4A16 | Linear | | | TBD
W8A16 | Linear | | | TBD
W4A16 | MoE | | | TBD
GPTQ on Ascend support:
Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported
W4A16 | Linear | | | TBD
W8A16 | Linear | | | TBD
W4A16 MOE | MoE | | | TBD
W8A16 MOE | MoE | | | TBD
Auto-round on Ascend support:
Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported
W4A16 | Linear | | | TBD
W8A16 | Linear | | | TBD
W4A16 | MoE | | | TBD
W8A16 | MoE | | | TBD
Compressed-tensors (LLM Compressor) on Ascend support:
Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported
W8A8 dynamic | Linear | | | TBD
W4A8 dynamic with/without activation clip | MoE | | | TBD
W4A16 MOE | MoE | | | TBD
W8A8 dynamic | MoE | | | TBD
GGUF on Ascend support:
Quantization type | Layer type | A2 Supported | A3 Supported | A5 Supported
All GGUF types (standard, K-quant) | Linear | | | TBD
All GGUF types (standard, K-quant) | MoE | | | TBD
Usage Examples:
  • Dense model (e.g. Qwen3-14B-Q4_K_M.gguf):
python3 -m sglang.launch_server \
    --model-path Qwen3-14B-Q4_K_M.gguf \
    --device npu --attention-backend ascend \
    --host 0.0.0.0 --port 30000 \
    --mem-fraction-static 0.7 --tp-size 2
  • MoE model (e.g. Qwen3-30B-A3B-Q4_K_M.gguf):
python3 -m sglang.launch_server \
    --model-path Qwen3-30B-A3B-Q4_K_M.gguf \
    --device npu --attention-backend ascend \
    --host 0.0.0.0 --port 30000 \
    --mem-fraction-static 0.8 --tp-size 2
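Once either server above is running, it can be queried over HTTP. The sketch below is a minimal client for SGLang's native /generate endpoint using only the Python standard library. It assumes the client runs on the same machine as the server (hence 127.0.0.1) and reuses port 30000 from the commands above. The helper names `build_generate_request` and `generate` are illustrative, not part of SGLang.

```python
import json
import urllib.request


def build_generate_request(prompt, host="127.0.0.1", port=30000, max_new_tokens=32):
    """Build (but do not send) a POST request for the server's /generate endpoint."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0},
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the decoded JSON response from the server."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `generate("The capital of France is")` sends the prompt to the running server. Building the request in a separate function makes it possible to inspect the URL and payload without a live server.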
Implementation Notes:
  • GGUF weights are pre-dequantized to FP16/BF16 during model loading on CPU, then transferred to NPU for inference. This trades higher memory usage for faster runtime performance (no per-forward-pass dequantization overhead).
  • MoE layers use npu_grouped_matmul and npu_moe_init_routing / npu_moe_finalize_routing for high-performance expert computation.
  • TP (tensor parallelism) sharding is supported for both dense and MoE GGUF models.
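To make the memory trade-off of pre-dequantization concrete, here is a minimal reference sketch for one GGUF type, Q8_0, whose blocks hold an FP16 scale followed by 32 signed 8-bit values. It is written in pure Python for illustration only and is not SGLang's loader; `dequantize_q8_0` is a hypothetical name. A packed block takes 34 bytes, while the same 32 weights take 64 bytes in FP16, so this type grows by roughly 1.9x when expanded, and lower-bit types grow more.

```python
import struct

Q8_0_BLOCK_WEIGHTS = 32                    # weights per Q8_0 block
Q8_0_BLOCK_BYTES = 2 + Q8_0_BLOCK_WEIGHTS  # FP16 scale + 32 int8 values


def dequantize_q8_0(raw):
    """Expand packed Q8_0 blocks into a flat list of Python floats."""
    if len(raw) % Q8_0_BLOCK_BYTES != 0:
        raise ValueError("buffer length is not a whole number of Q8_0 blocks")
    weights = []
    for offset in range(0, len(raw), Q8_0_BLOCK_BYTES):
        # '<e32b' = little-endian FP16 scale followed by 32 signed bytes.
        scale, *quants = struct.unpack_from("<e32b", raw, offset)
        weights.extend(scale * q for q in quants)
    return weights


# Build one block by hand: scale 0.5, quantized values -16..15.
block = struct.pack("<e32b", 0.5, *range(-16, 16))
weights = dequantize_q8_0(block)

packed_bytes = len(block)        # size of the quantized block
fp16_bytes = 2 * len(weights)    # size of the same weights stored as FP16
print(weights[:4], packed_bytes, fp16_bytes)
```

Doing this expansion once at load time is what removes the per-forward-pass dequantization cost, at the price of holding the larger FP16/BF16 tensors in NPU memory.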