To load an already quantized model, simply load its weights and config. If the model was quantized offline, there is no need to pass the `--quantization` argument when starting the engine; the quantization method is parsed automatically from the downloaded `quant_model_description.json` or `config.json`.
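As a rough illustration of the lookup described above, the sketch below checks a local model directory for quantization metadata. The helper name `find_quant_metadata` is hypothetical and is not SGLang's loader. The lookup order, and the assumption that `config.json` carries a Hugging Face style `quantization_config` entry, are also illustrative rather than guaranteed.

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def find_quant_metadata(model_dir: str) -> Optional[Tuple[str, dict]]:
    """Return (file name, parsed metadata) for the first quantization
    description found in model_dir, or None if nothing is declared.

    Hypothetical helper for illustration; the lookup order is an assumption.
    """
    root = Path(model_dir)

    # ModelSlim-style checkpoints ship a dedicated description file.
    desc = root / "quant_model_description.json"
    if desc.is_file():
        return desc.name, json.loads(desc.read_text())

    # Otherwise look for a `quantization_config` section in config.json
    # (assumed Hugging Face convention).
    cfg = root / "config.json"
    if cfg.is_file():
        quant_cfg = json.loads(cfg.read_text()).get("quantization_config")
        if quant_cfg is not None:
            return cfg.name, quant_cfg

    return None
```

A `None` result here would correspond to a checkpoint that declares no offline quantization.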
SGLang supports mixed-bit quantization: each layer is defined and loaded independently, according to the quantization type specified for it in `quant_model_description.json`. Advanced mixed-bit support for MoE is in progress and will add independent quantization selection for the w13 (up-gate) and w2 (down) layers.
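To make the per-layer idea concrete, the sketch below tallies the quantization types declared in a description file. It assumes the file is a flat JSON object mapping weight names to type strings. That layout, the type strings, the sample entries, and the helper name `summarize_layer_types` are illustrative assumptions, not a documented schema.

```python
import json
from collections import Counter


def summarize_layer_types(description: dict) -> Counter:
    """Count how many weight tensors use each quantization type.

    Assumes a flat {weight_name: quant_type} mapping; keys that do not
    name a weight tensor (for example global metadata) are skipped.
    """
    return Counter(
        quant_type
        for name, quant_type in description.items()
        if name.endswith(".weight") and isinstance(quant_type, str)
    )


# Made-up sample mixing 8-bit, 4-bit and unquantized layers.
sample = json.loads("""
{
  "model_quant_type": "W8A8_DYNAMIC",
  "model.layers.0.self_attn.q_proj.weight": "W8A8_DYNAMIC",
  "model.layers.0.mlp.down_proj.weight": "W4A4_DYNAMIC",
  "model.layers.0.input_layernorm.weight": "FLOAT"
}
""")

for quant_type, count in sorted(summarize_layer_types(sample).items()):
    print(f"{quant_type}: {count}")
```

A summary like this is a quick way to confirm that a checkpoint really mixes bit widths before launching the server.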
ModelSlim on Ascend support
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models |
|---|---|---|---|---|---|
| W4A4 dynamic | Linear | √ | √ | TBD | √ |
| W8A8 static | Linear | √ | √ | TBD | √ |
| W8A8 dynamic | Linear | √ | √ | TBD | √ |
| MXFP8 | Linear | x | x | √ | √ |
| W4A4 dynamic | MoE | √ | √ | TBD | x |
| W4A8 dynamic | MoE | √ | √ | TBD | x |
| W8A8 dynamic | MoE | √ | √ | TBD | x |
| MXFP8 | MoE | x | x | WIP | x |
AWQ on Ascend support:
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | √ | √ | TBD |
| W8A16 | Linear | √ | √ | TBD |
| W4A16 | MoE | √ | √ | TBD |
GPTQ on Ascend support
Auto-round on Ascend support
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | √ | √ | TBD |
| W8A16 | Linear | √ | √ | TBD |
| W4A16 | MoE | √ | √ | TBD |
| W8A16 | MoE | √ | √ | TBD |
Compressed-tensors (LLM Compressor) on Ascend support:
GGUF on Ascend support
| Quantization type | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| All GGUF types (standard, K-quant) | Linear | √ | √ | TBD |
| All GGUF types (standard, K-quant) | MoE | √ | √ | TBD |
Usage Examples:
- Dense model (e.g. Qwen3-14B-Q4_K_M.gguf):

      python3 -m sglang.launch_server \
        --model-path Qwen3-14B-Q4_K_M.gguf \
        --device npu --attention-backend ascend \
        --host 0.0.0.0 --port 30000 \
        --mem-fraction-static 0.7 --tp-size 2

- MoE model (e.g. Qwen3-30B-A3B-Q4_K_M.gguf):

      python3 -m sglang.launch_server \
        --model-path Qwen3-30B-A3B-Q4_K_M.gguf \
        --device npu --attention-backend ascend \
        --host 0.0.0.0 --port 30000 \
        --mem-fraction-static 0.8 --tp-size 2

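Once either server is running, a quick smoke test is to send it a prompt. The sketch below uses only the Python standard library and targets the server's native `/generate` endpoint on port 30000, as in the commands above. The prompt, the sampling values, and the helper names `build_generate_request` and `generate` are illustrative.

```python
import json
import urllib.request


def build_generate_request(prompt: str, host: str = "127.0.0.1",
                           port: int = 30000,
                           max_new_tokens: int = 32) -> urllib.request.Request:
    """Build a POST request for the server's /generate endpoint."""
    payload = {
        "text": prompt,
        "sampling_params": {"temperature": 0, "max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt to a locally running server and return the completion."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["text"]
```

With the server up, `print(generate("The capital of France is"))` should print a short completion. A connection error usually means the server is still loading weights or is listening on a different port.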
Implementation Notes:
- GGUF weights are pre-dequantized to FP16/BF16 during model loading on CPU, then transferred to NPU for inference. This trades higher memory usage for faster runtime performance (no per-forward-pass dequantization overhead).
- MoE layers use `npu_grouped_matmul` and `npu_moe_init_routing` / `npu_moe_finalize_routing` for high-performance expert computation.
- TP (tensor parallelism) sharding is supported for both dense and MoE GGUF models.
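The load-time dequantization trade-off in the first note can be illustrated with a small, self-contained sketch. It uses the GGUF `Q8_0` block layout (one FP16 scale followed by 32 signed 8-bit values, 34 bytes per 32 weights) only because it is the simplest to show. This is pure-Python illustration code with hypothetical helper names, not the loader SGLang uses on NPU.

```python
import struct
from typing import List

Q8_0_BLOCK = struct.Struct("<e32b")  # FP16 scale + 32 int8 quants = 34 bytes


def dequantize_q8_0(raw: bytes) -> List[float]:
    """Expand packed Q8_0 blocks into plain floats (weight = scale * quant)."""
    if len(raw) % Q8_0_BLOCK.size != 0:
        raise ValueError("buffer is not a whole number of Q8_0 blocks")
    weights: List[float] = []
    for scale, *quants in Q8_0_BLOCK.iter_unpack(raw):
        weights.extend(scale * q for q in quants)
    return weights


def fp16_bytes(num_weights: int) -> int:
    """Memory needed once the weights are stored as FP16/BF16 (2 bytes each)."""
    return 2 * num_weights


# One block with scale 0.5 and quants 0..31.
block = Q8_0_BLOCK.pack(0.5, *range(32))
weights = dequantize_q8_0(block)
print(weights[:4])                            # [0.0, 0.5, 1.0, 1.5]
print(len(block), fp16_bytes(len(weights)))   # 34 64
```

In this example 34 bytes of packed data become 64 bytes in FP16, which is the extra memory the note refers to. In exchange, nothing needs to be unpacked during each forward pass.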