--quantization to enable online quantization at the same time.
For popular pre-quantized models, see the Unsloth, NVIDIA ModelOpt, or NeuralMagic collections on Hugging Face, which host quality-validated quantized models. Quantized models must be validated via benchmarks after quantization to guard against abnormal quantization loss regressions.
Platform Compatibility
The following table summarizes quantization method support across NVIDIA GPUs, AMD GPUs, and Ascend NPUs.
| Method | NVIDIA GPUs | AMD GPUs (MI300X/MI325X/MI350X) | Ascend NPUs (A2/A3) | Notes |
|---|---|---|---|---|
| fp8 | Yes | Yes | WIP | Aiter or Triton backend on AMD |
| mxfp4 | Yes | Yes | WIP | Requires CDNA3/CDNA4 with MXFP support; uses Aiter |
| blockwise_int8 | Yes | Yes | No | Triton-based, works on both platforms |
| w8a8_int8 | Yes | Yes | No | |
| w8a8_fp8 | Yes | Yes | No | Aiter or Triton FP8 on AMD |
| awq | Yes | Yes | Yes | Uses Triton dequantize on AMD (vs. optimized CUDA kernels on NVIDIA). Uses CANN kernels on Ascend |
| gptq | Yes | Yes | Yes | Uses Triton or vLLM kernels on AMD. Uses CANN kernels on Ascend |
| compressed-tensors | Yes | Yes | Partial | Aiter paths for FP8/MoE on AMD. Uses CANN kernels on Ascend, FP8 not supported yet |
| quark | Yes | Yes | No | AMD Quark quantization; Aiter GEMM paths on AMD |
| auto-round | Yes | Yes | Partial | Platform-agnostic (Intel auto-round). Uses CANN kernels on Ascend |
| quark_int4fp8_moe | No | Yes | No | AMD-only; online INT4-to-FP8 MoE quantization (CDNA3/CDNA4) |
| awq_marlin | Yes | No | No | Marlin kernels are CUDA-only |
| gptq_marlin | Yes | No | No | Marlin kernels are CUDA-only |
| gguf | Yes | No | WIP | CUDA-only kernels in sgl-kernel |
| modelopt / modelopt_fp8 | Yes (Hopper/SM90+) | No | No | NVIDIA ModelOpt; requires NVIDIA hardware |
| modelopt_fp4 | Yes (Blackwell/SM100+) | No | No | NVIDIA ModelOpt; native FP4 on Blackwell (B200, GB200) |
| petit_nvfp4 | No | Yes (MI250/MI300X/MI325X) | No | Enables NVFP4 on ROCm via Petit; use modelopt_fp4 on NVIDIA Blackwell. Auto-selected when loading NVFP4 models on AMD. See LMSYS blog and AMD ROCm blog. |
| bitsandbytes | Yes | Experimental | No | Depends on bitsandbytes ROCm support |
| torchao (int4wo, etc.) | Yes | Partial | No | int4wo not supported on AMD; other methods may work |
| modelslim | No | No | Yes | Ascend quantization; uses CANN kernels |
SGLANG_USE_AITER=1 where noted. See AMD GPU setup for installation and configuration details.
On Ascend, quantization configurations for various layers are supported; see Ascend NPU quantization for details.
GEMM Backends for FP4/FP8 Quantization
Backend selection is supported only for blockwise FP8 and NVFP4 GEMM. When running FP8 or FP4 quantized models, you can select the GEMM backend via --fp8-gemm-backend and --fp4-gemm-backend.
--fp8-gemm-backend (Blockwise FP8 GEMM)
| Backend | Hardware | Description |
|---|---|---|
| auto | All | Auto-selects based on hardware |
| deep_gemm | SM90, SM100 | JIT-compiled; enabled when DeepGEMM is installed |
| flashinfer_trtllm | SM100 | FlashInfer TensorRT-LLM backend; optimal for low-latency |
| flashinfer_cutlass | SM100/120 | FlashInfer CUTLASS groupwise FP8 GEMM |
| flashinfer_deepgemm | SM90 | Uses swapAB optimization for small M dimensions in decoding |
| cutlass | SM90, SM100/120 | sgl-kernel CUTLASS |
| triton | All | Fallback; widely compatible |
| aiter | ROCm | AMD AITER backend |
auto selection order: 1) DeepGEMM (SM90/SM100, installed); 2) FlashInfer TRTLLM (SM100, FlashInfer available); 3) CUTLASS (SM90/SM100/120); 4) AITER (AMD); 5) Triton. Exception: SM120 always resolves to Triton.
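The selection order above can be sketched as a small Python function. This is only an illustration of the documented order, not SGLang's actual implementation: `resolve_fp8_gemm_backend` and all of its parameters are hypothetical names, and availability checks (DeepGEMM installed, FlashInfer available) are passed in as plain booleans.

```python
from typing import Optional


def resolve_fp8_gemm_backend(
    sm: Optional[int],
    is_rocm: bool = False,
    deep_gemm_installed: bool = False,
    flashinfer_available: bool = False,
) -> str:
    """Return the backend that `auto` would pick, following the documented order.

    `sm` is the CUDA compute capability as an integer (90, 100, 120, ...),
    or None on non-NVIDIA devices. All names here are hypothetical.
    """
    if sm == 120:
        # Documented exception: SM120 always resolves to Triton.
        return "triton"
    if sm in (90, 100) and deep_gemm_installed:
        return "deep_gemm"          # 1) DeepGEMM
    if sm == 100 and flashinfer_available:
        return "flashinfer_trtllm"  # 2) FlashInfer TRTLLM
    if sm in (90, 100):
        return "cutlass"            # 3) CUTLASS
    if is_rocm:
        return "aiter"              # 4) AITER on AMD
    return "triton"                 # 5) Fallback


print(resolve_fp8_gemm_backend(100, flashinfer_available=True))
```

Passing an explicit backend name to --fp8-gemm-backend bypasses this order entirely, which is useful when benchmarking one kernel against another.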
--fp4-gemm-backend (NVFP4 GEMM)
| Backend | Hardware | Description |
|---|---|---|
| auto | SM100/120 | Auto-selects: flashinfer_cudnn on SM120; flashinfer_cutlass on SM100 |
| cutlass | SM100/120 | SGLang CUTLASS kernel |
| flashinfer_cutlass | SM100/120 | FlashInfer CUTLASS backend |
| flashinfer_cudnn | SM100/120 (CUDA 13+, cuDNN 9.15+) | FlashInfer cuDNN backend; used on SM120 for performance |
| flashinfer_trtllm | SM100 | FlashInfer TensorRT-LLM backend |
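As a rough sketch, the backend names from the two tables can be checked before they are appended to a launch command. The helper `gemm_backend_flags` is a hypothetical name introduced for illustration; only the two flag names and the backend values come from the tables above.

```python
from typing import List, Optional

# Backend names copied from the --fp8-gemm-backend and --fp4-gemm-backend tables.
FP8_GEMM_BACKENDS = {
    "auto", "deep_gemm", "flashinfer_trtllm", "flashinfer_cutlass",
    "flashinfer_deepgemm", "cutlass", "triton", "aiter",
}
FP4_GEMM_BACKENDS = {
    "auto", "cutlass", "flashinfer_cutlass", "flashinfer_cudnn", "flashinfer_trtllm",
}


def gemm_backend_flags(fp8: Optional[str] = None, fp4: Optional[str] = None) -> List[str]:
    """Build CLI fragments for the GEMM backend flags (hypothetical helper)."""
    flags: List[str] = []
    if fp8 is not None:
        if fp8 not in FP8_GEMM_BACKENDS:
            raise ValueError(f"unknown FP8 GEMM backend: {fp8}")
        flags += ["--fp8-gemm-backend", fp8]
    if fp4 is not None:
        if fp4 not in FP4_GEMM_BACKENDS:
            raise ValueError(f"unknown FP4 GEMM backend: {fp4}")
        flags += ["--fp4-gemm-backend", fp4]
    return flags


print(gemm_backend_flags(fp4="flashinfer_cudnn"))
```

The check covers names only. Hardware constraints from the tables (for example, flashinfer_trtllm being limited to SM100) are not enforced here.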
Offline Quantization
To load already quantized models, simply load the model weights and config. Again, if the model has been quantized offline, there is no need to add the --quantization argument when starting the engine. The quantization method will be parsed from the downloaded Hugging Face or msModelSlim config. For example, DeepSeek V3/R1 models are already in FP8, so do not add redundant parameters.
Use --quantization w8a8_int8 or --quantization w8a8_fp8 to invoke the corresponding CUTLASS int8_kernel or fp8_kernel in sgl-kernel. Doing so ignores the quantization settings in the Hugging Face config. For instance, with neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic, if you launch with --quantization w8a8_fp8, the system will use the W8A8Fp8Config from SGLang to invoke the sgl-kernel, rather than the CompressedTensorsConfig for vLLM kernels.
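The two launch modes described above can be sketched by building the argument list in Python. The entry point `python3 -m sglang.launch_server` and the `--model-path` flag are taken from SGLang's usual launch invocation rather than from this section, so treat them as assumptions; `launch_command` is a hypothetical helper.

```python
import shlex
from typing import List, Optional


def launch_command(model_path: str, quantization: Optional[str] = None) -> List[str]:
    """Build a server launch command (hypothetical helper).

    With quantization=None the method is read from the checkpoint config;
    passing a value such as "w8a8_fp8" overrides that config.
    """
    cmd = ["python3", "-m", "sglang.launch_server", "--model-path", model_path]
    if quantization is not None:
        cmd += ["--quantization", quantization]
    return cmd


model = "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic"
# Pre-quantized checkpoint: no --quantization argument.
print(shlex.join(launch_command(model)))
# Explicit override to use the sgl-kernel FP8 path.
print(shlex.join(launch_command(model, "w8a8_fp8")))
```

The resulting list can be handed to `subprocess.run` unchanged, which avoids shell quoting issues with unusual model paths.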
Examples of Offline Model Quantization
Using Unsloth
We strongly recommend using Unsloth to quantize and load the model. Please refer to SGLang Deployment & Inference Guide with Unsloth.
Using auto-round
- LLM quantization
- VLM quantization
- Command Line Usage (Gaudi/CPU/Intel GPU/CUDA)
- Known issues
  - Mixed-bit Quantization Limitations: Mixed-bit quantization is not fully supported. Due to vLLM's layer fusion (e.g., QKV fusion), applying different bit-widths to components within the same fused layer can lead to compatibility issues.
  - Limited Support for Quantized MoE Models: Quantized MoE models may encounter inference issues due to kernel limitations (e.g., lack of support for mlp.gate layer quantization). Try skipping quantization of these layers to avoid such errors.
  - Limited Support for Quantized VLMs (Qwen2.5-VL-7B):
    - auto_round:auto_gptq format: Accuracy is close to zero.
    - GPTQ format: Fails with:
    - auto_round:auto_awq and AWQ format: These work as expected.
Using GPTQModel
Using LLM Compressor
Here, we quantize meta-llama/Meta-Llama-3-8B-Instruct to FP8 as an example of how to do offline quantization.
The quantized model can then be used directly with SGLang.
Using NVIDIA ModelOpt
NVIDIA Model Optimizer (ModelOpt) provides advanced quantization techniques optimized for NVIDIA hardware.
Offline vs. Online Quantization: SGLang supports two modes for ModelOpt.
- Offline Quantization (pre-quantized):
  - Usage: Download a pre-quantized model from Hugging Face or run hf_ptq.py once to create a new quantized checkpoint. Then load this quantized checkpoint.
  - Pros: Fast server startup, quantization can be validated before deployment, efficient resource usage.
  - Cons: Requires an extra preparation step.
- Online Quantization (quant and serve):
  - Usage: Load a standard BF16/FP16 model and add a flag. The engine applies quantization on startup.
  - Pros: Convenient (no new checkpoint needed).
  - Cons: High startup time, increases VRAM usage during initialization (risk of OOM).
Using Pre-Quantized Checkpoints
If a model is already quantized (e.g., from Hugging Face), you can load it directly.
- FP8 Models: Use --quantization modelopt_fp8.
- FP4 Models: Use --quantization modelopt_fp4.
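A minimal sketch of picking the flag for a pre-quantized ModelOpt checkpoint, combined with the minimum architectures given in the platform table (SM90+ for modelopt_fp8, SM100+ for modelopt_fp4). The helper `modelopt_quantization_flag` and its parameters are hypothetical, and the compute capability is supplied by the caller rather than detected.

```python
from typing import List

# Method name and minimum compute capability, per the platform compatibility table.
MODELOPT_METHODS = {
    "fp8": ("modelopt_fp8", 90),   # Hopper and newer
    "fp4": ("modelopt_fp4", 100),  # Blackwell and newer
}


def modelopt_quantization_flag(precision: str, sm: int) -> List[str]:
    """Return the --quantization fragment for a ModelOpt checkpoint (hypothetical helper)."""
    if precision not in MODELOPT_METHODS:
        raise ValueError(f"unsupported ModelOpt precision: {precision}")
    method, min_sm = MODELOPT_METHODS[precision]
    if sm < min_sm:
        raise ValueError(f"{method} needs SM{min_sm}+, got SM{sm}")
    return ["--quantization", method]


print(modelopt_quantization_flag("fp8", sm=90))
```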
Creating Your Own Quantized Checkpoints
If a pre-quantized checkpoint is not available for your model, you can create one using NVIDIA Model Optimizer's hf_ptq.py script.
Why quantize?
- Reduce VRAM usage
- Higher throughput and lower latency
- More flexible deployment (on smaller GPUs)
- The entire model
- MLP layers only
- KV cache
hf_ptq.py:
- --qformat: Quantization format (fp8, nvfp4, nvfp4_mlp_only)
- --kv_cache_qformat: KV cache quantization format (default: fp8)
Note: The default kv_cache_qformat may not be optimal for all use cases. Consider setting this explicitly.
Hardware requirements: Hopper and higher are recommended. Insufficient GPU memory may cause weight offloading, resulting in extremely long quantization time.
For detailed usage and supported model architectures, see NVIDIA Model Optimizer LLM PTQ.
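Since the default kv_cache_qformat may not suit every use case, one option is to always pass both format options explicitly. The sketch below builds only that fragment of an hf_ptq.py invocation; `hf_ptq_format_options` is a hypothetical helper, the remaining script arguments (checkpoint and export paths) are deliberately left out, and values for --kv_cache_qformat other than the documented default are passed through unchecked.

```python
from typing import List

# Formats listed for --qformat in this section.
QFORMATS = ("fp8", "nvfp4", "nvfp4_mlp_only")


def hf_ptq_format_options(qformat: str, kv_cache_qformat: str = "fp8") -> List[str]:
    """Return the format-related options for hf_ptq.py (hypothetical helper).

    The KV cache format is always emitted, so the choice is visible in the
    command instead of relying on the script default.
    """
    if qformat not in QFORMATS:
        raise ValueError(f"unknown qformat: {qformat}")
    return ["--qformat", qformat, "--kv_cache_qformat", kv_cache_qformat]


print(hf_ptq_format_options("nvfp4"))
```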
SGLang includes a streamlined workflow for quantizing models with ModelOpt and automatically exporting them for deployment.
Installation
First, install ModelOpt.
Quantization and Export Workflow
SGLang provides an example script that demonstrates the complete ModelOpt quantization and export workflow. Run it from the SGLang repository root (see modelopt_quantize_and_export.py).
Available Quantization Methods
- modelopt_fp8: FP8 quantization with optimal performance on NVIDIA Hopper and Blackwell GPUs
- modelopt_fp4: FP4 quantization with optimal performance on NVIDIA Blackwell GPUs
Python API Usage
You can also use ModelOpt quantization programmatically.
Deploying Quantized Models
After quantization and export, you can deploy the model with SGLang. When loading the exported model, use the same path as modelopt_export_path from the quantize step.
Advanced Features
Checkpoint Management: Save and restore fake quantized checkpoints for reuse.
Benefits of ModelOpt
- Hardware Optimization: Specifically optimized for NVIDIA GPU architectures
- Advanced Quantization: Supports cutting-edge FP8 and FP4 quantization techniques
- Seamless Integration: Automatic export to HuggingFace format for easy deployment
- Calibration-based: Uses calibration datasets for optimal quantization quality
- Production Ready: Enterprise-grade quantization with NVIDIA support
Using ModelSlim
MindStudio-ModelSlim (msModelSlim) is an offline model quantization and compression tool launched by MindStudio and optimized for Ascend hardware.
- Installation
- LLM quantization
  Download the original floating-point weights of the model. Taking Qwen3-32B as an example, you can go to Qwen3-32B to obtain the original model weights. Then install other dependencies (related to the model, refer to the Hugging Face model card).
  Note: You can find pre-quantized validated models on modelscope/Eco-Tech.
  Traditional quantization methods require calibration data files (.jsonl format) for calibration during the quantization process.
  Run quantization using one-click quantization (recommended).
- Usage Example
- Available Quantization Methods:
  - W4A4_DYNAMIC linear with online quantization of activations
  - W8A8 linear with offline quantization of activations
  - W8A8_DYNAMIC linear with online quantization of activations
  - W4A4_DYNAMIC MoE with online quantization of activations
  - W4A8_DYNAMIC MoE with online quantization of activations
  - W8A8_DYNAMIC MoE with online quantization of activations
  - W4A8 linear TBD
  - W4A16 linear TBD
  - W48A16 linear TBD
  - W4A16 MoE in progress
  - W8A16 MoE in progress
  - KV Cache in progress
  - Attention in progress
Online Quantization
To enable online quantization, specify --quantization on the command line. For example, to enable FP8 quantization for the model meta-llama/Meta-Llama-3.1-8B-Instruct, pass --quantization fp8 when launching the server.
["awq", "gptq", "marlin", "gptq_marlin", "awq_marlin", "bitsandbytes", "gguf"].
torchao online quantization method
SGLang also supports quantization methods based on torchao. Specify --torchao-config on the command line to use this feature. For example, to enable int4wo-128 for the model meta-llama/Meta-Llama-3.1-8B-Instruct, pass --torchao-config int4wo-128 when launching the server.
["int8dq", "int8wo", "fp8wo", "fp8dq-per_tensor", "fp8dq-per_row", "int4wo-32", "int4wo-64", "int4wo-128", "int4wo-256"].
Note: According to this issue, the "int8dq" method currently has some bugs when used together with CUDA graph capture. We therefore suggest disabling CUDA graph capture when using the "int8dq" method.
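The torchao options above can be sketched as a small command builder that also applies the int8dq recommendation. The entry point `python3 -m sglang.launch_server`, the `--model-path` flag, and the `--disable-cuda-graph` flag are not shown in this section and are assumptions based on SGLang's usual server arguments; `torchao_launch_command` is a hypothetical helper.

```python
import shlex
from typing import List

# Method names as listed in this section.
TORCHAO_METHODS = [
    "int8dq", "int8wo", "fp8wo", "fp8dq-per_tensor", "fp8dq-per_row",
    "int4wo-32", "int4wo-64", "int4wo-128", "int4wo-256",
]


def torchao_launch_command(model_path: str, torchao_config: str) -> List[str]:
    """Build a launch command with a torchao config (hypothetical helper)."""
    if torchao_config not in TORCHAO_METHODS:
        raise ValueError(f"unknown torchao config: {torchao_config}")
    cmd = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--torchao-config", torchao_config,
    ]
    if torchao_config == "int8dq":
        # Assumed flag name for turning off CUDA graph capture.
        cmd.append("--disable-cuda-graph")
    return cmd


model = "meta-llama/Meta-Llama-3.1-8B-Instruct"
print(shlex.join(torchao_launch_command(model, "int4wo-128")))
print(shlex.join(torchao_launch_command(model, "int8dq")))
```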
quark_int4fp8_moe online quantization method
SGLang running on AMD GPUs (CDNA3 or CDNA4 architecture) supports the quantization method --quantization quark_int4fp8_moe, which replaces MoE layers originally in high precision (bfloat16, float16, or float32) with layers whose weights are dynamically quantized to int4. During inference, these weights are upcast to float8 so that compute runs in float8 precision, with activations dynamically quantized to float8 on the fly.
Other layers (e.g., projections in the attention layers) have their weights quantized online to float8 directly.
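To give a feel for the int4 weight step, here is a minimal pure-Python sketch of symmetric int4 quantization of a single weight row. It is not Quark's or SGLang's kernel: the per-row scale, the rounding mode, and the function names are all assumptions chosen for illustration, and the later float8 upcast and activation quantization are not modeled.

```python
from typing import List, Tuple


def quantize_int4_symmetric(row: List[float]) -> Tuple[List[int], float]:
    """Quantize one row to signed int4 values in [-8, 7] with a single scale.

    Illustrative only: per-row scaling is an assumption, not the
    documented behavior of quark_int4fp8_moe.
    """
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in row]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map int4 codes back to real values."""
    return [v * scale for v in q]


weights = [0.12, -0.50, 0.33, 0.07]
q, scale = quantize_int4_symmetric(weights)
restored = dequantize(q, scale)
print(q)         # int4 codes
print(restored)  # approximation of the original weights
```

With this scheme each restored weight differs from the original by at most half a quantization step (scale / 2), which is the accuracy cost traded for the smaller weight footprint.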
