Quantization#
SGLang-Diffusion supports quantized transformer checkpoints. In most cases, keep the base model and the quantized transformer override separate.
Quick Reference#
Use these paths:
- `--model-path`: the base or original model
- `--transformer-path`: a quantized transformers-style transformer component directory that already contains its own `config.json`
- `--transformer-weights-path`: quantized transformer weights provided as a single safetensors file, a sharded safetensors directory, a local path, or a Hugging Face repo ID
Recommended example:
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "a curious pikachu"
For quantized transformers-style transformer component folders:
sglang generate \
--model-path /path/to/base-model \
--transformer-path /path/to/quantized-transformer \
--prompt "A Logo With Bold Large Text: SGL Diffusion"
NOTE: Some model-specific integrations also accept a quantized repo or local
directory directly as --model-path, but that is a compatibility path. If a
repo contains multiple candidate checkpoints, pass
--transformer-weights-path explicitly.
Quant Families#
Here, quant_family means a checkpoint and loading family with shared CLI
usage and loader behavior. It is not just the numeric precision or a kernel
backend.
| quant_family | checkpoint form | canonical CLI | supported models | extra dependency | platform / notes |
|---|---|---|---|---|---|
| Generic quantized transformer | Quantized transformer component folder (with its own `config.json`), or safetensors transformer weights | `--transformer-path` / `--transformer-weights-path` | ALL | None | Component-folder and single-file flows are both supported |
| ModelOpt FP8 | Converted ModelOpt FP8 transformer directory or repo with its own `config.json` | `--transformer-weights-path` | FLUX.2, Wan2.2 | None | Override config is read from the quantized transformer repo; `dit_cpu_offload` and `dit_layerwise_offload` are automatically disabled |
| NVFP4 | NVFP4 safetensors file, sharded directory, or repo providing transformer weights | `--transformer-weights-path` | FLUX.2 | `comfy-kitchen` (optional) | Blackwell can use a best-performance kit when available; otherwise SGLang falls back to the generic ModelOpt FP4 path |
| Nunchaku (SVDQuant) | Pre-quantized Nunchaku transformer weights, usually named `svdq-int4_r{rank}` or `svdq-fp4_r{rank}` | `--transformer-weights-path` | Model-specific support such as Qwen-Image, FLUX, and Z-Image | `nunchaku` | SGLang can infer precision and rank from the filename and supports both `int4` and `nvfp4` |
| ModelSlim (msmodelslim) | Pre-quantized msmodelslim transformer weights | `--model-path` (auto-detected) | Wan2.2 family | None | Currently only compatible with the Ascend NPU family; supports both offline and online (dynamic) activation quantization |
ModelOpt FP8#
Usage Examples#
ModelOpt FP8 checkpoints should be converted into an SGLang-loadable transformer override first, then loaded with the original base model:
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-weights-path BBuf/flux2-dev-modelopt-fp8-sglang-transformer \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output
sglang generate \
--model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
--transformer-weights-path BBuf/wan22-t2v-a14b-modelopt-fp8-sglang-transformer \
--prompt "a fox walking through neon rain" \
--save-output
Notes#
- `--transformer-weights-path` is the canonical flag for converted ModelOpt FP8 diffusion checkpoints.
- If the override repo or local directory contains its own `config.json`, SGLang reads the quantization config from that override instead of relying on the base model config.
- `dit_cpu_offload` and `dit_layerwise_offload` are automatically disabled for ModelOpt FP8 checkpoints because the runtime expects the transformed FP8 weights to remain GPU-resident in their column-major layout.
- To build the converted checkpoint yourself from a ModelOpt diffusers export, use `python -m sglang.multimodal_gen.tools.convert_modelopt_fp8_checkpoint` (see the sketch after this list).
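A rough sketch of the convert-then-load flow. The converter's exact flags are not documented on this page, so inspect its help output first; the local output path below is illustrative:
python -m sglang.multimodal_gen.tools.convert_modelopt_fp8_checkpoint --help

# Load the converted local directory as the transformer override (path illustrative):
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-weights-path /path/to/converted-modelopt-fp8-transformer \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output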
NVFP4#
Usage Examples#
Recommended usage keeps the base model and quantized transformer override separate:
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output
SGLang also supports passing the NVFP4 repo or local directory directly as
--model-path:
sglang generate \
--model-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output
Notes#
- `--transformer-weights-path` is still the canonical CLI for NVFP4 transformer checkpoints.
- Direct `--model-path` loading is a compatibility path for FLUX.2 NVFP4-style repos or local directories.
- If `--transformer-weights-path` is provided explicitly, it takes precedence over the compatibility `--model-path` flow.
- For local directories, SGLang first looks for `*-mixed.safetensors`, then falls back to loading from the directory (see the example below).
- On Blackwell, `comfy-kitchen` can provide the best-performance path when available; otherwise SGLang falls back to the generic ModelOpt FP4 path.
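For example, pointing the explicit override at a locally downloaded NVFP4 checkpoint directory (the local path is illustrative; an explicit `--transformer-weights-path` always wins over the `--model-path` compatibility flow):
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-weights-path /path/to/flux2-dev-nvfp4 \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output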
Nunchaku (SVDQuant)#
Install#
Install the runtime dependency first:
pip install nunchaku
For platform-specific installation methods and troubleshooting, see the Nunchaku installation guide.
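A quick sanity check that the dependency is importable in the same environment SGLang runs in:
python -c "import nunchaku"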
File Naming and Auto-Detection#
For Nunchaku checkpoints, --model-path should still point to the original
base model, while --transformer-weights-path points to the quantized
transformer weights.
If the basename of --transformer-weights-path contains the pattern
svdq-(int4|fp4)_r{rank}, SGLang will automatically:
- enable SVDQuant
- infer `--quantization-precision`
- infer `--quantization-rank`
Examples:
| checkpoint name fragment | inferred precision | inferred rank | notes |
|---|---|---|---|
| `svdq-int4_r32` | `int4` | 32 | Standard INT4 checkpoint |
| `svdq-int4_r128` | `int4` | 128 | Higher-quality INT4 checkpoint |
| `svdq-fp4_r32` | `nvfp4` | 32 | Standard NVFP4 checkpoint |
| `svdq-fp4_r128` | `nvfp4` | 128 | Higher-quality NVFP4 checkpoint |
Common filenames:
| filename | precision | rank | typical use |
|---|---|---|---|
| `svdq-int4_r32-*.safetensors` | `int4` | 32 | Balanced default |
| `svdq-int4_r128-*.safetensors` | `int4` | 128 | Quality-focused |
| `svdq-fp4_r32-*.safetensors` | `nvfp4` | 32 | RTX 50-series / NVFP4 path |
| `svdq-fp4_r128-*.safetensors` | `nvfp4` | 128 | Quality-focused NVFP4 |
| Lightning 4-step variant (precision and rank encoded in the filename) | | | Lightning 4-step |
| Lightning 8-step variant (precision and rank encoded in the filename) | | | Lightning 8-step |
If your checkpoint name does not follow this convention, pass
--enable-svdquant, --quantization-precision, and --quantization-rank
explicitly.
Usage Examples#
Recommended auto-detected flow:
sglang generate \
--model-path Qwen/Qwen-Image \
--transformer-weights-path /path/to/svdq-int4_r32-qwen-image.safetensors \
--prompt "a beautiful sunset" \
--save-output
Manual override when the filename does not encode the quant settings:
sglang generate \
--model-path Qwen/Qwen-Image \
--transformer-weights-path /path/to/custom_nunchaku_checkpoint.safetensors \
--enable-svdquant \
--quantization-precision int4 \
--quantization-rank 128 \
--prompt "a beautiful sunset" \
--save-output
Notes#
- `--transformer-weights-path` is the canonical flag for Nunchaku checkpoints. Older config names such as `quantized_model_path` are treated as compatibility aliases.
- Auto-detection only happens when the checkpoint basename matches `svdq-(int4|fp4)_r{rank}`.
- The CLI values are `int4` and `nvfp4`. In filenames, the NVFP4 variant is written as `fp4`.
- Lightning checkpoints usually expect a matching `--num-inference-steps`, such as `4` or `8` (see the example after this list).
- Current runtime validation only allows Nunchaku on NVIDIA CUDA Ampere (SM8x) or SM12x GPUs. Hopper (SM90) is currently rejected.
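As an illustration, a 4-step Lightning checkpoint whose filename follows the auto-detection pattern (the file name below is hypothetical) would typically be run with a matching step count:
sglang generate \
--model-path Qwen/Qwen-Image \
--transformer-weights-path /path/to/svdq-int4_r128-lightning-4steps.safetensors \
--num-inference-steps 4 \
--prompt "a beautiful sunset" \
--save-output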
ModelSlim#
MindStudio-ModelSlim (msModelSlim) is an offline model quantization and compression tool from MindStudio, optimized for Ascend hardware.
Installation
# Clone the repo and install msmodelslim:
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
Multimodal_sd quantization
Download the original floating-point weights of the model. Taking Wan2.2-T2V-A14B as an example, obtain the original weights from the Wan2.2-T2V-A14B repository, then install the remaining model-specific dependencies (see the ModelScope model card).
Note: You can find pre-quantized validated models on modelscope/Eco-Tech.
Run one-click quantization (recommended):
msmodelslim quant \
--model_path /path/to/wan2_2_float_weights \
--save_path /path/to/wan2_2_quantized_weights \
--device npu \
--model_type Wan2_2 \
--quant_type w8a8 \
--trust_remote_code True
For more detailed model quantization examples and information about supported models, see the examples section of the msModelSlim repo.
Note: SGLang does not support quantized embeddings; disable this option when quantizing with msmodelslim.
Auto-Detection and different formats
- For msmodelslim checkpoints, it is enough to specify only `--model-path`; quantization is detected automatically for each layer by parsing the `quant_model_description.json` config.
- For Wan2.2, only the Diffusers weight storage format is supported, whereas msmodelslim saves the quantized model in the original Wan2.2 format. Convert it with the `python/sglang/multimodal_gen/tools/wan_repack.py` script:
python wan_repack.py \
--input-path {path_to_quantized_model} \
--output-path {path_to_converted_model}
- After that, copy all remaining files from the original Diffusers checkpoint into the converted model directory (everything except the `transformer`/`transformer_2` folders), as sketched below.
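One way to perform that copy (paths are illustrative, reusing the placeholder names above):
# Copy everything from the original Diffusers checkpoint except the
# transformer/transformer_2 folders into the converted model directory.
rsync -a --exclude 'transformer' --exclude 'transformer_2' \
/path/to/original-diffusers-checkpoint/ {path_to_converted_model}/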
Usage Example
With the auto-detected flow:
sglang generate \
--model-path Eco-Tech/Wan2.2-T2V-A14B-Diffusers-w8a8 \
--prompt "a beautiful sunset" \
--save-output
Available Quantization Methods:
- [x] `W4A4_DYNAMIC` linear with online quantization of activations
- [x] `W8A8` linear with offline quantization of activations
- [x] `W8A8_DYNAMIC` linear with online quantization of activations
- [ ] `mxfp8` linear (in progress)