Quick Reference
Use these paths:
- --model-path: the base or original model
- --transformer-path: a quantized transformers-style transformer component directory that already contains its own config.json
- --transformer-weights-path: quantized transformer weights provided as a single safetensors file, a sharded safetensors directory, a local path, or a Hugging Face repo ID
- --quantization: apply online quantization to unquantized models at load time (activations are quantized dynamically)
- --quantization-ignored-layers: layer name patterns to keep unquantized (e.g. attention.to_)
Some checkpoints, such as FLUX.2 NVFP4-style repos, can also be loaded directly with --model-path, but that is a compatibility path. If a
repo contains multiple candidate checkpoints, pass
--transformer-weights-path explicitly.
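The choice between the two override flags for a local checkpoint can be sketched as a small helper. This is only an illustration of the rule stated above, not SGLang's loader logic: the function name pick_override_flag is hypothetical, and it inspects local paths only (Hugging Face repo IDs are not resolved here).

```python
from pathlib import Path


def pick_override_flag(checkpoint: str) -> str:
    """Suggest a CLI flag for a local quantized transformer checkpoint.

    Hypothetical helper for illustration; the real loader may apply
    additional checks (for example, metadata probing).
    """
    path = Path(checkpoint)
    if not path.exists():
        # This sketch deliberately does not resolve Hugging Face repo IDs.
        raise ValueError(f"not a local path: {checkpoint}")
    # A component directory that ships its own config.json.
    if path.is_dir() and (path / "config.json").is_file():
        return "--transformer-path"
    # A single safetensors file or a directory of raw/sharded weights.
    return "--transformer-weights-path"
```

The base model is still passed separately through --model-path in both cases.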
Quant Families
Here, quant_family means a checkpoint and loading family with shared CLI
usage and loader behavior. It is not just the numeric precision or a kernel
backend.
| quant_family | checkpoint form | canonical CLI | supported models | extra dependency | platform / notes |
|---|---|---|---|---|---|
| fp8 / mxfp4 (online quantization) | Unquantized checkpoint (offline via AMD Quark coming soon) | --quantization {fp8,mxfp4} | Z-Image-Turbo (validated); others likely work. More support coming soon. | MXFP4: aiter on ROCm | MXFP4 requires ROCm and MI350+ (gfx95x). Weights are quantized at load time; activations are quantized to fp8 / mxfp4 dynamically. |
| fp8 (offline quantization) | Quantized transformer component folder, or safetensors with quantization_config metadata | --transformer-path or --transformer-weights-path | All | None | Component-folder and single-file flows are both supported |
| modelopt-fp8 | Converted ModelOpt FP8 transformer directory or repo with config.json | --transformer-path | FLUX.1, FLUX.2, Wan2.2, HunyuanVideo, Qwen Image, Qwen Image Edit | None | Serialized config stays quant_method=modelopt with quant_algo=FP8; dit_layerwise_offload is supported and dit_cpu_offload stays disabled |
| modelopt-nvfp4 | Mixed transformer directory/repo with config.json, or raw NVFP4 safetensors export/repo | --transformer-path for mixed overrides; --transformer-weights-path for raw exports | FLUX.1, FLUX.2, Wan2.2 | None | Mixed override repos keep the base model separate; raw exports such as black-forest-labs/FLUX.2-dev-NVFP4 still use the weights-path flow |
| nunchaku-svdq | Pre-quantized Nunchaku transformer weights, usually named svdq-{int4\|fp4}_r{rank}-… | --transformer-weights-path | Model-specific support such as Qwen-Image, FLUX, and Z-Image | nunchaku | SGLang can infer precision and rank from the filename and supports both int4 and nvfp4 |
| msmodelslim | Pre-quantized msmodelslim transformer weights | --model-path | Wan2.2 family | None | Currently only compatible with the Ascend NPU family and supports mxfp8, mxfp4, w8a8, and w4a4 |
Online Quantization
Online quantization applies quantization to unquantized models at load time. This is useful when pre-quantized checkpoints are not available.
FP8 Online Quantization
Apply FP8 quantization to any unquantized model by passing --quantization fp8.
MXFP4 Online Quantization
MXFP4 provides aggressive 4-bit compression with online quantization; enable it with --quantization mxfp4. Note: this requires ROCm with an MI350+ (gfx95x) GPU and the aiter package with MXFP4 kernel support.
Skipping Layers
By default, online quantization quantizes every linear layer in the transformer. However, --quantization-ignored-layers can be used to keep specific layers in their original precision.
Patterns are matched against each layer's prefix (for example, layers.0.attention.to_q). A layer is skipped and left unquantized if its prefix contains any of the given patterns.
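The substring rule described above can be sketched in a few lines. This is a minimal illustration rather than SGLang's implementation: the function name is_layer_ignored is hypothetical, and layers.0.feed_forward.w1 is a made-up layer name used only for contrast.

```python
def is_layer_ignored(prefix: str, patterns: list[str]) -> bool:
    """Return True if the layer prefix contains any ignore pattern.

    Plain substring containment, as described in the text; no globbing
    or regular expressions are assumed.
    """
    return any(pattern in prefix for pattern in patterns)


patterns = ["attention.to_"]
print(is_layer_ignored("layers.0.attention.to_q", patterns))   # True
print(is_layer_ignored("layers.0.feed_forward.w1", patterns))  # False
```

Because matching is by containment, a short pattern such as attention.to_ covers every layer whose prefix includes it, across all blocks.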
Validated ModelOpt Checkpoints
This section is the canonical support matrix for the nine diffusion ModelOpt checkpoints currently wired up in SGLang docs and validation coverage. Published checkpoints keep the serialized quantization config as quant_method=modelopt; the FP8 vs NVFP4 split below is a documentation label
derived from quant_algo.
Six of the nine repos live under lmsys/*. The Wan2.2 entries use NVIDIA’s
official full Diffusers repos, and the FLUX.2 NVFP4 entry keeps the official
black-forest-labs/FLUX.2-dev-NVFP4 repo.
| Quant Algo | Base Model | Preferred CLI | HF Repo | Current Scope | Notes |
|---|---|---|---|---|---|
| FP8 | black-forest-labs/FLUX.1-dev | --transformer-path | lmsys/flux1-dev-modelopt-fp8-sglang-transformer | single-transformer override, deterministic latent/image comparison, H100 benchmark, torch-profiler trace | SGLang converter keeps a validated BF16 fallback set for modulation and FF projection layers; use --model-id FLUX.1-dev for local mirrors |
| FP8 | black-forest-labs/FLUX.2-dev | --transformer-path | lmsys/flux2-dev-modelopt-fp8-sglang-transformer | single-transformer override load and generation path | published SGLang-ready transformer override |
| FP8 | Wan-AI/Wan2.2-T2V-A14B-Diffusers | --model-path | nvidia/Wan2.2-T2V-A14B-Diffusers-FP8 | full Diffusers repo with ModelOpt FP8 Wan2.2 components | validated through direct --model-path loading |
| FP8 | hunyuanvideo-community/HunyuanVideo | --transformer-path | lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer | single-transformer override, BF16-vs-FP8 video comparison, H100 benchmark, torch-profiler trace | HunyuanVideo uses different ModelOpt/diffusers and SGLang runtime module names; the converter maps those names before writing FP8 scale tensors and BF16 fallback ignores |
| FP8 | Qwen/Qwen-Image | --transformer-path | lmsys/qwen-image-modelopt-fp8-sglang-transformer | single-transformer override, BF16-vs-FP8 image comparison, H100 benchmark, torch-profiler trace | shares the Qwen Image FP8 fallback preset; keep img_in, txt_in, timestep embedder, norm_out.linear, proj_out, img_mod/txt_mod, and img_mlp.net.2 in BF16 |
| FP8 | Qwen/Qwen-Image-Edit-2511 | --transformer-path | lmsys/qwen-image-edit-modelopt-fp8-sglang-transformer | TI2I edit path, BF16-vs-FP8 image comparison, H100 benchmark | shares QwenImageTransformer2DModel with Qwen Image and uses the same Qwen Image FP8 fallback preset |
| NVFP4 | black-forest-labs/FLUX.1-dev | --transformer-path | lmsys/flux1-dev-modelopt-nvfp4-sglang-transformer | mixed BF16+NVFP4 transformer override, correctness validation, 4x RTX 5090 benchmark, torch-profiler trace | use build_modelopt_nvfp4_transformer.py; validated builder keeps selected FLUX.1 modules in BF16 and sets swap_weight_nibbles=false |
| NVFP4 | black-forest-labs/FLUX.2-dev | --transformer-weights-path | black-forest-labs/FLUX.2-dev-NVFP4 | packed-QKV load path | official raw export repo; validated packed export detection and runtime layout handling |
| NVFP4 | Wan-AI/Wan2.2-T2V-A14B-Diffusers | --model-path | nvidia/Wan2.2-T2V-A14B-Diffusers-NVFP4 | full Diffusers repo with ModelOpt NVFP4 Wan2.2 components | default FP4 GEMM backend is flashinfer_trtllm |
multimodal-gen-test-1-b200).
ModelOpt FP8
Usage Examples
Converted ModelOpt FP8 transformer repos should be loaded as transformer component overrides. If the repo or local directory already contains config.json, use --transformer-path. Full Diffusers repos such as the
NVIDIA Wan2.2 FP8 checkpoint can be passed directly with --model-path.
Notes
- --transformer-path is the canonical flag for converted ModelOpt FP8 transformer component repos or directories that already carry config.json.
- If the override repo or local directory contains its own config.json, SGLang reads the quantization config from that override instead of relying on the base model config.
- --transformer-weights-path still works when you intentionally point at raw weight files or a directory that should be metadata-probed as weights first.
- dit_layerwise_offload is supported for ModelOpt FP8 checkpoints.
- dit_cpu_offload stays disabled for ModelOpt FP8 checkpoints.
- The layerwise offload path now preserves the non-contiguous FP8 weight stride expected by the runtime FP8 GEMM path.
- On disk, the quantization config stays quant_method=modelopt with quant_algo=FP8; the modelopt-fp8 label in this document is a support family name, not a serialized config key.
- To build the converted checkpoint yourself from a ModelOpt diffusers export, use python -m sglang.multimodal_gen.tools.build_modelopt_fp8_transformer.
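How the documentation family label relates to the serialized config can be sketched as follows. This is an illustrative helper, not part of SGLang: the name family_label is hypothetical, and it assumes the quantization config has already been extracted as a flat dictionary with quant_method and quant_algo keys (where that dictionary sits inside config.json is not specified here).

```python
def family_label(quant_config: dict) -> str:
    """Map a serialized ModelOpt quantization config to a doc family label.

    Assumes a flat dict with "quant_method" and "quant_algo" keys; the
    label is a documentation name, never written back to disk.
    """
    if quant_config.get("quant_method") != "modelopt":
        raise ValueError("not a ModelOpt quantization config")
    algo = str(quant_config.get("quant_algo", "")).upper()
    labels = {"FP8": "modelopt-fp8", "NVFP4": "modelopt-nvfp4"}
    if algo not in labels:
        raise ValueError(f"unrecognized quant_algo: {algo!r}")
    return labels[algo]


print(family_label({"quant_method": "modelopt", "quant_algo": "FP8"}))
# modelopt-fp8
```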
ModelOpt NVFP4
Usage Examples
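For mixed ModelOpt NVFP4 transformer overrides that already contain config.json, keep the base model and quantized transformer separate and use --transformer-path.
For raw NVFP4 exports such as black-forest-labs/FLUX.2-dev-NVFP4, use --transformer-weights-path.
For full Diffusers repos such as nvidia/Wan2.2-T2V-A14B-Diffusers-NVFP4, use --model-path.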
Notes
- Use --transformer-path for mixed ModelOpt NVFP4 transformer repos or local directories that already include config.json.
- Use --transformer-weights-path for raw NVFP4 exports, individual safetensors files, or repo layouts that should be treated as weights first.
- For dual-transformer pipelines such as Wan2.2-T2V-A14B-Diffusers, the primary --transformer-path override targets only transformer. Use a per-component override such as --transformer-2-path only when you intentionally want a non-default transformer_2.
- On Blackwell, the diffusion ModelOpt NVFP4 path defaults to FlashInfer TensorRT-LLM FP4 GEMM (flashinfer_trtllm).
- Direct --model-path loading is a compatibility path for FLUX.2 NVFP4-style repos or local directories.
- If --transformer-weights-path is provided explicitly, it takes precedence over the compatibility --model-path flow.
- For local directories, SGLang first looks for *-mixed.safetensors, then falls back to loading from the directory.
- To force the diffusion ModelOpt FP4 path onto a different FlashInfer backend, set SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND. Supported values include flashinfer_cudnn, flashinfer_cutlass, and flashinfer_trtllm.
- On disk, the quantization config stays quant_method=modelopt with quant_algo=NVFP4; the modelopt-nvfp4 label here is again a documentation family name rather than a serialized config key.
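The local-directory lookup order mentioned in the notes can be approximated like this. It is a sketch under stated assumptions, not SGLang's resolver: the function name resolve_nvfp4_weights is hypothetical, and picking the alphabetically first match when several *-mixed.safetensors files exist is an assumption made only to keep the example deterministic.

```python
from pathlib import Path


def resolve_nvfp4_weights(directory: str) -> Path:
    """Prefer a *-mixed.safetensors file, else fall back to the directory.

    Illustrative only; tie-breaking between multiple matches is an
    assumption of this sketch.
    """
    root = Path(directory)
    mixed = sorted(root.glob("*-mixed.safetensors"))
    if mixed:
        return mixed[0]
    # No mixed file found: load from the directory itself.
    return root
```

When a directory holds several candidate checkpoints, passing an explicit file to --transformer-weights-path avoids relying on any lookup order.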
Nunchaku (SVDQuant)
Install
Install the nunchaku runtime dependency first.
File Naming and Auto-Detection
For Nunchaku checkpoints, --model-path should still point to the original
base model, while --transformer-weights-path points to the quantized
transformer weights.
If the basename of --transformer-weights-path contains the pattern
svdq-(int4|fp4)_r{rank}, SGLang will automatically:
- enable SVDQuant
- infer --quantization-precision
- infer --quantization-rank
| checkpoint name fragment | inferred precision | inferred rank | notes |
|---|---|---|---|
| svdq-int4_r32 | int4 | 32 | Standard INT4 checkpoint |
| svdq-int4_r128 | int4 | 128 | Higher-quality INT4 checkpoint |
| svdq-fp4_r32 | nvfp4 | 32 | fp4 in the filename maps to CLI value nvfp4 |
| svdq-fp4_r128 | nvfp4 | 128 | Higher-quality NVFP4 checkpoint |
| filename | precision | rank | typical use |
|---|---|---|---|
| svdq-int4_r32-qwen-image.safetensors | int4 | 32 | Balanced default |
| svdq-int4_r128-qwen-image.safetensors | int4 | 128 | Quality-focused |
| svdq-fp4_r32-qwen-image.safetensors | nvfp4 | 32 | RTX 50-series / NVFP4 path |
| svdq-fp4_r128-qwen-image.safetensors | nvfp4 | 128 | Quality-focused NVFP4 |
| svdq-int4_r32-qwen-image-lightningv1.0-4steps.safetensors | int4 | 32 | Lightning 4-step |
| svdq-int4_r128-qwen-image-lightningv1.1-8steps.safetensors | int4 | 128 | Lightning 8-step |
If the filename does not match this pattern, pass --enable-svdquant, --quantization-precision, and --quantization-rank
explicitly.
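The filename-based detection can be mimicked with a short regular expression. This is a sketch of the naming rule only, not SGLang's detection code: the function name infer_svdq_settings and the /models/ path are hypothetical.

```python
import os
import re

# Pattern from the text: svdq-(int4|fp4)_r{rank}, searched in the basename.
_SVDQ_PATTERN = re.compile(r"svdq-(int4|fp4)_r(\d+)")


def infer_svdq_settings(weights_path: str):
    """Return (precision, rank) inferred from the basename, or None.

    "fp4" in the filename maps to the CLI value "nvfp4".
    """
    match = _SVDQ_PATTERN.search(os.path.basename(weights_path))
    if match is None:
        return None
    precision = "nvfp4" if match.group(1) == "fp4" else "int4"
    return precision, int(match.group(2))


print(infer_svdq_settings("/models/svdq-fp4_r32-qwen-image.safetensors"))
# ('nvfp4', 32)
print(infer_svdq_settings("/models/qwen-image.safetensors"))
# None
```

A None result corresponds to the case where the precision and rank have to be supplied on the command line.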
Usage Examples
Recommended auto-detected flow: point --model-path at the base model and --transformer-weights-path at a Nunchaku checkpoint whose filename matches the pattern above.
Notes
- --transformer-weights-path is the canonical flag for Nunchaku checkpoints. Older config names such as quantized_model_path are treated as compatibility aliases.
- Auto-detection only happens when the checkpoint basename matches svdq-(int4|fp4)_r{rank}.
- The CLI values are int4 and nvfp4. In filenames, the NVFP4 variant is written as fp4.
- Lightning checkpoints usually expect matching --num-inference-steps, such as 4 or 8.
- Current runtime validation only allows Nunchaku on NVIDIA CUDA Ampere (SM8x) or SM12x GPUs. Hopper (SM90) is currently rejected.
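The GPU restriction in the last note amounts to a check on the CUDA compute capability. The sketch below is an assumption about how such a rule could be expressed, not the actual validation code; the function name nunchaku_supported is hypothetical. The (major, minor) pair could come from torch.cuda.get_device_capability(), which is deliberately not called here so the example runs without a GPU.

```python
def nunchaku_supported(major: int, minor: int) -> bool:
    """Return True for SM8x and SM12x, the families named in the text.

    The minor version is accepted for completeness but does not affect
    the result in this sketch.
    """
    return major in (8, 12)


print(nunchaku_supported(8, 0))   # True  (SM80)
print(nunchaku_supported(9, 0))   # False (SM90, Hopper)
print(nunchaku_supported(12, 0))  # True  (SM120)
```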
ModelSlim
MindStudio-ModelSlim (msModelSlim) is an offline model quantization and compression tool developed by MindStudio and optimized for Ascend hardware.
- Installation
- Multimodal_sd quantization
  Download the original floating-point weights of the model. Taking Wan2.2-T2V-A14B as an example, you can go to Wan2.2-T2V-A14B to obtain the original model weights. Then install the other model-specific dependencies (refer to the modelscope model card).
  Note: You can find pre-quantized validated models on modelscope/Eco-Tech.
  Run quantization using one-click quantization (recommended).
  For more detailed quantization examples and information about model support, see the examples section in the ModelSlim repo.
  Note: SGLang does not support quantized embeddings; disable this option when quantizing with msmodelslim.
- Auto-Detection and different formats
  For msmodelslim checkpoints, it is enough to specify only --model-path. Quantization is detected automatically for each layer by parsing the quant_model_description.json config.
  For Wan2.2, only the Diffusers weight storage format is supported, whereas modelslim saves the quantized model in the original Wan2.2 format. For conversion, use the one-step wan_repack.py script.
  Supported --model-type values: Wan2.2-TI2V-5B (single-transformer), Wan2.2-T2V-A14B and Wan2.2-I2V-A14B (cascade dual-transformer). The script automatically handles copying the base model, converting quantized weights to Diffusers format, and restoring config.json.
- Usage Example
  With the auto-detected flow, pass only --model-path.
- Available Quantization Methods:
  - W4A4_DYNAMIC: linear with online quantization of activations
  - W8A8: linear with offline quantization of activations
  - W8A8_DYNAMIC: linear with online quantization of activations
  - W8A8_MXFP8: linear with offline quantization (msmodelslim pre-quantized weights)
  - mxfp8: linear with online quantization (--quantization mxfp8)
  - W4A4_MXFP4 / W4A4_MXFP4_DUALSCALE: linear with offline quantization (msmodelslim pre-quantized weights)
  - mxfp4_npu: linear with online quantization (--quantization mxfp4_npu)
MXFP8 Online Quantization
For online MXFP8 quantization, load the original FP16/BF16 model and add --quantization mxfp8.
Weights are quantized at load time via npu_dynamic_mx_quant, and activations are quantized per-token
during inference with npu_quant_matmul (block_size=32).
Hardware requirement: Ascend A5 series or newer. npu_dynamic_mx_quant is not available on A2/A3.
MXFP8 Offline Quantization (msmodelslim)
Pre-quantized MXFP8 weights exported by msmodelslim are auto-detected via quant_model_description.json
(W8A8_MXFP8 scheme). Use wan_repack.py to convert the quantized weights to Diffusers format,
then load the converted model with --model-path:
MXFP4 Online Quantization
For online MXFP4 quantization on Ascend NPU, load the original FP16/BF16 model and add --quantization mxfp4_npu. The mxfp4_npu key is used for Ascend because mxfp4
is reserved for the ROCm/aiter backend.
Weights are quantized at load time via npu_dynamic_dual_level_mx_quant, and activations
are quantized per-token during inference before npu_dual_level_quant_matmul. MXFP4 uses
dual-level block scales with an L1 block size of 32 and an L0 block size of 512.
Hardware requirement: Ascend A5 series or newer. npu_dynamic_dual_level_mx_quant and npu_dual_level_quant_matmul are not available on A2/A3.
Note: Online MXFP4 weight quantization is experimental. The offline msmodelslim flow uses pre-quantized weights and may produce different numerical results.
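To get a feel for the two block sizes, the following arithmetic sketch counts how many blocks of each level cover a run of elements. It assumes, purely for illustration, that blocks tile one contiguous dimension and that each block carries one scale; the actual tensor layout used by the NPU kernels is not described in this document, and the function name block_counts is hypothetical.

```python
import math

L1_BLOCK_SIZE = 32   # L1 block size from the text
L0_BLOCK_SIZE = 512  # L0 block size from the text


def block_counts(num_elements: int) -> tuple[int, int]:
    """Return (number of L1 blocks, number of L0 blocks) for one dimension.

    Assumes simple tiling with a partial final block rounded up.
    """
    l1_blocks = math.ceil(num_elements / L1_BLOCK_SIZE)
    l0_blocks = math.ceil(num_elements / L0_BLOCK_SIZE)
    return l1_blocks, l0_blocks


print(block_counts(5120))  # (160, 10)
```

Under these assumptions each L0 block spans 512 / 32 = 16 L1 blocks.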
MXFP4 Offline Quantization (msmodelslim)
Pre-quantized MXFP4 weights exported by msmodelslim are auto-detected via quant_model_description.json (W4A4_MXFP4 / W4A4_MXFP4_DUALSCALE scheme).
Use wan_repack.py to convert the quantized weights to Diffusers format, then load
the converted model with --model-path:
weight_scale, weight_dual_scale). If exported with smooth quantization,
mul_scale is loaded and applied before activation quantization to keep activations
aligned with the calibrated weights.