Quantization - SGLang Documentation

SGLang-Diffusion supports quantized transformer checkpoints. In most cases, keep the base model and the quantized transformer override separate.

Quick Reference

Use these paths:

--model-path: the base or original model
--transformer-path: a quantized transformers-style transformer component directory that already contains its own config.json
--transformer-weights-path: quantized transformer weights provided as a single safetensors file, a sharded safetensors directory, a local path, or a Hugging Face repo ID
--quantization: apply online quantization to unquantized models at load time (activations are quantized dynamically)
--quantization-ignored-layers: layer name patterns to keep unquantized (e.g. attention.to_)

Recommended example for pre-quantized checkpoints:

sglang generate \
  --model-path black-forest-labs/FLUX.2-dev \
  --transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
  --prompt "a curious pikachu"

For quantized transformers-style transformer component folders:

sglang generate \
  --model-path /path/to/base-model \
  --transformer-path /path/to/quantized-transformer \
  --prompt "A Logo With Bold Large Text: SGL Diffusion"

NOTE: Some model-specific integrations also accept a quantized repo or local directory directly as --model-path, but that is a compatibility path. If a repo contains multiple candidate checkpoints, pass --transformer-weights-path explicitly.
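The choice between the two override flags can be sketched as a small helper for local paths. This is an illustrative heuristic only: pick_override_flag is a hypothetical name, not part of SGLang, and it assumes that the presence of config.json in a local directory is what marks a component folder. Hugging Face repo IDs would need to be inspected separately.

```python
from pathlib import Path


def pick_override_flag(override_path):
    """Suggest a CLI flag for a local quantized transformer override.

    Hypothetical helper (not an SGLang API). Assumption: a directory that
    carries its own config.json is a component folder; anything else is
    treated as raw weights.
    """
    path = Path(override_path)
    if path.is_dir() and (path / "config.json").is_file():
        return "--transformer-path"
    # Single safetensors files and directories without config.json
    # fall through to the weights-path flow.
    return "--transformer-weights-path"
```

A directory containing config.json maps to --transformer-path; a lone .safetensors file or a directory without config.json maps to --transformer-weights-path.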

Quant Families

Here, quant_family means a checkpoint and loading family with shared CLI usage and loader behavior. It is not just the numeric precision or a kernel backend.

quant_family	checkpoint form	canonical CLI	supported models	extra dependency	platform / notes
`fp8` / `mxfp4` (online quantization)	Unquantized checkpoint (offline via AMD Quark coming soon)	`--quantization {fp8,mxfp4}`	Z-Image-Turbo (validated), others likely work. More support coming soon.	MXFP4: `aiter` on ROCm	MXFP4 requires ROCm and MI350+ (gfx95x). Weights quantized at load time, activations quantized to `fp8` / `mxfp4` dynamically.
`fp8` (offline quantization)	Quantized transformer component folder, or safetensors with `quantization_config` metadata	`--transformer-path` or `--transformer-weights-path`	ALL	None	Component-folder and single-file flows are both supported
`modelopt-fp8`	Converted ModelOpt FP8 transformer directory or repo with `config.json`	`--transformer-path`	FLUX.1, FLUX.2, Wan2.2, HunyuanVideo, Qwen Image, Qwen Image Edit	None	Serialized config stays `quant_method=modelopt` with `quant_algo=FP8`; `dit_layerwise_offload` is supported and `dit_cpu_offload` stays disabled
`modelopt-nvfp4`	Mixed transformer directory/repo with `config.json`, raw NVFP4 safetensors export/repo, or full ModelOpt Diffusers repo	`--transformer-path` for mixed overrides; `--transformer-weights-path` for raw exports; `--model-path` for full repos	FLUX.1, FLUX.2, Wan2.2, Qwen Image, Qwen Image 2512, Qwen Image Edit, Qwen Image Edit 2511	None	Mixed override repos keep the base model separate; full Qwen Image exports can be loaded directly as `--model-path`; raw exports such as `black-forest-labs/FLUX.2-dev-NVFP4` still use the weights-path flow
`nunchaku-svdq`	Pre-quantized Nunchaku transformer weights, usually named `svdq-{int4|fp4}_r{rank}-…`	`--transformer-weights-path`	Model-specific support such as Qwen-Image, FLUX, and Z-Image	`nunchaku`	SGLang can infer precision and rank from the filename and supports both `int4` and `nvfp4`
`msmodelslim`	Pre-quantized msmodelslim transformer weights	`--model-path`	Wan2.2 family	None	Currently only compatible with the Ascend NPU family and supports `mxfp8`, `mxfp4`, `w8a8`, and `w4a4`

Online Quantization

Online quantization applies quantization to unquantized models at load time. This is useful when pre-quantized checkpoints are not available.

FP8 Online Quantization

Apply FP8 quantization to any unquantized model:

sglang generate \
  --model-path Tongyi-MAI/Z-Image-Turbo \
  --quantization fp8 \
  --prompt "a beautiful sunset" \
  --save-output

MXFP4 Online Quantization

MXFP4 provides aggressive 4-bit compression with online quantization. Note: Requires ROCm and MI350+ (gfx95x) GPU.

sglang generate \
  --model-path Tongyi-MAI/Z-Image-Turbo \
  --quantization mxfp4 \
  --prompt "a beautiful sunset" \
  --save-output

Note: Requires the aiter package with MXFP4 kernel support.

Skipping Layers

By default, online quantization quantizes every linear layer in the transformer. However, --quantization-ignored-layers can be used to keep specific layers in their original precision:

sglang generate \
  --model-path Tongyi-MAI/Z-Image-Turbo \
  --quantization fp8 \
  --quantization-ignored-layers attention.to_ \
  --prompt "a beautiful sunset" \
  --save-output

sglang generate \
  --model-path Tongyi-MAI/Z-Image-Turbo \
  --quantization mxfp4 \
  --quantization-ignored-layers attention.to_ \
  --prompt "a beautiful sunset" \
  --save-output

Each pattern is matched against the full layer prefix (e.g. layers.0.attention.to_q). A layer is skipped and left unquantized if its prefix contains any of the given patterns.
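The matching rule described above amounts to a substring check. The following is a minimal sketch of that rule, not SGLang's actual implementation: is_ignored is a hypothetical name, and the feed_forward.w1 layer name in the usage note is made up for illustration.

```python
def is_ignored(layer_prefix, ignored_patterns):
    """Return True if the layer should stay unquantized.

    Hypothetical helper illustrating the documented rule: a layer is
    skipped when its full prefix contains any of the given patterns.
    """
    return any(pattern in layer_prefix for pattern in ignored_patterns)
```

With the pattern attention.to_, the prefix layers.0.attention.to_q is skipped, while a prefix such as layers.0.feed_forward.w1 would still be quantized.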

Validated ModelOpt Checkpoints

This section is the canonical support matrix for the thirteen published diffusion ModelOpt checkpoints currently wired up in SGLang docs and validation coverage. Published checkpoints keep the serialized quantization config as quant_method=modelopt; the FP8 vs NVFP4 split below is a documentation label derived from quant_algo. Twelve of the thirteen repos live under lmsys/*. The FLUX.2 NVFP4 entry keeps the official black-forest-labs/FLUX.2-dev-NVFP4 repo.

Quant Algo	Base Model	Preferred CLI	HF Repo	Current Scope	Notes
`FP8`	`black-forest-labs/FLUX.1-dev`	`--transformer-path`	`lmsys/flux1-dev-modelopt-fp8-sglang-transformer`	single-transformer override, deterministic latent/image comparison, H100 benchmark, torch-profiler trace	SGLang converter keeps a validated BF16 fallback set for modulation and FF projection layers; use `--model-id FLUX.1-dev` for local mirrors
`FP8`	`black-forest-labs/FLUX.2-dev`	`--transformer-path`	`lmsys/flux2-dev-modelopt-fp8-sglang-transformer`	single-transformer override load and generation path	published SGLang-ready transformer override
`FP8`	`Wan-AI/Wan2.2-T2V-A14B-Diffusers`	`--transformer-path`	`lmsys/wan22-t2v-a14b-modelopt-fp8-sglang-transformer`	primary `transformer` quantized, `transformer_2` kept BF16	primary-transformer-only path; keep `transformer_2` on the base checkpoint, and do not describe this as dual-transformer full-model FP8 unless that path is validated separately
`FP8`	`hunyuanvideo-community/HunyuanVideo`	`--transformer-path`	`lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer`	single-transformer override, BF16-vs-FP8 video comparison, H100 benchmark, torch-profiler trace	HunyuanVideo uses different ModelOpt/diffusers and SGLang runtime module names; the converter maps those names before writing FP8 scale tensors and BF16 fallback ignores
`FP8`	`Qwen/Qwen-Image`	`--transformer-path`	`lmsys/qwen-image-modelopt-fp8-sglang-transformer`	single-transformer override, BF16-vs-FP8 image comparison, H100 benchmark, torch-profiler trace	shares the Qwen Image FP8 fallback preset; keep `img_in`, `txt_in`, timestep embedder, `norm_out.linear`, `proj_out`, `img_mod`/`txt_mod`, and `img_mlp.net.2` in BF16
`FP8`	`Qwen/Qwen-Image-Edit-2511`	`--transformer-path`	`lmsys/qwen-image-edit-modelopt-fp8-sglang-transformer`	TI2I edit path, BF16-vs-FP8 image comparison, H100 benchmark	shares `QwenImageTransformer2DModel` with Qwen Image and uses the same Qwen Image FP8 fallback preset
`NVFP4`	`black-forest-labs/FLUX.1-dev`	`--transformer-path`	`lmsys/flux1-dev-modelopt-nvfp4-sglang-transformer`	mixed BF16+NVFP4 transformer override, correctness validation, 4x RTX 5090 benchmark, torch-profiler trace	use `build_modelopt_nvfp4_transformer.py`; validated builder keeps selected FLUX.1 modules in BF16 and sets `swap_weight_nibbles=false`
`NVFP4`	`black-forest-labs/FLUX.2-dev`	`--transformer-weights-path`	`black-forest-labs/FLUX.2-dev-NVFP4`	packed-QKV load path	official raw export repo; validated packed export detection and runtime layout handling
`NVFP4`	`Wan-AI/Wan2.2-T2V-A14B-Diffusers`	`--transformer-path`	`lmsys/wan22-t2v-a14b-modelopt-nvfp4-sglang-transformer`	primary `transformer` quantized with ModelOpt NVFP4, `transformer_2` kept BF16	primary-transformer-only path; keep `transformer_2` on the base checkpoint; the default FP4 GEMM backend is `flashinfer_trtllm`
`NVFP4`	`Qwen/Qwen-Image`	`--model-path`	`lmsys/qwen-image-modelopt-nvfp4-sglang`	full ModelOpt NVFP4 Diffusers repo, BF16-vs-NVFP4 B200 image comparison	full repo loaded directly; exported with ModelOpt PR #1706 SVDQuant NVFP4 (`--format fp4`, max calibration, block size 16) and BF16 fallbacks for attention-sensitive modules plus first/last transformer blocks
`NVFP4`	`Qwen/Qwen-Image-2512`	`--model-path`	`lmsys/qwen-image-2512-modelopt-nvfp4-sglang`	full ModelOpt NVFP4 Diffusers repo, BF16-vs-NVFP4 B200 image comparison, B200 CI case	same full-repo loader path as Qwen Image; this is the Qwen Image NVFP4 representative in `multimodal-gen-test-1-b200`
`NVFP4`	`Qwen/Qwen-Image-Edit`	`--model-path`	`lmsys/qwen-image-edit-modelopt-nvfp4-sglang`	TI2I edit full ModelOpt NVFP4 Diffusers repo, BF16-vs-NVFP4 B200 image comparison	full repo loaded directly with normal image-edit inputs; exported with the same ModelOpt PR #1706 NVFP4 recipe
`NVFP4`	`Qwen/Qwen-Image-Edit-2511`	`--model-path`	`lmsys/qwen-image-edit-2511-modelopt-nvfp4-sglang`	TI2I edit full ModelOpt NVFP4 Diffusers repo, BF16-vs-NVFP4 B200 image comparison	full repo loaded directly with normal image-edit inputs; exported with the same ModelOpt PR #1706 NVFP4 recipe

These thirteen checkpoints are the intended ModelOpt documentation support set. The B200 diffusion CI job (multimodal-gen-test-1-b200) uses a representative NVFP4 subset and includes lmsys/qwen-image-2512-modelopt-nvfp4-sglang for Qwen Image coverage.

ModelOpt FP8

Usage Examples

Converted ModelOpt FP8 transformer repos should be loaded as transformer component overrides. If the repo or local directory already contains config.json, use --transformer-path. Full Diffusers repos such as the NVIDIA Wan2.2 FP8 checkpoint can be passed directly with --model-path.

sglang generate \
  --model-path black-forest-labs/FLUX.2-dev \
  --transformer-path lmsys/flux2-dev-modelopt-fp8-sglang-transformer \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --save-output

sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --transformer-path lmsys/wan22-t2v-a14b-modelopt-fp8-sglang-transformer \
  --prompt "a fox walking through neon rain" \
  --save-output

sglang generate \
  --model-path hunyuanvideo-community/HunyuanVideo \
  --transformer-path lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer \
  --height 544 --width 960 --num-frames 17 \
  --prompt "A cinematic shot of a red sports car driving through rain at night" \
  --save-output

sglang generate \
  --model-path Qwen/Qwen-Image \
  --transformer-path lmsys/qwen-image-modelopt-fp8-sglang-transformer \
  --prompt "A tiny astronaut reading a book under a glass greenhouse" \
  --save-output

sglang generate \
  --model-path Qwen/Qwen-Image-Edit-2511 \
  --transformer-path lmsys/qwen-image-edit-modelopt-fp8-sglang-transformer \
  --image-path /path/to/input.png \
  --prompt "Turn the scene into a warm watercolor illustration" \
  --save-output

Notes

--transformer-path is the canonical flag for converted ModelOpt FP8 transformer component repos or directories that already carry config.json.
If the override repo or local directory contains its own config.json, SGLang reads the quantization config from that override instead of relying on the base model config.
--transformer-weights-path still works when you intentionally point at raw weight files or a directory that should be metadata-probed as weights first.
dit_layerwise_offload is supported for ModelOpt FP8 checkpoints.
dit_cpu_offload remains disabled for ModelOpt FP8 checkpoints.
The layerwise offload path now preserves the non-contiguous FP8 weight stride expected by the runtime FP8 GEMM path.
On disk, the quantization config stays quant_method=modelopt with quant_algo=FP8; the modelopt-fp8 label in this document is a support family name, not a serialized config key.
To build the converted checkpoint yourself from a ModelOpt diffusers export, use python -m sglang.multimodal_gen.tools.build_modelopt_fp8_transformer.

ModelOpt NVFP4

Usage Examples

For mixed ModelOpt NVFP4 transformer overrides that already contain config.json, keep the base model and quantized transformer separate and use --transformer-path:

sglang generate \
  --model-path black-forest-labs/FLUX.1-dev \
  --transformer-path lmsys/flux1-dev-modelopt-nvfp4-sglang-transformer \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --save-output

For raw NVFP4 exports such as the official FLUX.2 release, use --transformer-weights-path:

sglang generate \
  --model-path black-forest-labs/FLUX.2-dev \
  --transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --save-output

SGLang also supports passing the NVFP4 repo or local directory directly as --model-path:

sglang generate \
  --model-path black-forest-labs/FLUX.2-dev-NVFP4 \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --save-output

For a dual-transformer Wan2.2 export where only the primary transformer was quantized:

sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --transformer-path lmsys/wan22-t2v-a14b-modelopt-nvfp4-sglang-transformer \
  --prompt "a fox walking through neon rain" \
  --save-output

For full Qwen Image NVFP4 exports, load the published repo directly:

sglang generate \
  --model-path lmsys/qwen-image-2512-modelopt-nvfp4-sglang \
  --prompt "A tiny astronaut reading a book under a glass greenhouse" \
  --save-output

For high-resolution Qwen-Image-family generations on B200, the FlashInfer CUTLASS FP4 GEMM backend can be faster than the default TensorRT-LLM backend:

SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cutlass \
sglang generate \
  --model-path lmsys/qwen-image-2512-modelopt-nvfp4-sglang \
  --width 2048 --height 2048 \
  --prompt "A tiny astronaut reading a book under a glass greenhouse" \
  --save-output

Notes

Use --transformer-path for mixed ModelOpt NVFP4 transformer repos or local directories that already include config.json.
Use --transformer-weights-path for raw NVFP4 exports, individual safetensors files, or repo layouts that should be treated as weights first.
For dual-transformer pipelines such as Wan2.2-T2V-A14B-Diffusers, the primary --transformer-path override targets only transformer. Use a per-component override such as --transformer-2-path only when you intentionally want a non-default transformer_2.
On Blackwell, the diffusion ModelOpt NVFP4 path defaults to FlashInfer TensorRT-LLM FP4 GEMM (flashinfer_trtllm).
The published Qwen Image NVFP4 exports keep the img_mod/txt_mod modulation projections and first/last transformer blocks in BF16.
Qwen-Image NVFP4 does not always improve latency at 1024x1024. On B200, the validated ModelOpt exports were faster than BF16 at 2048x2048 with SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cutlass, while BF16 remained faster at 1024x1024.
Direct --model-path loading is the canonical path for full Qwen Image ModelOpt NVFP4 repos and a compatibility path for FLUX.2 NVFP4-style repos or local directories.
If --transformer-weights-path is provided explicitly, it takes precedence over the compatibility --model-path flow.
For local directories, SGLang first looks for *-mixed.safetensors, then falls back to loading from the directory.
To force the diffusion ModelOpt FP4 path onto a different FlashInfer backend, set SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND. Supported values include flashinfer_cudnn, flashinfer_cutlass, and flashinfer_trtllm.
On disk, the quantization config stays quant_method=modelopt with quant_algo=NVFP4; the modelopt-nvfp4 label here is again a documentation family name rather than a serialized config key.
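The local-directory lookup order mentioned in the notes (first *-mixed.safetensors, then the directory itself) can be sketched as follows. This is an illustration under stated assumptions, not SGLang's loader: resolve_local_nvfp4_source is a hypothetical name, and picking the alphabetically first match when several files match is an assumption the documentation does not confirm.

```python
from pathlib import Path


def resolve_local_nvfp4_source(directory):
    """Pick what to load from a local NVFP4 directory.

    Hypothetical helper mirroring the documented order: prefer a
    *-mixed.safetensors file, otherwise fall back to the directory.
    """
    root = Path(directory)
    # Assumption: with several matches, take the alphabetically first one.
    mixed_files = sorted(root.glob("*-mixed.safetensors"))
    if mixed_files:
        return mixed_files[0]
    return root
```

The helper returns a file path when a mixed export is present and the directory path otherwise, so a caller can pass the result on as the weights source.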

Nunchaku (SVDQuant)

Install

Install the runtime dependency first:

pip install nunchaku

For platform-specific installation methods and troubleshooting, see the Nunchaku installation guide.

File Naming and Auto-Detection

For Nunchaku checkpoints, --model-path should still point to the original base model, while --transformer-weights-path points to the quantized transformer weights. If the basename of --transformer-weights-path contains the pattern svdq-(int4|fp4)_r{rank}, SGLang will automatically:

enable SVDQuant
infer --quantization-precision
infer --quantization-rank

Examples:

checkpoint name fragment	inferred precision	inferred rank	notes
`svdq-int4_r32`	`int4`	`32`	Standard INT4 checkpoint
`svdq-int4_r128`	`int4`	`128`	Higher-quality INT4 checkpoint
`svdq-fp4_r32`	`nvfp4`	`32`	`fp4` in the filename maps to CLI value `nvfp4`
`svdq-fp4_r128`	`nvfp4`	`128`	Higher-quality NVFP4 checkpoint

Common filenames:

filename	precision	rank	typical use
`svdq-int4_r32-qwen-image.safetensors`	`int4`	`32`	Balanced default
`svdq-int4_r128-qwen-image.safetensors`	`int4`	`128`	Quality-focused
`svdq-fp4_r32-qwen-image.safetensors`	`nvfp4`	`32`	RTX 50-series / NVFP4 path
`svdq-fp4_r128-qwen-image.safetensors`	`nvfp4`	`128`	Quality-focused NVFP4
`svdq-int4_r32-qwen-image-lightningv1.0-4steps.safetensors`	`int4`	`32`	Lightning 4-step
`svdq-int4_r128-qwen-image-lightningv1.1-8steps.safetensors`	`int4`	`128`	Lightning 8-step

If your checkpoint name does not follow this convention, pass --enable-svdquant, --quantization-precision, and --quantization-rank explicitly.
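The naming convention above can be expressed as a regular expression over the checkpoint basename. The sketch below only illustrates the documented pattern and the fp4 to nvfp4 mapping; infer_svdquant_settings is a hypothetical name, and SGLang's own detection code may differ in detail.

```python
import os
import re

# Pattern taken from the naming convention: svdq-(int4|fp4)_r{rank}
_SVDQ_PATTERN = re.compile(r"svdq-(int4|fp4)_r(\d+)")


def infer_svdquant_settings(weights_path):
    """Return (precision, rank) inferred from the basename, or None.

    Hypothetical helper: "fp4" in the filename maps to the CLI value
    "nvfp4"; "int4" is passed through unchanged.
    """
    match = _SVDQ_PATTERN.search(os.path.basename(weights_path))
    if match is None:
        return None  # caller must pass the settings explicitly
    precision = "nvfp4" if match.group(1) == "fp4" else "int4"
    return precision, int(match.group(2))
```

A None result corresponds to the case where --enable-svdquant, --quantization-precision, and --quantization-rank have to be given by hand.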

Usage Examples

Recommended auto-detected flow:

sglang generate \
  --model-path Qwen/Qwen-Image \
  --transformer-weights-path /path/to/svdq-int4_r32-qwen-image.safetensors \
  --prompt "a beautiful sunset" \
  --save-output

Manual override when the filename does not encode the quant settings:

sglang generate \
  --model-path Qwen/Qwen-Image \
  --transformer-weights-path /path/to/custom_nunchaku_checkpoint.safetensors \
  --enable-svdquant \
  --quantization-precision int4 \
  --quantization-rank 128 \
  --prompt "a beautiful sunset" \
  --save-output

Notes

--transformer-weights-path is the canonical flag for Nunchaku checkpoints. Older config names such as quantized_model_path are treated as compatibility aliases.
Auto-detection only happens when the checkpoint basename matches svdq-(int4|fp4)_r{rank}.
The CLI values are int4 and nvfp4. In filenames, the NVFP4 variant is written as fp4.
Lightning checkpoints usually expect matching --num-inference-steps, such as 4 or 8.
Current runtime validation only allows Nunchaku on NVIDIA CUDA Ampere (SM8x) or SM12x GPUs. Hopper (SM90) is currently rejected.
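For Lightning checkpoints, the step count is encoded in the common filenames listed earlier (for example 4steps or 8steps). As a convenience, a script could read it from the name before setting --num-inference-steps. This is a hypothetical helper based only on those filenames; SGLang is not documented to do this automatically.

```python
import re


def lightning_steps(filename):
    """Extract the step count from a Lightning checkpoint filename.

    Hypothetical helper: expects names shaped like
    ...-lightningv1.0-4steps.safetensors and returns None otherwise.
    """
    match = re.search(r"lightning[^-]*-(\d+)steps", filename)
    return int(match.group(1)) if match else None
```

A None result means the name carries no Lightning step hint, and the default number of inference steps applies.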

ModelSlim

MindStudio-ModelSlim (msModelSlim) is an offline model quantization and compression tool from MindStudio, optimized for Ascend hardware.

Installation

# Clone repo and install msmodelslim:
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh

Multimodal_sd quantization

Download the original floating-point weights of the model. Taking Wan2.2-T2V-A14B as an example, obtain the original weights from the Wan2.2-T2V-A14B model page, then install the model's other dependencies (see the ModelScope model card).

Note: Pre-quantized, validated models are available on modelscope/Eco-Tech.

Run quantization using one-click quantization (recommended):

msmodelslim quant \
  --model_path /path/to/wan2_2_float_weights \
  --save_path /path/to/wan2_2_quantized_weights \
  --device npu \
  --model_type Wan2_2 \
  --quant_type w8a8 \
  --trust_remote_code True

For more detailed quantization examples and information about model support, see the examples section in the msModelSlim repo.

Note: SGLang does not support quantized embeddings; disable this option when quantizing with msmodelslim.

Auto-Detection and different formats

For msmodelslim checkpoints, specifying --model-path is enough: quantization is detected automatically for each layer by parsing the quant_model_description.json config. For Wan2.2, only the Diffusers weight storage format is supported, whereas msmodelslim saves the quantized model in the original Wan2.2 format. To convert, use the one-step wan_repack.py script:

python wan_repack.py \
  --model-type Wan2.2-TI2V-5B \
  --original-model-path {path_to_original_diffusers_model} \
  --quant-path {path_to_quantized_model} \
  --output-path {path_to_converted_model}

Supported --model-type values: Wan2.2-TI2V-5B (single-transformer), Wan2.2-T2V-A14B and Wan2.2-I2V-A14B (cascade dual-transformer). The script automatically copies the base model, converts the quantized weights to Diffusers format, and restores config.json.

Usage Example

With the auto-detected flow:

sglang generate \
  --model-path Eco-Tech/Wan2.2-T2V-A14B-Diffusers-w8a8 \
  --prompt "a beautiful sunset" \
  --save-output

Available Quantization Methods:
- W4A4_DYNAMIC linear with online quantization of activations
- W8A8 linear with offline quantization of activations
- W8A8_DYNAMIC linear with online quantization of activations
- W8A8_MXFP8 linear with offline quantization (msmodelslim pre-quantized weights)
- mxfp8 linear with online quantization (--quantization mxfp8)
- W4A4_MXFP4 / W4A4_MXFP4_DUALSCALE linear with offline quantization (msmodelslim pre-quantized weights)
- mxfp4_npu linear with online quantization (--quantization mxfp4_npu)

MXFP8 Online Quantization

For online MXFP8 quantization, load the original FP16/BF16 model and add --quantization mxfp8. Weights are quantized at load time via npu_dynamic_mx_quant, and activations are quantized per-token during inference with npu_quant_matmul (block_size=32).

sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --quantization mxfp8 \
  --prompt "a fox walking through neon rain" \
  --save-output

Hardware requirement: Ascend A5 series or newer. npu_dynamic_mx_quant is not available on A2/A3.
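To give a feel for what block-wise MX scaling with block_size=32 means, here is a pure-Python sketch of a shared power-of-two scale per block of 32 values. It follows the general microscaling convention (shared exponent derived from the block maximum, elements saturated to the FP8 E4M3 range) as an assumption; it is not the npu_dynamic_mx_quant kernel, it skips the actual FP8 rounding step, and all names in it are hypothetical.

```python
import math

BLOCK_SIZE = 32      # block size stated for the MXFP8 path
E4M3_EMAX = 8        # exponent of the largest E4M3 normal (448 = 1.75 * 2**8)
E4M3_MAX = 448.0     # largest finite FP8 E4M3 magnitude


def block_scales(values, block_size=BLOCK_SIZE):
    """Compute one power-of-two scale per block (assumed MX-style rule)."""
    scales = []
    for start in range(0, len(values), block_size):
        amax = max(abs(v) for v in values[start:start + block_size])
        if amax == 0.0:
            scales.append(1.0)  # all-zero block: any scale works
            continue
        # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1
        floor_log2 = math.frexp(amax)[1] - 1
        scales.append(2.0 ** (floor_log2 - E4M3_EMAX))
    return scales


def scale_and_clamp(values, block_size=BLOCK_SIZE):
    """Divide each value by its block scale and saturate to the E4M3 range."""
    scales = block_scales(values, block_size)
    scaled = [
        max(-E4M3_MAX, min(E4M3_MAX, v / scales[i // block_size]))
        for i, v in enumerate(values)
    ]
    return scaled, scales
```

Because every block gets its own scale, a block with small values keeps more usable precision than it would under a single per-tensor scale; a real implementation would additionally round each scaled element to the nearest representable FP8 value.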

MXFP8 Offline Quantization (msmodelslim)

Pre-quantized MXFP8 weights exported by msmodelslim are auto-detected via quant_model_description.json (W8A8_MXFP8 scheme). Use wan_repack.py to convert the quantized weights to Diffusers format, then load the converted model with --model-path:

sglang generate \
  --model-path Eco-Tech/Wan2.2-T2V-A14B-Diffusers-mxfp8 \
  --prompt "a beautiful sunset" \
  --save-output

MXFP4 Online Quantization

For online MXFP4 quantization on Ascend NPU, load the original FP16/BF16 model and add --quantization mxfp4_npu. The mxfp4_npu key is used for Ascend because mxfp4 is reserved for the ROCm/aiter backend. Weights are quantized at load time via npu_dynamic_dual_level_mx_quant, and activations are quantized per-token during inference before npu_dual_level_quant_matmul. MXFP4 uses dual-level block scales with an L1 block size of 32 and an L0 block size of 512.

sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --quantization mxfp4_npu \
  --prompt "a fox walking through neon rain" \
  --save-output

Hardware requirement: Ascend A5 series or newer. npu_dynamic_dual_level_mx_quant and npu_dual_level_quant_matmul are not available on A2/A3. Note: Online MXFP4 weight quantization is experimental. The offline msmodelslim flow uses pre-quantized weights and may produce different numerical results.

MXFP4 Offline Quantization (msmodelslim)

Pre-quantized MXFP4 weights exported by msmodelslim are auto-detected via quant_model_description.json (W4A4_MXFP4 / W4A4_MXFP4_DUALSCALE scheme). Use wan_repack.py to convert the quantized weights to Diffusers format, then load the converted model with --model-path:

sglang generate \
  --model-path {path_to_converted_mxfp4_model} \
  --prompt "a beautiful sunset" \
  --save-output

The offline MXFP4 checkpoint stores weights in an FP8 container and includes dual-level scales (weight_scale, weight_dual_scale). If exported with smooth quantization, mul_scale is loaded and applied before activation quantization to keep activations aligned with the calibrated weights.
