Progressive resolution growing is an experimental feature for selected SGLang Diffusion pipelines. It runs the early denoising steps at a coarser latent resolution and spectrally upsamples the latent before the full-resolution steps. This reduces the quadratic attention cost of the DiT transformer. On the benchmark setups below, it yields up to 1.62× speedup on FLUX.1, 1.93× on FLUX.2, 2.33× on Z-Image, 2.78× on Wan 2.1 T2V, and 1.69× on Qwen-Image. The method is based on Spectral Progressive Diffusion (arXiv 2605.18736).
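
To make "spectrally upsamples" concrete, here is a minimal 1-D sketch that upsamples a signal by zero-padding its DCT spectrum. It is an illustration under assumptions, not the SGLang implementation: the function names are hypothetical, the real stage works on 2-D (or 3-D) latents on the GPU, and it may treat the new high-frequency band differently (for example by filling it with noise).

```python
import math

def dct_ortho(x):
    """Orthonormal DCT-II of a 1-D sequence (O(N^2), for illustration only)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out

def idct_ortho(coeffs):
    """Inverse of dct_ortho (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = coeffs[0] * math.sqrt(1.0 / n)
        total += sum(math.sqrt(2.0 / n) * coeffs[k]
                     * math.cos(math.pi * (i + 0.5) * k / n)
                     for k in range(1, n))
        out.append(total)
    return out

def spectral_upsample(x, factor=2):
    """Upsample by zero-padding the DCT spectrum; new high frequencies stay empty."""
    n, m = len(x), len(x) * factor
    gain = math.sqrt(m / n)  # keeps sample amplitudes unchanged under the orthonormal transform
    padded = [c * gain for c in dct_ortho(x)] + [0.0] * (m - n)
    return idct_ortho(padded)

# A low-frequency cosine sampled on 8 points, upsampled to 16 points.
coarse = [math.cos(math.pi * (i + 0.5) / 8) for i in range(8)]
fine = spectral_upsample(coarse)
```

In this toy case the upsampled signal matches the same cosine sampled on the finer grid, which is the property that lets coarse-resolution denoising carry over to the low-frequency part of the full-resolution latent.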

Overview

DiT attention is O(n²) in sequence length. Halving the spatial resolution in each dimension leaves a quarter of the tokens, so running the early denoising steps at half resolution cuts the attention cost for those steps to about 6% (1/16). The transition point, that is, how many steps run at each resolution, is computed from the Bayes-optimal frequency-activation criterion: frequencies that cannot be resolved at the coarse scale are not denoised there. The method is designed to preserve quality under this criterion, but generated outputs can still differ from the full-resolution baseline.
| Model | Full-res tokens | Half-res tokens | Token-step ratio |
| --- | --- | --- | --- |
| FLUX.1 1024×1024 | 4,096 | 1,024 | 4.0× |
| FLUX.2 1024×1024 | 4,096 | 1,024 | 4.0× |
| Z-Image 1024×1024 | 4,096 | 1,024 | 4.0× |
| Wan 2.1 T2V 480×832 (81 frames) | 6,240 | 1,560 | 4.0× |
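
The token counts and the attention saving can be reproduced with a few lines of arithmetic. This is a sketch for the FLUX.1 1024×1024 row, assuming a 128×128 latent and 2×2 patchification; the helper name is hypothetical.

```python
def token_count(latent_h, latent_w, patch=2):
    """Number of transformer tokens for a latent of the given size (assumes 2x2 patches)."""
    return (latent_h // patch) * (latent_w // patch)

full = token_count(128, 128)  # full-resolution stage
half = token_count(64, 64)    # coarse stage at half the spatial resolution

token_ratio = full / half                # how many times fewer tokens per coarse step
attention_fraction = (half / full) ** 2  # O(n^2) attention cost relative to full resolution

print(full, half, token_ratio, attention_fraction)
```

The attention fraction of 1/16 applies only to the attention term. Linear layers scale with the token count, so their cost drops to 1/4 rather than 1/16.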

Parameters

| Parameter | CLI flag | Default | Description |
| --- | --- | --- | --- |
| progressive_mode | --progressive-mode | "fullres" | "fullres" disables (identical to standard generation). "dct_rewind" enables spectral upsample with scheduler rewind (recommended). "dct" enables upsample without rewind. |
| progressive_levels | --progressive-levels | 1 | Number of resolution halvings. 1 = one coarse stage (64×64 latent → 128×128). 2 = two coarse stages (32×32 → 64×64 → 128×128). |
| progressive_delta | --progressive-delta | 0.01 | Noise-dominated tolerance δ. Controls how many steps run at coarse resolution. Higher δ = more coarse steps = more speedup. |
Tip: Add --dit-cpu-offload false to keep the transformer GPU-resident. With CPU offload each step pays a fixed PCIe transfer cost regardless of sequence length, which dilutes the speedup.
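
The effect of progressive_levels on the latent sizes can be sketched as follows. The helper is hypothetical, and the requirement that the full latent be divisible by 2 to the power of the level count is an assumption made for this illustration.

```python
def stage_latent_sizes(full_h, full_w, levels):
    """Return the latent size of each stage, from coarsest to full resolution."""
    if levels < 0:
        raise ValueError("levels must be non-negative")
    divisor = 2 ** levels
    if full_h % divisor or full_w % divisor:
        # Assumption: every halving must produce an integer latent size.
        raise ValueError(f"latent {full_h}x{full_w} cannot be halved {levels} time(s)")
    return [(full_h >> k, full_w >> k) for k in range(levels, -1, -1)]

print(stage_latent_sizes(128, 128, 1))  # one coarse stage
print(stage_latent_sizes(128, 128, 2))  # two coarse stages
```

With levels set to 0 the list contains only the full-resolution size, which corresponds to "fullres" behavior.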

FLUX.1

Usage

sglang generate \
    --model-path black-forest-labs/FLUX.1-dev \
    --prompt "A serene mountain lake at golden hour, photorealistic" \
    --num-inference-steps 50 \
    --dit-cpu-offload false \
    --progressive-mode dct_rewind \
    --progressive-levels 1 \
    --progressive-delta 0.05

Choosing delta

| δ | Stage split (50 steps total) | Denoising speedup |
| --- | --- | --- |
| 0.01 | 18 @ 64² + 32 @ 128² | 1.32× |
| 0.05 | 28 @ 64² + 22 @ 128² | 1.62× |
For most prompts, δ = 0.05 is recommended: of the two values measured above, it gives the larger speedup with no visible degradation.

Benchmark

Hardware: RTX A6000 48 GB, --dit-cpu-offload false. Timing = denoising loop only.
| Config | Stage split | Denoise | Speedup |
| --- | --- | --- | --- |
| Fullres (baseline) | 50 @ 128² latent | 36.65 s | 1.00× |
| dct_rewind L1 δ=0.01 | 18 @ 64² + 32 @ 128² | 27.67 s | 1.32× |
| dct_rewind L1 δ=0.05 | 28 @ 64² + 22 @ 128² | 22.58 s | 1.62× |
| dct_rewind L2 δ=0.01 | 10 @ 32² + 8 @ 64² + 32 @ 128² | 26.48 s | 1.38× |
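
The speedup column is the baseline denoising time divided by each configuration's denoising time. A short sketch that recomputes it from the timings in the table (the variable names are illustrative):

```python
baseline_s = 36.65  # fullres denoising time in seconds, from the table above

runs = {
    "dct_rewind L1 δ=0.01": 27.67,
    "dct_rewind L1 δ=0.05": 22.58,
    "dct_rewind L2 δ=0.01": 26.48,
}

# Speedup = baseline time / progressive time, rounded to two decimals as in the table.
speedups = {name: round(baseline_s / seconds, 2) for name, seconds in runs.items()}

for name, value in speedups.items():
    print(f"{name}: {value:.2f}×")
```

The same calculation applies to your own hardware: time the denoising loop with --progressive-mode fullres first, then divide by the progressive timing.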

Python API

from sglang.multimodal_gen import DiffGenerator

gen = DiffGenerator.from_pretrained(
    model_path="black-forest-labs/FLUX.1-dev",
    dit_cpu_offload=False,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 50,
    "height": 1024,
    "width": 1024,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.05,
})

FLUX.2

Supports FLUX.2-dev, FLUX.2-klein-4B, and FLUX.2-klein-9B.

Usage

sglang generate \
    --model-path black-forest-labs/FLUX.2-klein-4B \
    --prompt "A serene mountain lake at golden hour, photorealistic" \
    --num-inference-steps 30 \
    --dit-cpu-offload false \
    --progressive-mode dct_rewind \
    --progressive-levels 1 \
    --progressive-delta 0.10

Benchmark

Hardware: RTX A6000 48 GB, --dit-cpu-offload false. Model: FLUX.2-klein-4B, 30 steps, 1024×1024. Timing = denoising loop only, averaged across 10 diverse prompts.
| Config | Stage split | Denoise | Speedup |
| --- | --- | --- | --- |
| Fullres (baseline) | 30 @ 64² latent | 9.72 s | 1.00× |
| dct_rewind L1 δ=0.05 | 18 @ 32² + 12 @ 64² | 5.50 s | 1.77× |
| dct_rewind L1 δ=0.10 | 20 @ 32² + 10 @ 64² | 5.03 s | 1.93× |

Python API

from sglang.multimodal_gen import DiffGenerator

gen = DiffGenerator.from_pretrained(
    model_path="black-forest-labs/FLUX.2-klein-4B",
    dit_cpu_offload=False,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 30,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.10,
})

Wan 2.1 T2V

Supports Wan-AI/Wan2.1-T2V-1.3B-Diffusers and Wan-AI/Wan2.1-T2V-14B-Diffusers.
Note: Progressive generation grows only the spatial H×W dimensions. The temporal dimension T (number of latent frames) is kept fixed across all stages.
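
As a rough illustration of spatial-only growth, the sketch below derives the latent shape of each stage for a Wan 2.1 request. The helper is hypothetical, and the VAE strides (8× spatial, 4× temporal) are assumptions about the Wan VAE rather than values taken from this page.

```python
def wan_stage_shapes(num_frames, height, width, levels=1,
                     spatial_stride=8, temporal_stride=4):
    """Return (T, H, W) latent shapes from coarsest to full resolution.

    Assumes a causal video VAE: the first frame maps to one latent frame and
    every further `temporal_stride` frames add one more.
    """
    latent_t = (num_frames - 1) // temporal_stride + 1
    latent_h = height // spatial_stride
    latent_w = width // spatial_stride
    # Only H and W are halved per level; T is the same in every stage.
    return [(latent_t, latent_h >> k, latent_w >> k) for k in range(levels, -1, -1)]

for shape in wan_stage_shapes(num_frames=81, height=480, width=832):
    print(shape)
```

For 81 frames at 480×832 this gives a 30×52 coarse stage and a 60×104 full stage with the same number of latent frames in both.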

Usage

sglang generate \
    --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
    --prompt "A cheetah sprinting across the Serengeti at sunset, slow motion, photorealistic" \
    --num-inference-steps 50 \
    --num-frames 81 \
    --height 480 \
    --width 832 \
    --guidance-scale 5.0 \
    --flow-shift 5.0 \
    --dit-cpu-offload false \
    --progressive-mode dct_rewind \
    --progressive-levels 1 \
    --progressive-delta 0.05

Choosing delta

| δ | Stage split (50 steps total) | Denoising speedup |
| --- | --- | --- |
| 0.01 | 23 @ 30×52 + 27 @ 60×104 | 1.65× |
| 0.02 | 27 @ 30×52 + 23 @ 60×104 | 1.86× |
| 0.05 | 33 @ 30×52 + 17 @ 60×104 | 2.32× |
| 0.10 | 37 @ 30×52 + 13 @ 60×104 | 2.78× |
For most prompts, δ = 0.05 is recommended. δ = 0.10 provides the largest measured speedup but should be validated on motion-heavy scenes.
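
To reason about where these numbers come from, one can bracket the speedup with an idealized cost model. This is a sketch under a strong assumption, namely that per-step cost scales as a power of the token count and nothing else takes time. It is not how the values in the table were obtained.

```python
def ideal_speedup(coarse_steps, full_steps, exponent):
    """Speedup if per-step cost scales as tokens**exponent with no fixed overhead."""
    token_fraction = 0.25  # half resolution in H and W leaves a quarter of the tokens
    total_steps = coarse_steps + full_steps
    relative_cost = coarse_steps * token_fraction ** exponent + full_steps
    return total_steps / relative_cost

# δ = 0.05 row: 33 coarse steps + 17 full-resolution steps.
linear = ideal_speedup(33, 17, exponent=1)     # cost dominated by linear layers
quadratic = ideal_speedup(33, 17, exponent=2)  # cost dominated by attention

print(f"linear-cost model:    {linear:.2f}×")
print(f"quadratic-cost model: {quadratic:.2f}×")
```

For the four rows in the table above, the measured speedups fall between the linear and quadratic estimates, which is consistent with a workload that mixes attention and linear layers. Fixed per-step overheads, such as CPU offload transfers, would pull real results below the linear estimate.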

Python API

from sglang.multimodal_gen import DiffGenerator

gen = DiffGenerator.from_pretrained(
    model_path="Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    dit_cpu_offload=False,
    flow_shift=5.0,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A cheetah sprinting across the Serengeti at sunset, slow motion, photorealistic",
    "num_inference_steps": 50,
    "num_frames": 81,
    "height": 480,
    "width": 832,
    "guidance_scale": 5.0,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.05,
})

Z-Image

Supports Tongyi-MAI/Z-Image. Z-Image uses the same VAE as FLUX.1 (FluxVAEConfig), so the power-law spectrum constants are identical. The progressive stage handles Z-Image’s 5-D latent format [B, C, 1, H, W] with squeeze/unsqueeze hooks and recomputes caption+image RoPE positional embeddings on each stage transition.
Note: Always specify --height 1024 --width 1024 (or another resolution where H_lat and W_lat are both divisible by 2). Z-Image’s default resolution (360×640) produces a 45×80 latent where H=45 is not divisible by the patch size.
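
A quick way to check a resolution before launching a run is sketched below. The helper is hypothetical and assumes an 8× VAE downsampling factor, which matches the 360×640 → 45×80 example in the note.

```python
def is_progressive_compatible(height, width, vae_stride=8, divisor=2):
    """Return True if both latent dimensions are divisible by `divisor`."""
    if height % vae_stride or width % vae_stride:
        return False  # the pixel size does not map to an integer latent size
    latent_h = height // vae_stride
    latent_w = width // vae_stride
    return latent_h % divisor == 0 and latent_w % divisor == 0

print(is_progressive_compatible(1024, 1024))  # 128×128 latent
print(is_progressive_compatible(360, 640))    # 45×80 latent, H_lat is odd
```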

Usage

# Standard fullres — unchanged behavior
sglang generate --model-path Tongyi-MAI/Z-Image \
    --prompt "A serene mountain lake at golden hour, photorealistic" \
    --height 1024 --width 1024

# Progressive dct_rewind L1 δ=0.10 → 2.33× denoising speedup
sglang generate --model-path Tongyi-MAI/Z-Image \
    --prompt "A serene mountain lake at golden hour, photorealistic" \
    --height 1024 --width 1024 \
    --num-inference-steps 50 \
    --dit-cpu-offload false \
    --progressive-mode dct_rewind \
    --progressive-levels 1 \
    --progressive-delta 0.10

Choosing delta

| δ | Stage split (50 steps total) | Denoising speedup |
| --- | --- | --- |
| 0.01 | 26 @ 64² + 24 @ 128² | 1.53× |
| 0.05 | 35 @ 64² + 15 @ 128² | 2.03× |
| 0.10 | 42 @ 64² + 8 @ 128² | 2.33× |
Z-Image achieves higher progressive speedups than FLUX.1 at the same δ because it uses dual CFG (two forward passes per step), which doubles the absolute attention savings at coarse resolution. δ = 0.10 is the recommended tradeoff.

Python API

from sglang.multimodal_gen import DiffGenerator

gen = DiffGenerator.from_pretrained(
    model_path="Tongyi-MAI/Z-Image",
    dit_cpu_offload=False,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 50,
    "height": 1024,
    "width": 1024,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.10,
})

Qwen-Image

Qwen-Image uses the same 2×2 patchify convention as FLUX.1 (in_channels=64, C=16), so the same progressive stage wires in with model-specific hooks for RoPE (freqs_cis) and spatial metadata (img_shapes).

Usage

# Standard fullres — unchanged behavior
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A serene mountain lake at golden hour"

# Progressive dct_rewind L1 δ=0.20 → 1.69× denoising speedup
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A serene mountain lake at golden hour" \
    --progressive-mode dct_rewind --progressive-levels 1 --progressive-delta 0.20 \
    --num-inference-steps 30 --dit-cpu-offload false

Benchmark

Hardware: RTX A6000 48 GB, --dit-cpu-offload false. Timing = denoising loop only.
| Config | Stage split | Denoise | Speedup |
| --- | --- | --- | --- |
| Fullres (baseline) | 30 @ 128² | 43.00 s | 1.00× |
| dct_rewind L1 δ=0.05 | 13 @ 64² + 17 @ 128² | 33.25 s | 1.29× |
| dct_rewind L1 δ=0.10 | 16 @ 64² + 14 @ 128² | 33.86 s | 1.27× |
| dct_rewind L1 δ=0.20 | 19 @ 64² + 11 @ 128² | 25.40 s | 1.69× |

Limitations

  • Sequence parallelism incompatible. Cannot be combined with --ulysses-degree or --ring-degree. The stage raises a RuntimeError if SP is enabled.
  • torch.compile incompatible. Compiled kernels have a fixed sequence length; the resolution transition causes a recompile or error. Use progressive without --enable-torch-compile.
  • Cache-DiT interaction is experimental. The stage refreshes Cache-DiT context at resolution transitions, but quality and speedup should be benchmarked before relying on this combination.
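
If you build launch configurations programmatically, a pre-flight check along these lines can catch incompatible combinations before the pipeline starts. This is a hypothetical helper, not part of SGLang. The parameter names mirror the CLI flags above, and treating a degree of 1 as "sequence parallelism disabled" is an assumption.

```python
import warnings

def check_progressive_config(progressive_mode="fullres", ulysses_degree=1,
                             ring_degree=1, enable_torch_compile=False):
    """Hypothetical pre-flight check mirroring the limitations listed above."""
    if progressive_mode not in ("fullres", "dct", "dct_rewind"):
        raise ValueError(f"unknown progressive_mode: {progressive_mode!r}")
    if progressive_mode == "fullres":
        return  # standard generation, nothing to check
    if ulysses_degree > 1 or ring_degree > 1:
        # Same exception type the progressive stage is documented to raise.
        raise RuntimeError("progressive mode cannot be combined with sequence parallelism")
    if enable_torch_compile:
        warnings.warn("torch.compile may recompile or fail at the resolution transition")

check_progressive_config("dct_rewind")  # passes silently
```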

References