Overview
DiT attention is O(n²) in sequence length. Running the first N denoising steps at half the spatial resolution quarters the token count, which cuts the attention cost to ~6% (1/16) for those steps. The transition point, i.e. how many steps to run at each resolution, is computed from the Bayes-optimal frequency-activation criterion: frequencies that cannot be resolved at the coarse scale are not denoised there. The method is designed to preserve quality under this criterion, but generated outputs can still differ from the full-resolution baseline.

| Model | Full-res tokens | Half-res tokens | Token-step ratio |
|---|---|---|---|
| FLUX.1 1024×1024 | 4,096 | 1,024 | 4.0× |
| FLUX.2 1024×1024 | 4,096 | 1,024 | 4.0× |
| Z-Image 1024×1024 | 4,096 | 1,024 | 4.0× |
| Wan 2.1 T2V 480×832 (81 frames) | 6,240 | 1,560 | 4.0× |
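
The ~6% figure can be reproduced from the token counts in the table. The sketch below assumes a purely quadratic attention cost and ignores the linear-cost layers (MLP, projections), so it is an idealized estimate rather than a measurement; the function name is illustrative and not part of the library.

```python
def attention_cost_ratio(full_tokens: int, coarse_tokens: int) -> float:
    """Relative attention cost of one coarse step, assuming cost scales as n**2."""
    return (coarse_tokens / full_tokens) ** 2

# Token counts copied from the table above.
models = {
    "FLUX.1 1024x1024": (4096, 1024),
    "Wan 2.1 T2V 480x832": (6240, 1560),
}

for name, (full, coarse) in models.items():
    token_ratio = full / coarse
    attn = attention_cost_ratio(full, coarse)
    print(f"{name}: {token_ratio:.1f}x fewer tokens, "
          f"attention cost {attn:.2%} of a full-res step")
```

Because every row has the same 4.0× token ratio, the idealized per-step attention ratio is the same 1/16 for all listed models.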
Parameters
| Parameter | CLI flag | Default | Description |
|---|---|---|---|
| progressive_mode | --progressive-mode | "fullres" | "fullres" disables (identical to standard generation). "dct_rewind" enables spectral upsample with scheduler rewind (recommended). "dct" enables upsample without rewind. |
| progressive_levels | --progressive-levels | 1 | Number of resolution halvings. 1 = one coarse stage (64×64 latent → 128×128). 2 = two coarse stages (32×32 → 64×64 → 128×128). |
| progressive_delta | --progressive-delta | 0.01 | Noise-dominated tolerance δ. Controls how many steps run at coarse resolution. Higher δ = more coarse steps = more speedup. |
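
As a rough illustration of how progressive_levels maps to stage resolutions, the sketch below lists the latent side lengths from coarsest to finest. The helper name stage_sizes is hypothetical and only mirrors the description in the table; it is not the library's API.

```python
def stage_sizes(full_size: int, levels: int) -> list[int]:
    """Latent side lengths from coarsest to finest for `levels` halvings."""
    if levels < 0:
        raise ValueError("levels must be non-negative")
    if full_size % (2 ** levels) != 0:
        raise ValueError(f"{full_size} is not divisible by 2**{levels}")
    # Halve `levels` times for the first stage, then work back up to full size.
    return [full_size >> k for k in range(levels, -1, -1)]

print(stage_sizes(128, 1))  # [64, 128]
print(stage_sizes(128, 2))  # [32, 64, 128]
```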
Tip: Add --dit-cpu-offload false to keep the transformer GPU-resident. With CPU offload each step pays a fixed PCIe transfer cost regardless of sequence length, which dilutes the speedup.
FLUX.1
Usage
Choosing delta
| δ | Coarse steps (50 total) | Denoising speedup |
|---|---|---|
| 0.01 | 18 @ 64² + 32 @ 128² | 1.32× |
| 0.05 | 28 @ 64² + 22 @ 128² | 1.62× |
0.05 is recommended: of the two settings above, it gives the larger speedup with no visible degradation.
Benchmark
Hardware: RTX A6000 48 GB, --dit-cpu-offload false. Timing = denoising loop only.
| Config | Stage split | Denoise | Speedup |
|---|---|---|---|
| Fullres (baseline) | 50 @ 128² latent | 36.65 s | 1.00× |
| dct_rewind L1 δ=0.01 | 18@64² + 32@128² | 27.67 s | 1.32× |
| dct_rewind L1 δ=0.05 | 28@64² + 22@128² | 22.58 s | 1.62× |
| dct_rewind L2 δ=0.01 | 10@32² + 8@64² + 32@128² | 26.48 s | 1.38× |
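
To put the measured numbers in context, the sketch below computes two idealized speedups from a stage split: one assuming per-step cost is linear in token count, and one assuming it is purely quadratic (attention-dominated). Both cost models are assumptions made for illustration and ignore fixed overheads such as the resolution transition; the stage split comes from the table above, with a 64² latent stage holding a quarter of the tokens of a 128² stage.

```python
def ideal_speedup(stages, exponent):
    """Idealized speedup over running every step at full resolution.

    stages:   list of (num_steps, relative_tokens), relative_tokens = 1.0 at full res.
    exponent: 1 for a linear cost model, 2 for a quadratic (attention-only) model.
    """
    total_steps = sum(steps for steps, _ in stages)
    cost = sum(steps * rel ** exponent for steps, rel in stages)
    return total_steps / cost

# dct_rewind L1, delta = 0.05: 28 steps at 64² latent, 22 steps at 128² latent.
split = [(28, 0.25), (22, 1.0)]

print(f"linear model:    {ideal_speedup(split, 1):.2f}x")
print(f"quadratic model: {ideal_speedup(split, 2):.2f}x")
print("measured:        1.62x")
```

Under these assumptions the two models give roughly 1.72× and 2.11×. The measured 1.62× falls below both, which would be consistent with per-step costs that do not shrink with sequence length.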
Python API
FLUX.2
Supports FLUX.2-dev, FLUX.2-klein-4B, and FLUX.2-klein-9B.
Usage
Benchmark
Hardware: RTX A6000 48 GB, --dit-cpu-offload false. Model: FLUX.2-klein-4B, 30 steps, 1024×1024.
Timing = denoising loop only, averaged across 10 diverse prompts.
| Config | Stage split | Denoise | Speedup |
|---|---|---|---|
| Fullres (baseline) | 30 @ 64² latent | 9.72 s | 1.00× |
| dct_rewind L1 δ=0.05 | 18@32² + 12@64² | 5.50 s | 1.77× |
| dct_rewind L1 δ=0.10 | 20@32² + 10@64² | 5.03 s | 1.93× |
Python API
Wan 2.1 T2V
Supports Wan-AI/Wan2.1-T2V-1.3B-Diffusers and Wan-AI/Wan2.1-T2V-14B-Diffusers.
Note: Progressive generation grows only the spatial H×W dimensions. The temporal dimension T (number of latent frames) is kept fixed across all stages.
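
The note above can be made concrete with a small shape calculation. This is a sketch only: the helper name is hypothetical, and T = 21 latent frames for an 81-frame clip is an assumed value, not something this page states.

```python
def spatial_stage_shapes(t: int, h: int, w: int, levels: int = 1):
    """Return (T, H, W) latent shapes from coarsest to finest.

    Only H and W are halved; T is carried through every stage unchanged.
    """
    factor = 2 ** levels
    if h % factor or w % factor:
        raise ValueError("H and W must be divisible by 2**levels")
    return [(t, h >> k, w >> k) for k in range(levels, -1, -1)]

# 480x832 video -> 60x104 latent. T = 21 is an assumption for 81 input frames.
for shape in spatial_stage_shapes(t=21, h=60, w=104, levels=1):
    print(shape)
```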
Usage
Choosing delta
| δ | Coarse steps (50 total) | Denoising speedup |
|---|---|---|
| 0.01 | 23 @ 30×52 + 27 @ 60×104 | 1.65× |
| 0.02 | 27 @ 30×52 + 23 @ 60×104 | 1.86× |
| 0.05 | 33 @ 30×52 + 17 @ 60×104 | 2.32× |
| 0.10 | 37 @ 30×52 + 13 @ 60×104 | 2.78× |
0.05 is recommended. 0.10 provides maximum speedup but should be validated on motion-heavy scenes.
Python API
Z-Image
Supports Tongyi-MAI/Z-Image. Z-Image uses the same VAE as FLUX.1 (FluxVAEConfig), so the power-law spectrum constants are identical. The progressive stage handles Z-Image’s 5-D latent format [B, C, 1, H, W] with squeeze/unsqueeze hooks and recomputes caption+image RoPE positional embeddings on each stage transition.
Note: Always specify --height 1024 --width 1024 (or another resolution where H_lat and W_lat are both divisible by 2). Z-Image’s default resolution (360×640) produces a 45×80 latent where H=45 is not divisible by the patch size.
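
A pre-flight check along these lines can catch an unsuitable resolution before generation starts. The sketch assumes the VAE downsamples by 8 in each spatial dimension, which matches the 360×640 → 45×80 example in the note; the function name and constant are hypothetical.

```python
VAE_SCALE = 8  # assumed spatial downsampling factor (360x640 -> 45x80 in the note above)

def check_resolution(height: int, width: int, levels: int = 1) -> tuple[int, int]:
    """Return the latent (H, W), or raise if it cannot be halved `levels` times."""
    h_lat, w_lat = height // VAE_SCALE, width // VAE_SCALE
    factor = 2 ** levels
    if h_lat % factor or w_lat % factor:
        raise ValueError(
            f"{height}x{width} gives a {h_lat}x{w_lat} latent, "
            f"which is not divisible by {factor}"
        )
    return h_lat, w_lat

print(check_resolution(1024, 1024))  # (128, 128)
try:
    check_resolution(360, 640)       # 45x80 latent: 45 is odd
except ValueError as err:
    print(err)
```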
Usage
Choosing delta
| δ | Coarse steps (50 total) | Denoising speedup |
|---|---|---|
| 0.01 | 26 @ 64² + 24 @ 128² | 1.53× |
| 0.05 | 35 @ 64² + 15 @ 128² | 2.03× |
| 0.10 | 42 @ 64² + 8 @ 128² | 2.33× |
0.10 is the recommended tradeoff.
Python API
Qwen-Image
Qwen-Image uses the same 2×2 patchify convention as FLUX.1 (in_channels=64, C=16), so the same progressive stage wires in with model-specific hooks for RoPE (freqs_cis) and spatial metadata (img_shapes).
--dit-cpu-offload false. Timing = denoising loop only.
| Config | Stage split | Denoise | Speedup |
|---|---|---|---|
| Fullres (baseline) | 30 @ 128² | 43.00 s | 1.00× |
| dct_rewind L1 δ=0.05 | 13@64² + 17@128² | 33.25 s | 1.29× |
| dct_rewind L1 δ=0.10 | 16@64² + 14@128² | 33.86 s | 1.27× |
| dct_rewind L1 δ=0.20 | 19@64² + 11@128² | 25.40 s | 1.69× |
Limitations
- Sequence parallelism incompatible. Cannot be combined with --ulysses-degree or --ring-degree. The stage raises a RuntimeError if SP is enabled.
- torch.compile incompatible. Compiled kernels have a fixed sequence length; the resolution transition causes a recompile or error. Use progressive without --enable-torch-compile.
- Cache-DiT interaction is experimental. The stage refreshes Cache-DiT context at resolution transitions, but quality and speedup should be benchmarked before relying on this combination.
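
The first two limitations can be checked up front in a launcher script. The function below is a hypothetical sketch, not the library's own validation: its parameter names mirror the CLI flags, it treats a degree above 1 as "SP enabled" (an assumption), and raising ValueError for torch.compile is a choice made here rather than documented behavior.

```python
def validate_progressive_config(progressive_mode: str = "fullres",
                                ulysses_degree: int = 1,
                                ring_degree: int = 1,
                                enable_torch_compile: bool = False) -> None:
    """Hypothetical pre-flight check mirroring the limitations listed above."""
    if progressive_mode == "fullres":
        return  # progressive generation is disabled; nothing to check
    if ulysses_degree > 1 or ring_degree > 1:
        # The documented stage raises RuntimeError when SP is enabled.
        raise RuntimeError(
            "progressive generation cannot be combined with sequence parallelism"
        )
    if enable_torch_compile:
        # Sketch-only choice: fail early instead of hitting a recompile or error.
        raise ValueError(
            "progressive generation should be run without torch.compile"
        )

validate_progressive_config("dct_rewind")  # passes silently
```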
