Cache-DiT Acceleration#
SGLang integrates Cache-DiT, a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 1.69x inference speedup with minimal quality loss.
Overview#
Cache-DiT combines three caching strategies to skip redundant computation in the denoising loop:
DBCache (Dual Block Cache): Dynamically decides when to cache transformer blocks based on residual differences
TaylorSeer: Uses Taylor expansion for calibration to optimize caching decisions
SCM (Step Computation Masking): Step-level caching control for additional speedup
Basic Usage#
Enable Cache-DiT by setting the SGLANG_CACHE_DIT_ENABLED environment variable when running sglang generate or sglang serve:
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A beautiful sunset over the mountains"
Diffusers Backend Configuration#
Cache-DiT can load acceleration settings from a custom YAML or JSON config file. For
diffusers pipelines, pass the file path via --cache-dit-config. This
flow requires cache-dit >= 1.2.0 (cache_dit.load_configs).
Single GPU inference#
Define a config.yaml file that contains:
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
Then apply the config with:
sglang generate --backend diffusers \
--model-path Qwen/Qwen-Image \
--cache-dit-config config.yaml \
--prompt "A beautiful sunset over the mountains"
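Since --cache-dit-config accepts a YAML or JSON path, the same settings could also be generated from a script. The following is a minimal sketch using only the standard library. It assumes the JSON layout mirrors the YAML keys one-to-one, and write_cache_dit_config is a hypothetical helper, not part of SGLang or cache-dit.

```python
import json
from pathlib import Path

# Values copied from the config.yaml example above.
CACHE_CONFIG = {
    "max_warmup_steps": 8,
    "warmup_interval": 2,
    "max_cached_steps": -1,
    "max_continuous_cached_steps": 2,
    "Fn_compute_blocks": 1,
    "Bn_compute_blocks": 0,
    "residual_diff_threshold": 0.12,
    "enable_taylorseer": True,
    "taylorseer_order": 1,
}


def write_cache_dit_config(path: str) -> Path:
    """Write the cache config as JSON.

    Hypothetical helper: assumes the JSON structure mirrors the YAML keys.
    """
    out = Path(path)
    out.write_text(json.dumps({"cache_config": CACHE_CONFIG}, indent=2))
    return out
```

The returned path can then be passed to --cache-dit-config in place of config.yaml.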
Distributed inference#
Define a parallel_config.yaml file that contains:
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
parallelism_config:
  ulysses_size: auto
  parallel_kwargs:
    attention_backend: native
    extra_parallel_modules: ["text_encoder", "vae"]
Setting ulysses_size: auto lets cache-dit detect the world_size automatically. To pin the
value instead, set it to a specific integer (e.g., 4).
Then apply the distributed config with:
sglang generate --backend diffusers \
--model-path Qwen/Qwen-Image \
--cache-dit-config parallel_config.yaml \
--prompt "A futuristic cityscape at sunset"
Advanced Configuration#
DBCache Parameters#
DBCache controls block-level caching behavior:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Fn | SGLANG_CACHE_DIT_FN | 1 | Number of first blocks to always compute |
| Bn | SGLANG_CACHE_DIT_BN | 0 | Number of last blocks to always compute |
| W | SGLANG_CACHE_DIT_WARMUP | 4 | Warmup steps before caching starts |
| R | SGLANG_CACHE_DIT_RDT | 0.24 | Residual difference threshold |
| MC | SGLANG_CACHE_DIT_MC | 3 | Maximum continuous cached steps |
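To make the roles of W, R, and MC concrete, here is a simplified sketch of a per-step caching decision. It is an illustration under stated assumptions, not cache-dit's implementation: the relative L1 metric and the function names are hypothetical, and the Fn/Bn block selection is left out.

```python
from typing import Sequence


def relative_l1_diff(prev: Sequence[float], curr: Sequence[float]) -> float:
    """Total absolute change relative to the magnitude of the previous residual."""
    num = sum(abs(c - p) for p, c in zip(prev, curr))
    den = sum(abs(p) for p in prev)
    return num / den if den > 0 else float("inf")


def can_reuse_cache(
    step: int,
    prev_residual: Sequence[float],
    curr_residual: Sequence[float],
    continuous_cached: int,
    warmup_steps: int = 4,      # W
    threshold: float = 0.24,    # R
    max_continuous: int = 3,    # MC
) -> bool:
    """Return True if this step may reuse cached block outputs (assumed rule)."""
    if step < warmup_steps:
        return False  # always compute during warmup
    if continuous_cached >= max_continuous:
        return False  # force a fresh computation after MC cached steps in a row
    return relative_l1_diff(prev_residual, curr_residual) < threshold
```

Under this rule, raising R lets more steps pass the check (faster, lower fidelity), while lowering MC bounds how long stale outputs can be reused.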
TaylorSeer Configuration#
TaylorSeer improves caching accuracy using Taylor expansion:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Enable | SGLANG_CACHE_DIT_TAYLORSEER | false | Enable TaylorSeer calibrator |
| Order | SGLANG_CACHE_DIT_TS_ORDER | 1 | Taylor expansion order (1 or 2) |
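The effect of the expansion order can be illustrated with a generic finite-difference Taylor extrapolation. This is a sketch on scalar values with equally spaced steps, assumed for illustration only; taylor_extrapolate is a hypothetical name and the code is not taken from TaylorSeer or cache-dit.

```python
from typing import Sequence


def taylor_extrapolate(
    history: Sequence[float], order: int = 1, steps_ahead: int = 1
) -> float:
    """Predict a future value from past values using backward differences."""
    if order not in (1, 2):
        raise ValueError("order must be 1 or 2")
    if len(history) < order + 1:
        raise ValueError("need at least order + 1 past values")

    latest = history[-1]
    first_diff = history[-1] - history[-2]          # approximates f'
    estimate = latest + steps_ahead * first_diff
    if order == 2:
        second_diff = history[-1] - 2 * history[-2] + history[-3]  # approximates f''
        estimate += 0.5 * steps_ahead**2 * second_diff
    return estimate
```

Order 1 follows the most recent trend linearly. Order 2 adds a curvature term, which needs one more stored value per feature.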
Combined Configuration Example#
DBCache and TaylorSeer are complementary strategies, so you can configure both sets of parameters at the same time:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
sglang generate --model-path black-forest-labs/FLUX.1-dev \
--prompt "A curious raccoon in a forest"
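When launching from a script, the same variables can be assembled programmatically. The sketch below uses the variable names from the example above and the defaults listed in the parameter tables. The helper names are hypothetical, and passing "false" to disable TaylorSeer is an assumption.

```python
import os
from typing import Dict, List, Tuple


def build_cache_dit_env(
    fn: int = 1,
    bn: int = 0,
    warmup: int = 4,
    rdt: float = 0.24,
    mc: int = 3,
    taylorseer: bool = False,
    ts_order: int = 1,
) -> Dict[str, str]:
    """Map DBCache/TaylorSeer settings to SGLANG_CACHE_DIT_* variables."""
    return {
        "SGLANG_CACHE_DIT_ENABLED": "true",
        "SGLANG_CACHE_DIT_FN": str(fn),
        "SGLANG_CACHE_DIT_BN": str(bn),
        "SGLANG_CACHE_DIT_WARMUP": str(warmup),
        "SGLANG_CACHE_DIT_RDT": str(rdt),
        "SGLANG_CACHE_DIT_MC": str(mc),
        # Assumption: "false" is accepted as the disabled value.
        "SGLANG_CACHE_DIT_TAYLORSEER": "true" if taylorseer else "false",
        "SGLANG_CACHE_DIT_TS_ORDER": str(ts_order),
    }


def build_invocation(
    model_path: str, prompt: str, **overrides
) -> Tuple[List[str], Dict[str, str]]:
    """Return the sglang generate command and the environment to run it in."""
    env = {**os.environ, **build_cache_dit_env(**overrides)}
    cmd = ["sglang", "generate", "--model-path", model_path, "--prompt", prompt]
    return cmd, env
```

The result can be run with subprocess.run(cmd, env=env, check=True).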
SCM (Step Computation Masking)#
SCM provides step-level caching control for additional speedup. It decides which denoising steps are computed in full and which reuse cached results.
SCM Presets
SCM is configured with presets:
| Preset | Compute Ratio | Speed | Quality |
|---|---|---|---|
| | 100% | Baseline | Best |
| | ~75% | ~1.3x | High |
| | ~50% | ~2x | Good |
| | ~35% | ~3x | Acceptable |
| | ~25% | ~4x | Lower |
Usage
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_PRESET=medium \
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A futuristic cityscape at sunset"
Custom SCM Bins
For fine-grained control over which steps are computed and which are cached:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_COMPUTE_BINS="8,3,3,2,2" \
SGLANG_CACHE_DIT_SCM_CACHE_BINS="1,2,2,2,3" \
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A futuristic cityscape at sunset"
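One way to read the two bin lists is as alternating runs of computed and cached steps. The sketch below expands them into a per-step mask under that assumption, with bins interleaved and starting from a compute run. The actual expansion in cache-dit may differ, and both function names are hypothetical.

```python
from typing import List


def parse_bins(value: str) -> List[int]:
    """Parse a comma-separated bin string such as "8,3,3,2,2"."""
    return [int(part) for part in value.split(",") if part.strip()]


def build_step_mask(compute_bins: List[int], cache_bins: List[int]) -> List[int]:
    """Expand bins into a mask: 1 = compute the step, 0 = reuse the cache.

    Assumption: bins alternate compute, cache, compute, cache, ...
    """
    mask: List[int] = []
    for i, n_compute in enumerate(compute_bins):
        mask.extend([1] * n_compute)
        if i < len(cache_bins):
            mask.extend([0] * cache_bins[i])
    return mask


mask = build_step_mask(parse_bins("8,3,3,2,2"), parse_bins("1,2,2,2,3"))
compute_ratio = sum(mask) / len(mask)
```

Under this reading, the bins in the example above cover 28 steps, 18 of which are computed.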
SCM Policy
| Policy | Env Variable | Description |
|---|---|---|
| | | Adaptive caching based on content (default) |
| | | Fixed caching pattern |
Environment Variables#
All Cache-DiT parameters can be configured via environment variables. See Environment Variables for the complete list.
Supported Models#
SGLang Diffusion x Cache-DiT supports almost all models originally supported in SGLang Diffusion:
| Model Family | Example Models |
|---|---|
| Wan | Wan2.1, Wan2.2 |
| Flux | FLUX.1-dev, FLUX.2-dev |
| Z-Image | Z-Image-Turbo |
| Qwen | Qwen-Image, Qwen-Image-Edit |
| Hunyuan | HunyuanVideo |
Performance Tips#
Start with defaults: The default parameters work well for most models
Use TaylorSeer: It typically improves both speed and quality
Tune R threshold: Lower values = better quality, higher values = faster
SCM for extra speed: Use the medium preset for a good speed/quality balance
Warmup matters: Higher warmup = more stable caching decisions
Limitations#
SGLang-native pipelines: Distributed support (TP/SP) is not yet validated; Cache-DiT will be automatically disabled when world_size > 1.
SCM minimum steps: SCM requires >= 8 inference steps to be effective
Model support: Only models registered in Cache-DiT’s BlockAdapterRegister are supported
Troubleshooting#
Distributed environment warning#
WARNING: cache-dit is disabled in distributed environment (world_size=N)
This is expected behavior. For SGLang-native pipelines, Cache-DiT currently supports only single-GPU inference.
SCM disabled for low step count#
For models with < 8 inference steps (e.g., DMD distilled models), SCM will be automatically disabled. DBCache acceleration still works.
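The two automatic fallbacks described under Limitations and Troubleshooting can be summarized as a small decision function. This is a sketch of the documented behavior for SGLang-native pipelines only. plan_cache_dit and CacheDitPlan are hypothetical names, not SGLang APIs.

```python
from dataclasses import dataclass

MIN_SCM_STEPS = 8  # SCM threshold stated in the Limitations section


@dataclass
class CacheDitPlan:
    cache_enabled: bool  # whether Cache-DiT (DBCache) stays active
    scm_enabled: bool    # whether SCM stays active on top of it


def plan_cache_dit(
    world_size: int, num_inference_steps: int, scm_requested: bool
) -> CacheDitPlan:
    """Decide which features remain active for an SGLang-native pipeline."""
    # Cache-DiT is disabled entirely in a distributed environment.
    cache_enabled = world_size <= 1
    # SCM needs Cache-DiT to be active and at least 8 inference steps.
    scm_enabled = (
        cache_enabled
        and scm_requested
        and num_inference_steps >= MIN_SCM_STEPS
    )
    return CacheDitPlan(cache_enabled, scm_enabled)
```

For example, a 4-step distilled model on a single GPU keeps DBCache but drops SCM, while any run with world_size > 1 drops both.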