SGLang integrates Cache-DiT, a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 1.69x inference speedup with minimal quality loss.

Overview

Cache-DiT uses intelligent caching strategies to skip redundant computation in the denoising loop:
  • DBCache (Dual Block Cache): Dynamically decides when to cache transformer blocks based on residual differences
  • TaylorSeer: Uses Taylor expansion for calibration to optimize caching decisions
  • SCM (Step Computation Masking): Step-level caching control for additional speedup
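Conceptually, the DBCache decision reduces to a residual-difference test: a block's cached output is reused only when the current residual is close to the one that was cached. The sketch below is illustrative only, not the actual Cache-DiT implementation; the helper name and plain-list math are assumptions for clarity.

```python
# Illustrative sketch of the DBCache idea (not the actual Cache-DiT code):
# reuse a block's cached output when the relative difference between the
# current residual and the previously cached residual is below a threshold.

def should_reuse_cache(prev_residual, curr_residual, threshold=0.12):
    """Return True if the residual changed little enough to reuse the cache."""
    diff = sum(abs(a - b) for a, b in zip(prev_residual, curr_residual))
    norm = sum(abs(a) for a in prev_residual) or 1.0  # avoid division by zero
    return diff / norm < threshold

# Early denoising steps change quickly -> compute; later steps drift slowly -> cache.
print(should_reuse_cache([1.0, 2.0], [0.2, 1.0]))    # large change -> False
print(should_reuse_cache([1.0, 2.0], [1.02, 1.98]))  # tiny change  -> True
```

Lowering the threshold makes caching more conservative (better quality, less speedup), which mirrors the `residual_diff_threshold` knob used in the configs below.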

Basic Usage

Enable Cache-DiT by exporting the environment variable and using sglang generate or sglang serve:
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A beautiful sunset over the mountains"

Diffusers Backend

Cache-DiT supports loading acceleration configs from a custom YAML file. For diffusers pipelines (diffusers backend), pass the YAML/JSON path via --cache-dit-config. This flow requires cache-dit >= 1.2.0 (cache_dit.load_configs).

Single GPU inference

  • DBCache + TaylorSeer
Define a cache.yaml file that contains:
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
Then apply the config with:
sglang generate \
  --backend diffusers \
  --model-path Qwen/Qwen-Image \
  --cache-dit-config cache.yaml \
  --prompt "A beautiful sunset over the mountains"
  • DBCache + TaylorSeer + SCM (Step Computation Masking)
Config
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
  # num_inference_steps must be set for SCM; the step computation mask is
  # generated automatically from this value.
  # Reference: https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/#scm-steps-computation-masking
  num_inference_steps: 28
  steps_computation_mask: fast
  • DBCache + TaylorSeer + SCM (Step Computation Masking) + Cache CFG
Config
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
  num_inference_steps: 28
  steps_computation_mask: fast
  enable_sperate_cfg: true # e.g., Qwen-Image, Wan, Chroma, Ovis-Image, etc.

Distributed inference

  • 1D Parallelism
Define a parallelism-only config file parallel.yaml that contains:
Config
parallelism_config:
  ulysses_size: auto
  attention_backend: native
Here, ulysses_size: auto means cache-dit will auto-detect the world_size and use it as the ulysses_size; otherwise, set it to a specific integer, e.g., 4. Then apply the distributed config with (note: add --num-gpus N to specify the number of GPUs for distributed inference):
sglang generate \
  --backend diffusers \
  --num-gpus 4 \
  --model-path Qwen/Qwen-Image \
  --cache-dit-config parallel.yaml \
  --prompt "A futuristic cityscape at sunset"
  • 2D Parallelism
You can also define a 2D parallelism config file parallel_2d.yaml that contains:
Config
parallelism_config:
  ulysses_size: auto
  tp_size: 2
  attention_backend: native
Then, apply the 2D parallelism config from yaml. Here, tp_size: 2 enables tensor parallelism with size 2, and ulysses_size: auto means cache-dit will use world_size // tp_size as the ulysses_size.
  • 3D Parallelism
You can also define a 3D parallelism config file parallel_3d.yaml that contains:
Config
parallelism_config:
  ulysses_size: 2
  ring_size: 2
  tp_size: 2
  attention_backend: native
Then, apply the 3D parallelism config from yaml. This combines Ulysses parallelism, ring parallelism, and tensor parallelism, each with size 2, for a total world size of 8.
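The relationship between these sizes and the GPU count can be sketched as simple arithmetic; the helper below is hypothetical (not part of cache-dit's API) and only illustrates how "auto" resolves given the documented semantics.

```python
# Hypothetical sketch of how the parallel sizes compose (not cache-dit's code):
# ulysses_size * ring_size * tp_size must equal the total world size, and
# "auto" resolves to whatever factor remains after the explicit sizes.

def resolve_ulysses_size(world_size, ulysses_size="auto", ring_size=1, tp_size=1):
    if ulysses_size == "auto":
        assert world_size % (ring_size * tp_size) == 0, "sizes must divide world size"
        return world_size // (ring_size * tp_size)
    assert ulysses_size * ring_size * tp_size == world_size
    return ulysses_size

print(resolve_ulysses_size(4))                             # 1D: auto -> 4
print(resolve_ulysses_size(8, tp_size=2))                  # 2D: auto -> 4
print(resolve_ulysses_size(8, 2, ring_size=2, tp_size=2))  # 3D: 2*2*2 == 8 -> 2
```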
  • Ulysses Anything Attention
To enable Ulysses Anything Attention, define a parallelism config file parallel_uaa.yaml that contains:
Config
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  ulysses_anything: true
  • Ulysses FP8 Communication
For devices without NVLink support, you can enable Ulysses FP8 communication to further reduce communication overhead. Define a parallelism config file parallel_fp8.yaml that contains:
Config
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  ulysses_float8: true
  • Async Ulysses CP
You can also enable async Ulysses CP (context parallelism) to overlap communication and computation. Define a parallelism config file parallel_async.yaml that contains:
Config
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  ulysses_async: true # Currently only supported for FLUX.1, Qwen-Image, Ovis-Image and Z-Image.
Then, apply the config from yaml. Here, ulysses_async: true enables async Ulysses CP.
  • TE-P and VAE-P
You can also specify extra parallel modules in the yaml config. For example, define a parallelism config file parallel_extra.yaml that contains:
Config
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  extra_parallel_modules: ["text_encoder", "vae"]

Hybrid Cache and Parallelism

Define a hybrid cache-and-parallelism acceleration config file hybrid.yaml that contains:
Config
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  extra_parallel_modules: ["text_encoder", "vae"]
Then, apply the hybrid cache and parallel acceleration config from yaml.
sglang generate \
  --backend diffusers \
  --num-gpus 4 \
  --model-path Qwen/Qwen-Image \
  --cache-dit-config hybrid.yaml \
  --prompt "A beautiful sunset over the mountains"

Attention Backend

In some cases, you may want to specify only the attention backend, without any other optimization configs. In this case, define a yaml file attention.yaml that contains only:
Config
attention_backend: "flash" # '_flash_3' for Hopper

Quantization

You can also specify the quantization config in the yaml file; this requires torchao >= 0.16.0. For example, define a yaml file quantize.yaml that contains:
Config
quantize_config: # quantization configuration for transformer modules
  # float8 (DQ), float8_weight_only, float8_blockwise, int8 (DQ), int8_weight_only, etc.
  quant_type: "float8"
  # layers to exclude from quantization (transformer). Layers whose names contain
  # any of the keywords in the exclude_layers list will be excluded. This is useful
  # for sensitive layers that are not robust to quantization, e.g., embedding layers.
  exclude_layers:
    - "embedder"
    - "embed"
  verbose: false # whether to print verbose logs during quantization
Then, apply the quantization config from yaml. Also enable torch.compile for better performance when using quantization. For example:
Command
sglang generate \
  --backend diffusers \
  --model-path Qwen/Qwen-Image \
  --warmup \
  --cache-dit-config quantize.yaml \
  --enable-torch-compile \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --prompt "A beautiful sunset over the mountains"

Combined Configs: Cache + Parallelism + Quantization

You can also combine all the above configs together in a single yaml file combined.yaml that contains:
Config
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  extra_parallel_modules: ["text_encoder", "vae"]
quantize_config:
  quant_type: "float8"
  exclude_layers:
    - "embedder"
    - "embed"
  verbose: false
Then, apply the combined cache, parallelism and quantization config from yaml. Please also enable torch.compile for better performance if you are using quantization.

Advanced Configuration

DBCache Parameters

DBCache controls block-level caching behavior:
| Parameter | Env Variable | Default | Description |
| --- | --- | --- | --- |
| Fn | SGLANG_CACHE_DIT_FN | 1 | Number of first blocks to always compute |
| Bn | SGLANG_CACHE_DIT_BN | 0 | Number of last blocks to always compute |
| W | SGLANG_CACHE_DIT_WARMUP | 4 | Warmup steps before caching starts |
| R | SGLANG_CACHE_DIT_RDT | 0.24 | Residual difference threshold |
| MC | SGLANG_CACHE_DIT_MC | 3 | Maximum continuous cached steps |

TaylorSeer Configuration

TaylorSeer improves caching accuracy using Taylor expansion:
| Parameter | Env Variable | Default | Description |
| --- | --- | --- | --- |
| Enable | SGLANG_CACHE_DIT_TAYLORSEER | false | Enable TaylorSeer calibrator |
| Order | SGLANG_CACHE_DIT_TS_ORDER | 1 | Taylor expansion order (1 or 2) |
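Conceptually, a Taylor calibrator forecasts how a cached feature would have evolved instead of reusing it verbatim. The following finite-difference sketch illustrates that idea; it is not the actual TaylorSeer code, and the function name is an assumption.

```python
# Illustrative Taylor-expansion calibration (not the actual TaylorSeer code):
# forecast the next feature value from finite differences of past computed values.

def taylor_predict(history, order=1):
    """Forecast the next feature value from previously computed values."""
    if order == 1 and len(history) >= 2:
        # first-order: f(t+1) ~= f(t) + (f(t) - f(t-1))
        return history[-1] + (history[-1] - history[-2])
    if order == 2 and len(history) >= 3:
        # second-order backward-difference extrapolation (exact for quadratics)
        first = history[-1] - history[-2]
        second = history[-1] - 2 * history[-2] + history[-3]
        return history[-1] + first + second
    return history[-1]  # not enough history: fall back to plain reuse

print(taylor_predict([1.0, 2.0, 3.0]))           # linear trend -> 4.0
print(taylor_predict([1.0, 4.0, 9.0], order=2))  # quadratic trend -> 16.0
```

This is why TaylorSeer can improve quality at a given cache rate: cached steps track the feature trajectory rather than freezing it.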

Combined Configuration Example

DBCache and TaylorSeer are complementary strategies that work together; you can configure both sets of parameters simultaneously:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
sglang generate --model-path black-forest-labs/FLUX.1-dev \
    --prompt "A curious raccoon in a forest"

SCM (Step Computation Masking)

SCM provides step-level caching control for additional speedup: it decides which denoising steps are computed fully and which reuse cached results.
SCM Presets
SCM is configured with presets:
| Preset | Compute Ratio | Speed | Quality |
| --- | --- | --- | --- |
| none | 100% | Baseline | Best |
| slow | ~75% | ~1.3x | High |
| medium | ~50% | ~2x | Good |
| fast | ~35% | ~3x | Acceptable |
| ultra | ~25% | ~4x | Lower |
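The Speed column roughly follows from the Compute Ratio alone: if only a fraction r of the denoising work is computed, the idealized speedup is 1/r. The sketch below ignores the residual cost of cached steps, which is why real speedups are somewhat lower.

```python
# Idealized speedup implied by a compute ratio, ignoring cache overhead.
def ideal_speedup(compute_ratio):
    return 1.0 / compute_ratio

for preset, ratio in [("none", 1.00), ("slow", 0.75), ("medium", 0.50),
                      ("fast", 0.35), ("ultra", 0.25)]:
    # e.g. "fast" prints ~2.9x, close to the table's ~3x
    print(f"{preset}: ~{ideal_speedup(ratio):.1f}x")
```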
Usage
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_PRESET=medium \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A futuristic cityscape at sunset"
Custom SCM Bins
For fine-grained control over which steps are computed vs. cached:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_COMPUTE_BINS="8,3,3,2,2" \
SGLANG_CACHE_DIT_SCM_CACHE_BINS="1,2,2,2,3" \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A futuristic cityscape at sunset"
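These bins appear to describe alternating runs of computed and cached steps; under that reading, the bins above expand into a 28-step mask, matching the num_inference_steps: 28 used in the SCM configs earlier. The sketch below is hypothetical, not Cache-DiT's actual parsing code.

```python
# Hypothetical expansion of SCM compute/cache bins into a per-step mask
# (illustrative only; see the Cache-DiT docs for the exact semantics).

def expand_bins(compute_bins, cache_bins):
    """Interleave compute/cache run lengths into a per-step boolean mask."""
    mask = []
    for compute, cache in zip(compute_bins, cache_bins):
        mask += [True] * compute   # steps computed fully
        mask += [False] * cache    # steps served from cache
    return mask

mask = expand_bins([8, 3, 3, 2, 2], [1, 2, 2, 2, 3])
print(len(mask))  # 28 total steps
print(sum(mask))  # 18 computed steps (~64% compute ratio)
```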
SCM Policy
| Policy | Env Variable | Description |
| --- | --- | --- |
| dynamic | SGLANG_CACHE_DIT_SCM_POLICY=dynamic | Adaptive caching based on content (default) |
| static | SGLANG_CACHE_DIT_SCM_POLICY=static | Fixed caching pattern |

Environment Variables

All Cache-DiT parameters can be configured via environment variables. See Environment Variables for the complete list.

Supported Models

SGLang Diffusion x Cache-DiT supports almost all models originally supported in SGLang Diffusion:
| Model Family | Example Models |
| --- | --- |
| Wan | Wan2.1, Wan2.2 |
| Flux | FLUX.1-dev, FLUX.2-dev |
| Z-Image | Z-Image-Turbo |
| Qwen | Qwen-Image, Qwen-Image-Edit |
| Hunyuan | HunyuanVideo |

Performance Tips

  1. Start with defaults: The default parameters work well for most models
  2. Use TaylorSeer: It typically improves both speed and quality
  3. Tune R threshold: Lower values = better quality, higher values = faster
  4. SCM for extra speed: Use medium preset for good speed/quality balance
  5. Warmup matters: Higher warmup = more stable caching decisions

Limitations

  • SGLang-native pipelines: Distributed support (TP/SP) is not yet validated; Cache-DiT will be automatically disabled when world_size > 1.
  • SCM minimum steps: SCM requires >= 8 inference steps to be effective
  • Model support: Only models registered in Cache-DiT’s BlockAdapterRegister are supported

Troubleshooting

SCM disabled for low step count

For models with < 8 inference steps (e.g., DMD distilled models), SCM will be automatically disabled. DBCache acceleration still works.
