# Caching Acceleration for Diffusion Models

SGLang provides multiple caching acceleration strategies for Diffusion Transformer (DiT) models. These strategies can significantly reduce inference time by skipping redundant computation.

## Overview

SGLang supports two complementary caching approaches:

| Strategy  | Scope          | Mechanism                                          | Best For                 |
|-----------|----------------|----------------------------------------------------|--------------------------|
| Cache-DiT | Block-level    | Skip individual transformer blocks dynamically     | Advanced, higher speedup |
| TeaCache  | Timestep-level | Skip entire denoising steps based on L1 similarity | Simple, built-in         |

## Cache-DiT

Cache-DiT provides block-level caching with advanced strategies like DBCache and TaylorSeer. It can achieve up to 1.69x speedup.

See cache_dit.md for detailed configuration.

### Quick Start

```bash
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A beautiful sunset over the mountains"
```
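The same switch can also be set from a small launcher script instead of inline on the command line. The sketch below only illustrates that pattern; it assumes the `sglang` CLI shown above is on the `PATH`, and the model path and prompt simply mirror the command above.

```python
import os
import subprocess

# Enable Cache-DiT for the child process via the environment variable,
# then invoke the same CLI command as above.
env = {**os.environ, "SGLANG_CACHE_DIT_ENABLED": "true"}
subprocess.run(
    [
        "sglang", "generate",
        "--model-path", "Qwen/Qwen-Image",
        "--prompt", "A beautiful sunset over the mountains",
    ],
    env=env,
    check=True,
)
```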

### Key Features

- DBCache: Dynamic block-level caching based on residual differences (see the sketch after this list)
- TaylorSeer: Taylor expansion-based calibration for optimized caching
- SCM: Step-level computation masking for additional speedup
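To make the DBCache idea concrete, the sketch below shows how block-level residual caching can work in principle. It is a simplified illustration, not the cache-dit implementation: the `threshold` value, the `cache` dictionary layout, and the use of a relative L1 metric are assumptions made for this example.

```python
import torch


def run_block_with_dbcache(block, hidden_states: torch.Tensor,
                           cache: dict, threshold: float = 0.05) -> torch.Tensor:
    """Illustrative block-level cache: skip a transformer block when its
    input has barely changed since the last fully computed step."""
    prev_input = cache.get("prev_input")
    if prev_input is not None:
        # Relative L1 difference between the current and cached block inputs.
        rel_diff = ((hidden_states - prev_input).abs().mean()
                    / (prev_input.abs().mean() + 1e-8)).item()
        if rel_diff < threshold:
            # Input is close enough: reuse the cached residual and skip the block.
            return hidden_states + cache["prev_residual"]

    # Otherwise run the block and refresh the cached input and residual.
    output = block(hidden_states)
    cache["prev_input"] = hidden_states.detach()
    cache["prev_residual"] = (output - hidden_states).detach()
    return output
```

In practice each transformer block would keep its own cache, and the threshold trades output quality against speedup.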

## TeaCache

TeaCache (Timestep Embedding Aware Cache) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough to skip computation entirely.

See teacache.md for detailed documentation.

### Quick Overview

- Tracks the L1 distance between modulated inputs across consecutive timesteps
- Reuses the cached residual when the accumulated distance stays below a threshold (see the sketch after this list)
- Supports CFG with separate positive/negative caches
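The skip decision can be sketched as follows. This is a minimal illustration of the accumulated-distance rule described above, not SGLang's internal API: the `threshold` default, the `state` dictionary, and the use of a relative L1 metric are assumptions made for this example.

```python
import torch


def teacache_should_skip(modulated_input: torch.Tensor,
                         state: dict, threshold: float = 0.1) -> bool:
    """Accumulate the (relative) L1 distance between modulated inputs of
    consecutive timesteps; skip a step while the sum stays below the
    threshold, and reset the accumulator whenever a full step is computed."""
    prev = state.get("prev_modulated_input")
    state["prev_modulated_input"] = modulated_input.detach()

    if prev is None:
        return False  # The first timestep is always computed.

    rel_l1 = ((modulated_input - prev).abs().mean()
              / (prev.abs().mean() + 1e-8)).item()
    state["accumulated_distance"] = state.get("accumulated_distance", 0.0) + rel_l1

    if state["accumulated_distance"] < threshold:
        return True  # Reuse the cached residual for this timestep.

    state["accumulated_distance"] = 0.0
    return False  # Compute the full step and refresh the cache.
```

With CFG, two such state dictionaries would be kept, one for the positive (conditional) branch and one for the negative (unconditional) branch, matching the separate caches mentioned above.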

### Supported Models

- Wan (wan2.1, wan2.2)
- Hunyuan (HunyuanVideo)
- Z-Image

For Flux and Qwen models, TeaCache is automatically disabled when CFG is enabled.

## References