# Caching Acceleration for Diffusion Models
SGLang provides multiple caching acceleration strategies for Diffusion Transformer (DiT) models. These strategies can significantly reduce inference time by skipping redundant computation.
## Overview
SGLang supports two complementary caching approaches:
| Strategy | Scope | Mechanism | Best For |
|---|---|---|---|
| Cache-DiT | Block-level | Skip individual transformer blocks dynamically | Advanced, higher speedup |
| TeaCache | Timestep-level | Skip entire denoising steps based on L1 similarity | Simple, built-in |
## Cache-DiT
Cache-DiT provides block-level caching with advanced strategies such as DBCache and TaylorSeer, and can achieve up to a 1.69x speedup.
See cache_dit.md for detailed configuration.
### Quick Start

```bash
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A beautiful sunset over the mountains"
```
### Key Features

- **DBCache**: Dynamic block-level caching based on residual differences (illustrated in the sketch after this list)
- **TaylorSeer**: Taylor expansion-based calibration for optimized caching
- **SCM**: Step-level computation masking for additional speedup
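The block-level idea behind DBCache can be summarized as: always compute a few "front" blocks, compare their residual with the previous step's, and skip the remaining blocks when the change is small. The sketch below is purely illustrative and is not the Cache-DiT or SGLang implementation; the class name, the `num_front_blocks` and `threshold` values, and the single-tensor block interface are all assumptions.

```python
import torch

class DBCacheSketch:
    """Illustrative block-level cache: always run a few "front" blocks,
    and skip the remaining blocks when their residual barely changed."""

    def __init__(self, blocks, num_front_blocks=4, threshold=0.08):
        self.blocks = blocks                  # list of transformer blocks (tensor -> tensor)
        self.num_front = num_front_blocks     # blocks that are always computed
        self.threshold = threshold            # relative-difference cutoff (hypothetical value)
        self.prev_front_residual = None       # front-block residual from the previous step
        self.cached_tail_residual = None      # cached residual of the skippable blocks

    def forward(self, hidden_states):
        # Run the front blocks; their residual serves as the "probe" signal.
        x = hidden_states
        for block in self.blocks[: self.num_front]:
            x = block(x)
        front_residual = x - hidden_states

        # Relative L1 difference between this step's probe and the previous one.
        can_skip = False
        if self.prev_front_residual is not None and self.cached_tail_residual is not None:
            diff = (front_residual - self.prev_front_residual).abs().mean()
            scale = self.prev_front_residual.abs().mean() + 1e-6
            can_skip = bool((diff / scale) < self.threshold)
        self.prev_front_residual = front_residual

        if can_skip:
            # Residuals barely moved: reuse the cached output of the remaining blocks.
            return x + self.cached_tail_residual

        # Otherwise compute the remaining blocks and refresh the cache.
        tail_in = x
        for block in self.blocks[self.num_front :]:
            x = block(x)
        self.cached_tail_residual = x - tail_in
        return x
```

The speedup then depends on how often the skip branch is taken, which is typically more frequent at later denoising steps where activations change slowly.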
## TeaCache
TeaCache (Timestep Embedding Aware Cache) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough to skip computation entirely.
See teacache.md for detailed documentation.
### Quick Overview

- Tracks the L1 distance between modulated inputs across timesteps
- When the accumulated distance stays below a threshold, reuses the cached residual instead of recomputing the step (see the sketch after this list)
- Supports CFG with separate positive/negative caches
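As a concrete illustration of that decision rule, the sketch below accumulates the relative L1 change of the modulated input and reuses the cached transformer residual while the accumulated change stays under a threshold. It is a minimal sketch, not SGLang's TeaCache implementation: the class name, the `branch` argument used to keep CFG caches separate, and the 0.1 threshold are assumptions.

```python
import torch

class TeaCacheSketch:
    """Illustrative timestep-level cache: accumulate the relative L1 change of the
    modulated input across denoising steps and skip the transformer when it is small."""

    def __init__(self, threshold=0.1):
        self.threshold = threshold      # hypothetical accumulation threshold
        self.prev_modulated = {}        # per-branch modulated input from the previous step
        self.accumulated = {}           # per-branch accumulated relative L1 distance
        self.cached_residual = {}       # per-branch cached transformer residual

    def step(self, modulated_input, hidden_states, run_transformer, branch="positive"):
        # Accumulate the relative L1 distance of the modulated input across steps.
        prev = self.prev_modulated.get(branch)
        if prev is not None:
            rel_l1 = (modulated_input - prev).abs().mean() / (prev.abs().mean() + 1e-6)
            self.accumulated[branch] = self.accumulated.get(branch, 0.0) + rel_l1.item()
        self.prev_modulated[branch] = modulated_input

        cached = self.cached_residual.get(branch)
        if cached is not None and self.accumulated.get(branch, 0.0) < self.threshold:
            # Steps have barely diverged: reuse the cached residual, skip the transformer.
            return hidden_states + cached

        # Compute normally, cache the residual, and reset the accumulator.
        output = run_transformer(hidden_states)
        self.cached_residual[branch] = output - hidden_states
        self.accumulated[branch] = 0.0
        return output
```

Under CFG, the conditional and unconditional passes would call `step` with `branch="positive"` and `branch="negative"` so their caches and accumulators never mix, mirroring the separate positive/negative caches noted above.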
### Supported Models

- Wan (wan2.1, wan2.2)
- Hunyuan (HunyuanVideo)
- Z-Image
For Flux and Qwen models, TeaCache is automatically disabled when CFG is enabled.