Performance Optimization#

SGLang-Diffusion provides multiple performance optimization strategies to accelerate inference. This section covers all available performance tuning options.

Overview#

| Optimization | Type | Description |
| --- | --- | --- |
| Cache-DiT | Caching | Block-level caching with DBCache, TaylorSeer, and SCM |
| TeaCache | Caching | Timestep-level caching using L1 similarity |
| Attention Backends | Kernel | Optimized attention implementations (FlashAttention, SageAttention, etc.) |
| Profiling | Diagnostics | PyTorch Profiler and Nsight Systems guidance |

Caching Strategies#

SGLang supports two complementary caching approaches:

Cache-DiT#

Cache-DiT provides block-level caching with several configurable strategies and can achieve up to a 1.69x speedup.

Quick Start:

# Enable Cache-DiT for a single generation via an environment variable
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A beautiful sunset over the mountains"

Key Features:

  • DBCache: Dynamic block-level caching based on residual differences (a minimal sketch follows this list)

  • TaylorSeer: Taylor expansion-based calibration for optimized caching

  • SCM: Step-level computation masking for additional speedup
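
The residual-difference idea is easy to see in isolation. Below is a minimal, self-contained sketch of block-level caching, not Cache-DiT's actual implementation: the threshold value, the relative-L1 metric, and all names (rel_l1, BlockCache, forward_block) are illustrative assumptions.

import torch

def rel_l1(curr: torch.Tensor, prev: torch.Tensor) -> float:
    # Relative L1 difference between the current and cached block inputs.
    return ((curr - prev).abs().mean() / (prev.abs().mean() + 1e-8)).item()

class BlockCache:
    # Reuse a block's cached residual when its input has barely changed
    # since the step where the block was last fully computed.
    def __init__(self, threshold: float = 0.05):  # assumed value
        self.threshold = threshold
        self.cached_input = None
        self.cached_residual = None

    def forward_block(self, block, hidden_states: torch.Tensor) -> torch.Tensor:
        if (
            self.cached_input is not None
            and rel_l1(hidden_states, self.cached_input) < self.threshold
        ):
            # Close enough to the cached input: skip the block, replay the residual.
            return hidden_states + self.cached_residual
        residual = block(hidden_states) - hidden_states  # full computation
        self.cached_input = hidden_states.detach()
        self.cached_residual = residual.detach()
        return hidden_states + residual

# Usage: cache = BlockCache(); out = cache.forward_block(some_block, x)

TaylorSeer and SCM refine when and how such cached residuals are reused on top of this basic pattern.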

See Cache-DiT Documentation for detailed configuration.

TeaCache#

TeaCache (Timestep Embedding Aware Cache) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough to skip computation entirely.

Quick Overview:

  • Tracks the L1 distance between modulated inputs across consecutive timesteps

  • When the accumulated distance stays below a threshold, reuses the cached residual instead of recomputing (sketched after this list)

  • Supports CFG with separate caches for the positive and negative branches
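
A minimal sketch of this skip rule, under stated assumptions: the threshold value, the relative-L1 accumulation, and the names (TeaCacheState, should_compute) are illustrative, not SGLang's internals.

import torch

class TeaCacheState:
    # Accumulate the relative L1 distance between consecutive modulated
    # inputs; recompute only once the accumulated drift crosses a threshold.
    def __init__(self, threshold: float = 0.1):  # assumed value
        self.threshold = threshold
        self.accumulated = 0.0
        self.prev = None

    def should_compute(self, modulated: torch.Tensor) -> bool:
        if self.prev is None:
            should = True  # first step: nothing cached yet
        else:
            delta = (modulated - self.prev).abs().mean()
            self.accumulated += (delta / (self.prev.abs().mean() + 1e-8)).item()
            should = self.accumulated >= self.threshold
        if should:
            self.accumulated = 0.0  # reset after a full forward pass
        self.prev = modulated.detach()
        return should

# Toy driver: identical inputs step to step, so everything after the
# first step is skipped and the cached residual would be replayed.
state = TeaCacheState(threshold=0.1)
x = torch.randn(1, 16, 64)
for step in range(5):
    print(step, "compute" if state.should_compute(x) else "skip")

Under CFG, the positive and negative branches would each carry their own state object, which is what the separate positive/negative caches above refer to.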

Supported Models: Wan (wan2.1, wan2.2), Hunyuan (HunyuanVideo), Z-Image

See TeaCache Documentation for detailed configuration.

Attention Backends#

Different attention backends offer varying performance characteristics depending on your hardware and model; a backend-selection sketch follows this list:

  • FlashAttention: Generally the fastest on NVIDIA GPUs with fp16/bf16

  • SageAttention: Quantization-based kernels that trade a small amount of precision for additional speed

  • xformers: Memory-efficient attention

  • SDPA: PyTorch native scaled dot-product attention
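
Backend selection can be illustrated with PyTorch's native SDPA, which exposes the same idea through torch.nn.attention.sdpa_kernel (PyTorch 2.3+). This is standard PyTorch, not an SGLang-specific API; the tensor shapes are arbitrary, and a CUDA GPU is assumed.

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Toy q/k/v in the (batch, heads, seq_len, head_dim) layout SDPA expects.
q, k, v = (
    torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
    for _ in range(3)
)

# Default: PyTorch dispatches to the fastest kernel it can use here.
out = F.scaled_dot_product_attention(q, k, v)

# Pin the FlashAttention kernel explicitly; the call fails if that kernel
# cannot run with the current hardware, dtype, or shapes.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)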

See Attention Backends for platform support and configuration options.

Profiling#

To diagnose performance bottlenecks, SGLang-Diffusion works with two standard profiling tools:

  • PyTorch Profiler: Built-in Python-level profiling (a minimal example follows this list)

  • Nsight Systems: GPU kernel-level analysis
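
For the PyTorch Profiler route, a minimal pattern looks like the following; the workload function is a hypothetical stand-in for whatever generation call you want to profile, and a CUDA GPU is assumed.

import torch
from torch.profiler import ProfilerActivity, profile

def workload():
    # Stand-in for the actual SGLang-Diffusion generation you want to profile.
    x = torch.randn(1024, 1024, device="cuda")
    for _ in range(10):
        x = x @ x

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    workload()

# Top CUDA-time consumers, plus a trace viewable in chrome://tracing or Perfetto.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")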

See Profiling Guide for detailed instructions.
