## Overview
| Optimization | Type | Description |
|---|---|---|
| Cache-DiT | Caching | Block-level caching with DBCache, TaylorSeer, and SCM |
| TeaCache | Caching | Timestep-level caching based on temporal similarity |
| Attention Backends | Kernel | Optimized attention implementations (FlashAttention, SageAttention, etc.) |
| Profiling | Diagnostics | PyTorch Profiler and Nsight Systems guidance |
## Start Here
- Use Attention Backends to choose the best backend for your model and hardware.
- Use Caching Acceleration to reduce denoising cost with Cache-DiT or TeaCache.
- Use Profiling when you need to diagnose a bottleneck rather than guess.
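The effect of picking an attention backend can be previewed at the PyTorch level. The sketch below uses `torch.nn.attention.sdpa_kernel` (PyTorch >= 2.3) to pin scaled dot-product attention to the reference math backend; on suitable hardware you could pin `SDPBackend.FLASH_ATTENTION` instead. This is a generic PyTorch illustration of swapping attention implementations, not this project's own backend-selection flag.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Toy attention inputs: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 4, 16, 8)
k = torch.randn(1, 4, 16, 8)
v = torch.randn(1, 4, 16, 8)

# Force the reference (math) backend; swap in SDPBackend.FLASH_ATTENTION
# or SDPBackend.EFFICIENT_ATTENTION where the hardware supports them.
with sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v)

print(out.shape)  # output keeps the query's shape
```

Backends differ in speed and numerics, not in the attention they compute, so the output shape and (up to floating-point error) values are backend-independent.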
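For the diagnose-don't-guess workflow, a minimal PyTorch Profiler run looks like the sketch below. It profiles a stand-in CPU workload (a matrix multiply) and prints the operator table sorted by total CPU time; in a real run you would wrap your pipeline's forward pass and add `ProfilerActivity.CUDA`.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def step():
    # Stand-in for one denoising step / forward pass.
    a = torch.randn(256, 256)
    return a @ a

# Profile a few iterations; add ProfilerActivity.CUDA for GPU workloads.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(3):
        step()

# Top operators by total CPU time; the matmul shows up as aten::mm.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
```

The same `prof` object can also export a Chrome trace via `prof.export_chrome_trace("trace.json")` for timeline inspection.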
## Caching at a Glance
- Cache-DiT provides block-level caching for diffusers pipelines, with tuning knobs aimed at higher speedups.
- TeaCache provides timestep-level caching built into supported SGLang model families.
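The core idea behind timestep-level caching can be shown with a toy sketch: when the input to an expensive block barely changes between consecutive timesteps, reuse the previous output instead of recomputing. This is only an illustration of the similarity heuristic that methods like TeaCache are built on; the class, threshold value, and distance metric below are invented for the example, not the library's API.

```python
import math

def denoise_block(x):
    # Stand-in for an expensive transformer block.
    return [math.tanh(v) for v in x]

class SimilarityCache:
    """Reuse the previous output when the input changed less than a
    threshold since the last computed timestep (toy version of
    timestep-level caching; the 0.05 threshold is illustrative)."""

    def __init__(self, threshold=0.05):
        self.threshold = threshold
        self.prev_in = None
        self.prev_out = None
        self.hits = 0

    def __call__(self, x):
        if self.prev_in is not None:
            dist = max(abs(a - b) for a, b in zip(x, self.prev_in))
            if dist < self.threshold:
                self.hits += 1
                return self.prev_out  # inputs are similar: skip recompute
        self.prev_in = list(x)
        self.prev_out = denoise_block(x)
        return self.prev_out

cache = SimilarityCache()
x = [0.5, -0.2]
for t in range(10):
    x_t = [v + 0.001 * t for v in x]  # inputs drift slowly across timesteps
    out = cache(x_t)

print(cache.hits)  # most timesteps reuse the cached output
```

Real implementations refine this with learned or calibrated similarity indicators and per-block decisions, but the trade-off is the same: cache hits save compute at a small cost in fidelity.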
