> ## Documentation Index
> Fetch the complete documentation index at: https://docs.sglang.io/llms.txt
> Use this file to discover all available pages before exploring further.

# Performance Optimization

> Choose performance levers for SGLang Diffusion by latency, throughput, memory, and quality tradeoffs.

Use this page as the starting point for SGLang Diffusion performance work. It separates performance levers into two decision classes:

* **Output-preserving / lossless-style:** system settings that should preserve model behavior while changing residency, parallelism, kernels, or scheduling.
* **Quality-tradeoff / lossy or approximate:** techniques that can change the denoising path, numerical representation, or generated output.

The docs use "output-preserving" instead of promising bit-exact "lossless" because different kernels, GPU types, or precision paths can still introduce small numerical differences. The decision boundary is whether the optimization intentionally trades quality or output equivalence for speed.

## Start Here

1. Pick a serving or generation mode from [Deployment and Performance Modes](./deployment_cookbook). `--performance-mode auto` is the default; use `speed` when the model fits in GPU memory and latency matters most, `memory` when GPU memory is the bottleneck, and `manual` when every performance flag should be explicit.
2. Choose the right attention backend from [Attention Backends](./attention_backends).
3. Use [Sequence Parallelism](./ring_sp_performance) only when the model and video shape benefit from sequence splitting.
4. Use [Inference Batching](./dynamic_batching) for concurrent compatible requests during serving.
5. Use [Profiling](./profiling) before changing several levers at once.

## Output-Preserving / Lossless-Style Levers

These settings should preserve model behavior while changing residency, parallelism, kernels, or scheduling. They are the first choices for production tuning.

| Lever | Use when | Docs |
| --- | --- | --- |
| `--performance-mode` | You want a safe preset for speed or memory without overriding explicit flags. | [Deployment and Performance Modes](./deployment_cookbook) |
| Offload, FSDP, CFG parallelism | GPU memory, multi-GPU residency, or CFG branch splitting is the main bottleneck. | [Deployment and Performance Modes](./deployment_cookbook) |
| Sequence parallelism | Long image/video sequences need sequence-level parallelism. | [Sequence Parallelism](./ring_sp_performance) |
| Attention backend | Kernel choice dominates DiT latency or memory. | [Attention Backends](./attention_backends) |
| Dynamic batching | Serving many compatible requests concurrently. | [Inference Batching](./dynamic_batching) |
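
The `--performance-mode` guidance from "Start Here" can be read as a small decision rule. The sketch below is illustrative only: the `Workload` class and both helper functions are hypothetical names invented for this example and are not part of SGLang. Only the flag name and its four values (`auto`, `speed`, `memory`, `manual`) come from this page.

```python
from dataclasses import dataclass

# Values of --performance-mode named on this page.
MODES = ("auto", "speed", "memory", "manual")


@dataclass
class Workload:
    """Hypothetical description of a deployment; not an SGLang type."""
    model_fits_in_gpu_memory: bool
    latency_matters_most: bool
    gpu_memory_is_bottleneck: bool = False
    all_flags_explicit: bool = False  # operator sets every performance flag by hand


def pick_performance_mode(w: Workload) -> str:
    """Map the 'Start Here' guidance onto a --performance-mode value."""
    if w.all_flags_explicit:
        return "manual"
    if w.gpu_memory_is_bottleneck:
        return "memory"
    if w.model_fits_in_gpu_memory and w.latency_matters_most:
        return "speed"
    return "auto"  # the documented default


def performance_args(w: Workload) -> list[str]:
    """Return the CLI fragment to append to a launch command."""
    return ["--performance-mode", pick_performance_mode(w)]


if __name__ == "__main__":
    print(performance_args(Workload(model_fits_in_gpu_memory=True,
                                    latency_matters_most=True)))
```

The ordering of the checks is a design choice for this sketch: an explicit request for manual control wins, then memory pressure, then latency. When in doubt, the documented default `auto` is the fallback.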

## Quality-Tradeoff / Lossy Or Approximate Levers

These techniques can change the denoising path, numerical representation, or generated output. They are useful after you have a baseline and an acceptance criterion for quality.

| Lever | Tradeoff | Docs |
| --- | --- | --- |
| Cache-DiT | Skips selected DiT block or step computation based on cache decisions. | Cache-DiT |
| TeaCache | Reuses residuals when consecutive denoising steps are similar enough. | TeaCache |
| Progressive resolution | Runs early denoising at lower latent resolution for supported pipelines. | Progressive Resolution Generation |
| Quantization | Uses lower-precision transformer weights or activations. | Quantization |
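
An acceptance criterion for these levers can be as simple as a pixel-level comparison between a baseline output and the output produced with the lever enabled. The sketch below uses PSNR as one possible metric. The function names, the flat pixel-sequence input, and the 35 dB threshold are assumptions made for illustration; they are not SGLang APIs or recommended values, and a perceptual metric or human review may suit your workload better.

```python
import math
from typing import Sequence


def psnr(baseline: Sequence[float], candidate: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(baseline) == 0 or len(baseline) != len(candidate):
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((b - c) ** 2 for b, c in zip(baseline, candidate)) / len(baseline)
    if mse == 0:
        return math.inf  # identical outputs
    return 10.0 * math.log10(max_value ** 2 / mse)


def meets_quality_target(baseline: Sequence[float], candidate: Sequence[float],
                         min_psnr_db: float = 35.0) -> bool:
    """Hypothetical acceptance check; the threshold is a placeholder, not a recommendation."""
    return psnr(baseline, candidate) >= min_psnr_db


if __name__ == "__main__":
    base = [100.0, 120.0, 140.0, 160.0]   # flattened pixels from the baseline run
    cand = [101.0, 121.0, 141.0, 161.0]   # same prompt and seed, lever enabled
    print(meets_quality_target(base, cand))
```

Generate the baseline and the candidate with the same prompt, seed, resolution, and step count so that the lever under test is the only variable in the comparison.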

## Practical Order

1. Establish a baseline with the target model, resolution, frame count, step count, and GPU type.
2. Select `--performance-mode` and explicit residency or parallelism flags.
3. Tune attention backend and batching for the deployment pattern.
4. Profile if the bottleneck is unclear.
5. Add caching, progressive resolution, or quantization only after comparing output quality against your acceptance target.

## Diagnostics

[Profiling](./profiling) is not an optimization technique by itself. It belongs in the performance workflow because it tells you which stage, kernel, or denoising step is worth optimizing before you change multiple levers.

## References

* [Deployment and Performance Modes](./deployment_cookbook)
* [Attention Backends](./attention_backends)
* [Sequence Parallelism](./ring_sp_performance)
* [Caching Strategies](./caching-acceleration)
* [Profiling](./profiling)