Diffusion Models#
SGLang Diffusion is an inference framework for accelerated image and video generation using diffusion models. It provides an end-to-end unified pipeline with optimized kernels from sgl-kernel and an efficient scheduler loop.
Key Features#
Broad Model Support: Wan series, FastWan series, Hunyuan, Qwen-Image, Qwen-Image-Edit, Flux, Z-Image, GLM-Image, and more
Fast Inference: Optimized kernels from sgl-kernel, efficient scheduler loop, and Cache-DiT acceleration
Ease of Use: OpenAI-compatible API, CLI, and Python SDK
Multi-Platform: NVIDIA GPUs (H100, H200, A100, B200, 4090) and AMD GPUs (MI300X, MI325X)
Install SGLang-diffusion#
You can install sglang-diffusion using one of the methods below.
This page primarily applies to common NVIDIA GPU platforms. For AMD Instinct/ROCm environments see the dedicated ROCm quickstart, which lists the exact steps (including kernel builds) we used to validate sgl-diffusion on MI300X.
Method 1: With pip or uv#
It is recommended to use uv for a faster installation:
pip install --upgrade pip
pip install uv
uv pip install "sglang[diffusion]" --prerelease=allow
Method 2: From source#
# Use the latest release branch
git clone https://github.com/sgl-project/sglang.git
cd sglang
# Install the Python packages
pip install --upgrade pip
pip install -e "python[diffusion]"
# With uv
uv pip install -e "python[diffusion]" --prerelease=allow
Method 3: Using Docker#
The Docker images are available on Docker Hub at lmsysorg/sglang, built from the Dockerfile.
Replace <secret> below with your HuggingFace Hub token.
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:dev \
sglang generate --model-path black-forest-labs/FLUX.1-dev \
--prompt "A logo With Bold Large text: SGL Diffusion" \
--save-output
ROCm quickstart for sgl-diffusion#
docker run --device=/dev/kfd --device=/dev/dri --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env HF_TOKEN=<secret> \
lmsysorg/sglang:v0.5.5.post2-rocm700-mi30x \
sglang generate --model-path black-forest-labs/FLUX.1-dev --prompt "A logo With Bold Large text: SGL Diffusion" --save-output
Compatibility Matrix#
The tables below list the supported models and the optimizations available for each.
The symbols used have the following meanings:
✅ = Full compatibility
❌ = No compatibility
⭕ = Does not apply to this model
Models x Optimization#
The HuggingFace Model ID can be passed directly to from_pretrained() methods, and sglang-diffusion will use the optimal default parameters when initializing the pipeline and generating outputs.
Video Generation Models#
| Model Name | Hugging Face Model ID | Resolutions | TeaCache | Sliding Tile Attn | Sage Attn | Video Sparse Attention (VSA) |
|---|---|---|---|---|---|---|
| FastWan2.1 T2V 1.3B |  | 480p | ⭕ | ⭕ | ⭕ | ✅ |
| FastWan2.2 TI2V 5B Full Attn |  | 720p | ⭕ | ⭕ | ⭕ | ✅ |
| Wan2.2 TI2V 5B |  | 720p | ⭕ | ⭕ | ✅ | ⭕ |
| Wan2.2 T2V A14B |  | 480p | ❌ | ❌ | ✅ | ⭕ |
| Wan2.2 I2V A14B |  | 480p | ❌ | ❌ | ✅ | ⭕ |
| HunyuanVideo |  | 720×1280 | ❌ | ✅ | ✅ | ⭕ |
| FastHunyuan |  | 720×1280 | ❌ | ✅ | ✅ | ⭕ |
| Wan2.1 T2V 1.3B |  | 480p | ✅ | ✅ | ✅ | ⭕ |
| Wan2.1 T2V 14B |  | 480p, 720p | ✅ | ✅ | ✅ | ⭕ |
| Wan2.1 I2V 480P |  | 480p | ✅ | ✅ | ✅ | ⭕ |
| Wan2.1 I2V 720P |  | 720p | ✅ | ✅ | ✅ | ⭕ |
Note: Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue.
Image Generation Models#
| Model Name | HuggingFace Model ID | Resolutions |
|---|---|---|
| FLUX.1-dev |  | Any resolution |
| FLUX.2-dev |  | Any resolution |
| FLUX.2-Klein |  | Any resolution |
| Z-Image-Turbo |  | Any resolution |
| GLM-Image |  | Any resolution |
| Qwen Image |  | Any resolution |
| Qwen Image 2512 |  | Any resolution |
| Qwen Image Edit |  | Any resolution |
Verified LoRA Examples#
This section lists example LoRAs that have been explicitly tested and verified with each base model in the SGLang Diffusion pipeline.
Important:
LoRAs that are not listed here are not necessarily incompatible. In practice, most standard LoRAs are expected to work, especially those following common Diffusers or SD-style conventions. The entries below simply reflect configurations that have been manually validated by the SGLang team.
Verified LoRAs by Base Model#
| Base Model | Supported LoRAs |
|---|---|
| Wan2.2 |  |
| Wan2.1 |  |
| Z-Image-Turbo |  |
| Qwen-Image |  |
| Qwen-Image-Edit |  |
| Flux |  |
Special Requirements#
[!NOTE] Sliding Tile Attention: Currently, only Hopper GPUs (H100s) are supported.
SGLang diffusion CLI Inference#
The SGLang-diffusion CLI provides a quick way to access the inference pipeline for image and video generation.
Prerequisites#
- A working SGLang diffusion installation and the `sglang` CLI available in `$PATH`.
- Python 3.11+ if you plan to use the OpenAI Python SDK.
Supported Arguments#
Server Arguments#
- `--model-path {MODEL_PATH}`: Path to the model or model ID
- `--vae-path {VAE_PATH}`: Path to a custom VAE model or HuggingFace model ID (e.g., `fal/FLUX.2-Tiny-AutoEncoder`). If not specified, the VAE will be loaded from the main model path.
- `--lora-path {LORA_PATH}`: Path to a LoRA adapter (local path or HuggingFace model ID). If not specified, LoRA will not be applied.
- `--lora-nickname {NAME}`: Nickname for the LoRA adapter (default: `default`).
- `--num-gpus {NUM_GPUS}`: Number of GPUs to use
- `--tp-size {TP_SIZE}`: Tensor parallelism size (only for the encoder; should not be larger than 1 if text encoder offload is enabled, as layer-wise offload plus prefetch is faster)
- `--sp-degree {SP_SIZE}`: Sequence parallelism size (typically should match the number of GPUs)
- `--ulysses-degree {ULYSSES_DEGREE}`: The degree of DeepSpeed-Ulysses-style SP in USP
- `--ring-degree {RING_DEGREE}`: The degree of ring attention-style SP in USP
Sampling Parameters#
- `--prompt {PROMPT}`: Text description for the video you want to generate
- `--num-inference-steps {STEPS}`: Number of denoising steps
- `--negative-prompt {PROMPT}`: Negative prompt to guide generation away from certain concepts
- `--seed {SEED}`: Random seed for reproducible generation
Image/Video Configuration#
- `--height {HEIGHT}`: Height of the generated output
- `--width {WIDTH}`: Width of the generated output
- `--num-frames {NUM_FRAMES}`: Number of frames to generate
- `--fps {FPS}`: Frames per second for the saved output, if this is a video-generation task
Output Options#
- `--output-path {PATH}`: Directory to save the generated video
- `--save-output`: Whether to save the image/video to disk
- `--return-frames`: Whether to return the raw frames
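As an illustration, the following command combines the argument groups above into a single run; the model ID is one used elsewhere in this document, and the prompt, resolution, frame count, and output directory are placeholder values to replace with your own:
sglang generate \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --num-gpus 1 \
  --prompt "A curious raccoon" \
  --negative-prompt "blurry, low quality" \
  --num-inference-steps 50 \
  --seed 42 \
  --height 480 \
  --width 832 \
  --num-frames 81 \
  --fps 16 \
  --save-output \
  --output-path outputs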
Using Configuration Files#
Instead of specifying all parameters on the command line, you can use a configuration file:
sglang generate --config {CONFIG_FILE_PATH}
The configuration file should be in JSON or YAML format with the same parameter names as the CLI options. Command-line arguments take precedence over settings in the configuration file, allowing you to override specific values while keeping the rest from the configuration file.
Example configuration file (config.json):
{
"model_path": "FastVideo/FastHunyuan-diffusers",
"prompt": "A beautiful woman in a red dress walking down a street",
"output_path": "outputs/",
"num_gpus": 2,
"sp_size": 2,
"tp_size": 1,
"num_frames": 45,
"height": 720,
"width": 1280,
"num_inference_steps": 6,
"seed": 1024,
"fps": 24,
"precision": "bf16",
"vae_precision": "fp16",
"vae_tiling": true,
"vae_sp": true,
"vae_config": {
"load_encoder": false,
"load_decoder": true,
"tile_sample_min_height": 256,
"tile_sample_min_width": 256
},
"text_encoder_precisions": [
"fp16",
"fp16"
],
"mask_strategy_file_path": null,
"enable_torch_compile": false
}
Or using YAML format (config.yaml):
model_path: "FastVideo/FastHunyuan-diffusers"
prompt: "A beautiful woman in a red dress walking down a street"
output_path: "outputs/"
num_gpus: 2
sp_size: 2
tp_size: 1
num_frames: 45
height: 720
width: 1280
num_inference_steps: 6
seed: 1024
fps: 24
precision: "bf16"
vae_precision: "fp16"
vae_tiling: true
vae_sp: true
vae_config:
  load_encoder: false
  load_decoder: true
  tile_sample_min_height: 256
  tile_sample_min_width: 256
text_encoder_precisions:
- "fp16"
- "fp16"
mask_strategy_file_path: null
enable_torch_compile: false
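Because command-line arguments take precedence over the configuration file, you can keep a base config and override individual values per run; for example (a sketch using the config.json above):
# Reuse config.json but override the step count and seed for this run
sglang generate --config config.json --num-inference-steps 10 --seed 42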
To see all the options, you can use the --help flag:
sglang generate --help
Serve#
Launch the SGLang diffusion HTTP server and interact with it using the OpenAI SDK and curl.
Start the server#
Use the following command to launch the server:
SERVER_ARGS=(
--model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
--text-encoder-cpu-offload
--pin-cpu-memory
--num-gpus 4
--ulysses-degree=2
--ring-degree=2
)
sglang serve "${SERVER_ARGS[@]}"
- `--model-path`: Which model to load. The example uses `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`.
- `--port`: HTTP port to listen on (default: 30000). The SERVER_ARGS above do not set it; the OpenAI API examples later in this document pass `--port 30010`.
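Once the server is up, you can confirm which model it is serving via the GET /models endpoint described in the OpenAI API section below (a minimal check; it assumes the default port 30000, since the SERVER_ARGS above do not pass --port):
curl -sS -X GET "http://localhost:30000/models"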
For detailed API usage, including Image, Video Generation and LoRA management, please refer to the OpenAI API Documentation.
Generate#
Run a one-off generation task without launching a persistent server.
To use it, pass both server arguments and sampling parameters in one command, after the generate subcommand, for example:
SERVER_ARGS=(
--model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
--text-encoder-cpu-offload
--pin-cpu-memory
--num-gpus 4
--ulysses-degree=2
--ring-degree=2
)
SAMPLING_ARGS=(
--prompt "A curious raccoon"
--save-output
--output-path outputs
--output-file-name "A curious raccoon.mp4"
)
sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
# Alternatively, set the SGLANG_CACHE_DIT_ENABLED environment variable to true to enable Cache-DiT acceleration
SGLANG_CACHE_DIT_ENABLED=true sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
Once the generation task has finished, the server will shut down automatically.
[!NOTE] The HTTP server-related arguments are ignored in this subcommand.
Diffusers Backend#
SGLang diffusion supports a diffusers backend that allows you to run any diffusers-compatible model through SGLang’s infrastructure using vanilla diffusers pipelines. This is useful for running models without native SGLang implementations or models with custom pipeline classes.
Arguments#
| Argument | Values | Description |
|---|---|---|
| `--backend` | `diffusers` | Run the model through a vanilla diffusers pipeline instead of a native SGLang implementation. |
| `--diffusers-attention-backend` |  | Attention backend for diffusers pipelines. See diffusers attention backends. |
| `--trust-remote-code` | flag | Required for models with custom pipeline classes (e.g., Ovis). |
|  | flag | Enable VAE tiling for large image support (decodes tile-by-tile). |
|  | flag | Enable VAE slicing for lower memory usage (decodes slice-by-slice). |
|  |  | Precision for the diffusion transformer. |
|  |  | Precision for the VAE. |
Example: Running Ovis-Image-7B#
Ovis-Image-7B is a 7B text-to-image model optimized for high-quality text rendering.
sglang generate \
--model-path AIDC-AI/Ovis-Image-7B \
--backend diffusers \
--trust-remote-code \
--diffusers-attention-backend flash \
--prompt "A serene Japanese garden with cherry blossoms" \
--height 1024 \
--width 1024 \
--num-inference-steps 30 \
--save-output \
--output-path outputs \
--output-file-name ovis_garden.png
Extra Diffusers Arguments#
For pipeline-specific parameters not exposed via CLI, use diffusers_kwargs in a config file:
{
"model_path": "AIDC-AI/Ovis-Image-7B",
"backend": "diffusers",
"prompt": "A beautiful landscape",
"diffusers_kwargs": {
"cross_attention_kwargs": {"scale": 0.5}
}
}
sglang generate --config config.json
SGLang Diffusion OpenAI API#
The SGLang diffusion HTTP server implements an OpenAI-compatible API for image and video generation, as well as LoRA adapter management.
Serve#
Launch the server using the sglang serve command.
Start the server#
SERVER_ARGS=(
--model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
--text-encoder-cpu-offload
--pin-cpu-memory
--num-gpus 4
--ulysses-degree=2
--ring-degree=2
--port 30010
)
sglang serve "${SERVER_ARGS[@]}"
- `--model-path`: Path to the model or model ID.
- `--port`: HTTP port to listen on (default: 30000).
Get Model Information#
Endpoint: GET /models
Returns information about the model served by this server, including model path, task type, pipeline configuration, and precision settings.
Curl Example:
curl -sS -X GET "http://localhost:30010/models"
Response Example:
{
"model_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
"task_type": "T2V",
"pipeline_name": "wan_pipeline",
"pipeline_class": "WanPipeline",
"num_gpus": 4,
"dit_precision": "bf16",
"vae_precision": "fp16"
}
Endpoints#
Image Generation#
The server implements an OpenAI-compatible Images API under the /v1/images namespace.
Create an image#
Endpoint: POST /v1/images/generations
Python Example (b64_json response):
import base64
from openai import OpenAI
client = OpenAI(api_key="sk-proj-1234567890", base_url="http://localhost:30010/v1")
img = client.images.generate(
    prompt="A calico cat playing a piano on stage",
    size="1024x1024",
    n=1,
    response_format="b64_json",
)
image_bytes = base64.b64decode(img.data[0].b64_json)
with open("output.png", "wb") as f:
    f.write(image_bytes)
Curl Example:
curl -sS -X POST "http://localhost:30010/v1/images/generations" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-proj-1234567890" \
-d '{
"prompt": "A calico cat playing a piano on stage",
"size": "1024x1024",
"n": 1,
"response_format": "b64_json"
}'
Note: The `response_format=url` option is not supported for `POST /v1/images/generations` and will return a `400` error.
Edit an image#
Endpoint: POST /v1/images/edits
This endpoint accepts a multipart form upload with input images and a text prompt. The server can return either a base64-encoded image or a URL to download the image.
Curl Example (b64_json response):
curl -sS -X POST "http://localhost:30010/v1/images/edits" \
-H "Authorization: Bearer sk-proj-1234567890" \
-F "image=@local_input_image.png" \
-F "url=image_url.jpg" \
-F "prompt=A calico cat playing a piano on stage" \
-F "size=1024x1024" \
-F "response_format=b64_json"
Curl Example (URL response):
curl -sS -X POST "http://localhost:30010/v1/images/edits" \
-H "Authorization: Bearer sk-proj-1234567890" \
-F "image=@local_input_image.png" \
-F "url=image_url.jpg" \
-F "prompt=A calico cat playing a piano on stage" \
-F "size=1024x1024" \
-F "response_format=url"
Download image content#
When response_format=url is used with POST /v1/images/edits, the API returns a relative URL like /v1/images/<IMAGE_ID>/content.
Endpoint: GET /v1/images/{image_id}/content
Curl Example:
curl -sS -L "http://localhost:30010/v1/images/<IMAGE_ID>/content" \
-H "Authorization: Bearer sk-proj-1234567890" \
-o output.png
Video Generation#
The server implements a subset of the OpenAI Videos API under the /v1/videos namespace.
Create a video#
Endpoint: POST /v1/videos
Python Example:
from openai import OpenAI
client = OpenAI(api_key="sk-proj-1234567890", base_url="http://localhost:30010/v1")
video = client.videos.create(
    prompt="A calico cat playing a piano on stage",
    size="1280x720",
)
print(f"Video ID: {video.id}, Status: {video.status}")
Curl Example:
curl -sS -X POST "http://localhost:30010/v1/videos" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-proj-1234567890" \
-d '{
"prompt": "A calico cat playing a piano on stage",
"size": "1280x720"
}'
List videos#
Endpoint: GET /v1/videos
Python Example:
videos = client.videos.list()
for item in videos.data:
    print(item.id, item.status)
Curl Example:
curl -sS -X GET "http://localhost:30010/v1/videos" \
-H "Authorization: Bearer sk-proj-1234567890"
Download video content#
Endpoint: GET /v1/videos/{video_id}/content
Python Example:
import time
# video_id is the id returned by the create call above (video.id)
video_id = video.id
# Poll for completion
while True:
    page = client.videos.list()
    item = next((v for v in page.data if v.id == video_id), None)
    if item and item.status == "completed":
        break
    time.sleep(5)
# Download content
resp = client.videos.download_content(video_id=video_id)
with open("output.mp4", "wb") as f:
    f.write(resp.read())
Curl Example:
curl -sS -L "http://localhost:30010/v1/videos/<VIDEO_ID>/content" \
-H "Authorization: Bearer sk-proj-1234567890" \
-o output.mp4
LoRA Management#
The server supports dynamic loading, merging, and unmerging of LoRA adapters.
Important Notes:
- Mutual Exclusion: Only one LoRA can be merged (active) at a time
- Switching: To switch LoRAs, you must first `unmerge` the current one, then `set` the new one
- Caching: The server caches loaded LoRA weights in memory. Switching back to a previously loaded LoRA (same path) has little cost
Set LoRA Adapter#
Loads one or more LoRA adapters and merges their weights into the model. Supports both single LoRA (backward compatible) and multiple LoRA adapters.
Endpoint: POST /v1/set_lora
Parameters:
- `lora_nickname` (string or list of strings, required): A unique identifier for the LoRA adapter(s). Can be a single string or a list of strings for multiple LoRAs.
- `lora_path` (string or list of strings/None, optional): Path to the `.safetensors` file(s) or Hugging Face repo ID(s). Required for the first load; optional if re-activating a cached nickname. If a list, must match the length of `lora_nickname`.
- `target` (string or list of strings, optional): Which transformer(s) to apply the LoRA to. If a list, must match the length of `lora_nickname`. Valid values:
  - `"all"` (default): Apply to all transformers
  - `"transformer"`: Apply only to the primary transformer (high noise for Wan2.2)
  - `"transformer_2"`: Apply only to transformer_2 (low noise for Wan2.2)
  - `"critic"`: Apply only to the critic model
- `strength` (float or list of floats, optional): LoRA strength for merge, default 1.0. If a list, must match the length of `lora_nickname`. Values < 1.0 reduce the effect, values > 1.0 amplify the effect.
Single LoRA Example:
curl -X POST http://localhost:30010/v1/set_lora \
-H "Content-Type: application/json" \
-d '{
"lora_nickname": "lora_name",
"lora_path": "/path/to/lora.safetensors",
"target": "all",
"strength": 0.8
}'
Multiple LoRA Example:
curl -X POST http://localhost:30010/v1/set_lora \
-H "Content-Type: application/json" \
-d '{
"lora_nickname": ["lora_1", "lora_2"],
"lora_path": ["/path/to/lora1.safetensors", "/path/to/lora2.safetensors"],
"target": ["transformer", "transformer_2"],
"strength": [0.8, 1.0]
}'
Multiple LoRA with Same Target:
curl -X POST http://localhost:30010/v1/set_lora \
-H "Content-Type: application/json" \
-d '{
"lora_nickname": ["style_lora", "character_lora"],
"lora_path": ["/path/to/style.safetensors", "/path/to/character.safetensors"],
"target": "all",
"strength": [0.7, 0.9]
}'
[!NOTE] When using multiple LoRAs:
- All list parameters (`lora_nickname`, `lora_path`, `target`, `strength`) must have the same length
- If `target` or `strength` is a single value, it will be applied to all LoRAs
- Multiple LoRAs applied to the same target will be merged in order
Merge LoRA Weights#
Manually merges the currently set LoRA weights into the base model.
[!NOTE] `set_lora` automatically performs a merge, so this is typically only needed if you have manually unmerged but want to re-apply the same LoRA without calling `set_lora` again.
Endpoint: POST /v1/merge_lora_weights
Parameters:
- `target` (string, optional): Which transformer(s) to merge. One of `"all"` (default), `"transformer"`, `"transformer_2"`, `"critic"`
- `strength` (float, optional): LoRA strength for merge, default 1.0. Values < 1.0 reduce the effect, values > 1.0 amplify the effect
Curl Example:
curl -X POST http://localhost:30010/v1/merge_lora_weights \
-H "Content-Type: application/json" \
-d '{"strength": 0.8}'
Unmerge LoRA Weights#
Unmerges the currently active LoRA weights from the base model, restoring it to its original state. This must be called before setting a different LoRA.
Endpoint: POST /v1/unmerge_lora_weights
Curl Example:
curl -X POST http://localhost:30010/v1/unmerge_lora_weights \
-H "Content-Type: application/json"
List LoRA Adapters#
Returns loaded LoRA adapters and current application status per module.
Endpoint: GET /v1/list_loras
Curl Example:
curl -sS -X GET "http://localhost:30010/v1/list_loras"
Response Example:
{
"loaded_adapters": [
{ "nickname": "lora_a", "path": "/weights/lora_a.safetensors" },
{ "nickname": "lora_b", "path": "/weights/lora_b.safetensors" }
],
"active": {
"transformer": [
{
"nickname": "lora2",
"path": "tarn59/pixel_art_style_lora_z_image_turbo",
"merged": true,
"strength": 1.0
}
]
}
}
Notes:
- If LoRA is not enabled for the current pipeline, the server will return an error.
- `num_lora_layers_with_weights` counts only layers that have LoRA weights applied for the active adapter.
Example: Switching LoRAs#
Set LoRA A:
curl -X POST http://localhost:30010/v1/set_lora -d '{"lora_nickname": "lora_a", "lora_path": "path/to/A"}'
Generate with LoRA A…
Unmerge LoRA A:
curl -X POST http://localhost:30010/v1/unmerge_lora_weights
Set LoRA B:
curl -X POST http://localhost:30010/v1/set_lora -d '{"lora_nickname": "lora_b", "lora_path": "path/to/B"}'
Generate with LoRA B…
Attention Backends#
This document describes the attention backends available in sglang diffusion (sglang.multimodal_gen) and how to select them.
Overview#
Attention backends are defined by AttentionBackendEnum (sglang.multimodal_gen.runtime.platforms.interface.AttentionBackendEnum) and selected via the CLI flag --attention-backend.
Backend selection is performed by the shared attention layers (e.g. LocalAttention / USPAttention / UlyssesAttention in sglang.multimodal_gen.runtime.layers.attention.layer) and therefore applies to any model component using these layers (e.g. diffusion transformer / DiT and encoders).
CUDA: prefers FlashAttention (FA3/FA4) when supported; otherwise falls back to PyTorch SDPA.
ROCm: uses FlashAttention when available; otherwise falls back to PyTorch SDPA.
MPS: always uses PyTorch SDPA.
Backend options#
The CLI accepts the lowercase names of AttentionBackendEnum. The table below lists the backends implemented by the built-in platforms. fa3/fa4 are accepted as aliases for fa.
| CLI value | Enum value | Notes |
|---|---|---|
| `fa` |  | FlashAttention. |
| `torch_sdpa` |  | PyTorch SDPA. |
| `sliding_tile_attn` |  | Sliding Tile Attention (STA). Requires `SGLANG_DIFFUSION_ATTENTION_CONFIG` (see the STA section below). |
|  |  | Requires |
|  |  | Requires SageAttention3 installed per upstream instructions. |
|  |  | Requires |
|  |  | Requires |
|  |  | Requires |
Selection priority#
The selection order in runtime/layers/attention/selector.py is:
1. `global_force_attn_backend(...)` / `global_force_attn_backend_context_manager(...)`
2. CLI `--attention-backend` (`ServerArgs.attention_backend`)
3. Auto selection (platform capability, dtype, and installed packages)
Platform support matrix#
| Backend | CUDA | ROCm | MPS | Notes |
|---|---|---|---|---|
| `fa` | ✅ | ✅ | ❌ | CUDA requires SM80+ and fp16/bf16. FlashAttention is only used when the required runtime is installed; otherwise it falls back to `torch_sdpa`. |
| `torch_sdpa` | ✅ | ✅ | ✅ | Most compatible option across platforms. |
|  | ✅ | ❌ | ❌ | CUDA-only. Requires |
|  | ✅ | ❌ | ❌ | CUDA-only (optional dependency). |
|  | ✅ | ❌ | ❌ | CUDA-only (optional dependency). |
|  | ✅ | ❌ | ❌ | CUDA-only. Requires |
|  | ✅ | ❌ | ❌ | CUDA-only. Requires |
|  | ✅ | ❌ | ❌ | Requires |
Usage#
Select a backend via CLI#
sglang generate \
--model-path <MODEL_PATH_OR_ID> \
--prompt "..." \
--attention-backend fa
sglang generate \
--model-path <MODEL_PATH_OR_ID> \
--prompt "..." \
--attention-backend torch_sdpa
Using Sliding Tile Attention (STA)#
export SGLANG_DIFFUSION_ATTENTION_CONFIG=/abs/path/to/mask_strategy.json
sglang generate \
--model-path <MODEL_PATH_OR_ID> \
--prompt "..." \
--attention-backend sliding_tile_attn
Notes for ROCm / MPS#
- ROCm: use `--attention-backend torch_sdpa` or `fa`, depending on what is available in your environment.
- MPS: the platform implementation always uses `torch_sdpa`.
Cache-DiT Acceleration#
SGLang integrates Cache-DiT, a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 7.4x inference speedup with minimal quality loss.
Overview#
Cache-DiT uses intelligent caching strategies to skip redundant computation in the denoising loop:
DBCache (Dual Block Cache): Dynamically decides when to cache transformer blocks based on residual differences
TaylorSeer: Uses Taylor expansion for calibration to optimize caching decisions
SCM (Step Computation Masking): Step-level caching control for additional speedup
Basic Usage#
Enable Cache-DiT by exporting the environment variable and using `sglang generate` or `sglang serve`:
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A beautiful sunset over the mountains"
Advanced Configuration#
DBCache Parameters#
DBCache controls block-level caching behavior:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Fn | `SGLANG_CACHE_DIT_FN` | 1 | Number of first blocks to always compute |
| Bn | `SGLANG_CACHE_DIT_BN` | 0 | Number of last blocks to always compute |
| W | `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching starts |
| R | `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold |
| MC | `SGLANG_CACHE_DIT_MC` | 3 | Maximum continuous cached steps |
TaylorSeer Configuration#
TaylorSeer improves caching accuracy using Taylor expansion:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Enable | `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator |
| Order | `SGLANG_CACHE_DIT_TS_ORDER` | 1 | Taylor expansion order (1 or 2) |
Combined Configuration Example#
DBCache and TaylorSeer are complementary strategies that work together; you can configure both sets of parameters simultaneously:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
sglang generate --model-path black-forest-labs/FLUX.1-dev \
--prompt "A curious raccoon in a forest"
SCM (Step Computation Masking)#
SCM provides step-level caching control for additional speedup. It decides which denoising steps to compute fully and which to use cached results.
SCM Presets#
SCM is configured with presets:
| Preset | Compute Ratio | Speed | Quality |
|---|---|---|---|
| `none` | 100% | Baseline | Best |
| `slow` | ~75% | ~1.3x | High |
| `medium` | ~50% | ~2x | Good |
| `fast` | ~35% | ~3x | Acceptable |
| `ultra` | ~25% | ~4x | Lower |
Usage#
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_PRESET=medium \
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A futuristic cityscape at sunset"
Custom SCM Bins#
For fine-grained control over which steps to compute vs cache:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_COMPUTE_BINS="8,3,3,2,2" \
SGLANG_CACHE_DIT_SCM_CACHE_BINS="1,2,2,2,3" \
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A futuristic cityscape at sunset"
SCM Policy#
| Policy | Env Variable | Description |
|---|---|---|
| `dynamic` |  | Adaptive caching based on content (default) |
|  |  | Fixed caching pattern |
Environment Variables#
All Cache-DiT parameters can be set via the following environment variables:
| Environment Variable | Default | Description |
|---|---|---|
| `SGLANG_CACHE_DIT_ENABLED` | false | Enable Cache-DiT acceleration |
| `SGLANG_CACHE_DIT_FN` | 1 | First N blocks to always compute |
| `SGLANG_CACHE_DIT_BN` | 0 | Last N blocks to always compute |
| `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching |
| `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold |
| `SGLANG_CACHE_DIT_MC` | 3 | Max continuous cached steps |
| `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator |
| `SGLANG_CACHE_DIT_TS_ORDER` | 1 | TaylorSeer order (1 or 2) |
| `SGLANG_CACHE_DIT_SCM_PRESET` | none | SCM preset (none/slow/medium/fast/ultra) |
|  | dynamic | SCM caching policy |
| `SGLANG_CACHE_DIT_SCM_COMPUTE_BINS` | not set | Custom SCM compute bins |
| `SGLANG_CACHE_DIT_SCM_CACHE_BINS` | not set | Custom SCM cache bins |
Supported Models#
Cache-DiT acceleration works with almost all models supported by SGLang Diffusion:
| Model Family | Example Models |
|---|---|
| Wan | Wan2.1, Wan2.2 |
| Flux | FLUX.1-dev, FLUX.2-dev, FLUX.2-Klein |
| Z-Image | Z-Image-Turbo |
| Qwen | Qwen-Image, Qwen-Image-Edit |
| GLM | GLM-Image |
| Hunyuan | HunyuanVideo |
Performance Tips#
- Start with defaults: The default parameters work well for most models
- Use TaylorSeer: It typically improves both speed and quality
- Tune the R threshold: Lower values = better quality, higher values = faster
- SCM for extra speed: Use the `medium` preset for a good speed/quality balance (see the example below)
- Warmup matters: Higher warmup = more stable caching decisions
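As a sketch of these tips, the following run combines TaylorSeer with the medium SCM preset; the model and prompt are illustrative, and the settings are starting points rather than tuned values:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_SCM_PRESET=medium \
sglang generate --model-path Qwen/Qwen-Image \
  --prompt "A futuristic cityscape at sunset"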
Limitations#
- Single GPU only: Distributed support (TP/SP) is not yet validated; Cache-DiT will be automatically disabled when `world_size > 1`
- SCM minimum steps: SCM requires >= 8 inference steps to be effective
- Model support: Only models registered in Cache-DiT's BlockAdapterRegister are supported
Troubleshooting#
Distributed environment warning#
WARNING: cache-dit is disabled in distributed environment (world_size=N)
This is expected behavior. Cache-DiT currently only supports single-GPU inference.
SCM disabled for low step count#
For models with < 8 inference steps (e.g., DMD distilled models), SCM will be automatically disabled. DBCache acceleration still works.
Profiling Multimodal Generation#
This guide covers profiling techniques for multimodal generation pipelines in SGLang.
PyTorch Profiler#
PyTorch Profiler provides detailed kernel execution time, call stack, and GPU utilization metrics.
Denoising Stage Profiling#
Profile the denoising stage with sampled timesteps (default: 5 steps after 1 warmup step):
sglang generate \
--model-path Qwen/Qwen-Image \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--seed 0 \
--profile
Parameters:
- `--profile`: Enable profiling for the denoising stage
- `--num-profiled-timesteps N`: Number of timesteps to profile after warmup (default: 5). Smaller values reduce trace file size.

Example: `--num-profiled-timesteps 10` profiles 10 steps after 1 warmup step.
Full Pipeline Profiling#
Profile all pipeline stages (text encoding, denoising, VAE decoding, etc.):
sglang generate \
--model-path Qwen/Qwen-Image \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--seed 0 \
--profile \
--profile-all-stages
Parameters:
- `--profile-all-stages`: Used with `--profile`; profiles all pipeline stages instead of just the denoising stage
Output Location#
By default, trace files are saved in the ./logs/ directory.
The exact output file path will be shown in the console output, for example:
[mm-dd hh:mm:ss] Saved profiler traces to: /sgl-workspace/sglang/logs/mocked_fake_id_for_offline_generate-5_steps-global-rank0.trace.json.gz
View Traces#
Load and visualize trace files at:
https://ui.perfetto.dev/ (recommended)
chrome://tracing (Chrome only)
For large trace files, reduce --num-profiled-timesteps or avoid using --profile-all-stages.
--perf-dump-path (Stage/Step Timing Dump)#
Besides profiler traces, you can also dump a lightweight JSON report that contains:
stage-level timing breakdown for the full pipeline
step-level timing breakdown for the denoising stage (per diffusion step)
This is useful to quickly identify which stage dominates end-to-end latency, and whether denoising steps have uniform runtimes (and if not, which step has an abnormal spike).
The dumped JSON contains a denoise_steps_ms field formatted as an array of objects, each with a step key (the step index) and a duration_ms key.
Example:
sglang generate \
--model-path <MODEL_PATH_OR_ID> \
--prompt "<PROMPT>" \
--perf-dump-path perf.json
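Based on the field description above, the denoise_steps_ms entry of the dumped report looks roughly like the following (illustrative values only; the surrounding stage-level fields are omitted):
"denoise_steps_ms": [
  { "step": 0, "duration_ms": 812.4 },
  { "step": 1, "duration_ms": 355.1 },
  { "step": 2, "duration_ms": 353.9 }
]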
Nsight Systems#
Nsight Systems provides low-level CUDA profiling with kernel details, register usage, and memory access patterns.
Installation#
See the SGLang profiling guide for installation instructions.
Basic Profiling#
Profile the entire pipeline execution:
nsys profile \
--trace-fork-before-exec=true \
--cuda-graph-trace=node \
--force-overwrite=true \
-o QwenImage \
sglang generate \
--model-path Qwen/Qwen-Image \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--seed 0
Targeted Stage Profiling#
Use --delay and --duration to capture specific stages and reduce file size:
nsys profile \
--trace-fork-before-exec=true \
--cuda-graph-trace=node \
--force-overwrite=true \
--delay 10 \
--duration 30 \
-o QwenImage_denoising \
sglang generate \
--model-path Qwen/Qwen-Image \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--seed 0
Parameters:
- `--delay N`: Wait N seconds before starting capture (skip initialization overhead)
- `--duration N`: Capture for N seconds (focus on specific stages)
- `--force-overwrite`: Overwrite existing output files
Notes#
- Reduce trace size: Use `--num-profiled-timesteps` with smaller values, or `--delay`/`--duration` with Nsight Systems
- Stage-specific analysis: Use `--profile` alone for the denoising stage; add `--profile-all-stages` for the full pipeline
- Multiple runs: Profile with different prompts and resolutions to identify bottlenecks across workloads
FAQ#
If you are profiling `sglang generate` with Nsight Systems and find that the generated profiler file did not capture any CUDA kernels, you can resolve this issue by increasing the model's inference steps to extend the execution time.
Contributing to SGLang Diffusion#
This guide outlines the requirements for contributing to the SGLang Diffusion module (sglang.multimodal_gen).
1. Commit Message Convention#
We follow a structured commit message format to maintain a clean history.
Format:
[diffusion] <scope>: <subject>
Examples:
- `[diffusion] cli: add --perf-dump-path argument`
- `[diffusion] scheduler: fix deadlock in batch processing`
- `[diffusion] model: support Stable Diffusion 3.5`
Rules:
- Prefix: Always start with `[diffusion]`.
- Scope (optional): `cli`, `scheduler`, `model`, `pipeline`, `docs`, etc.
- Subject: Imperative mood, short and clear (e.g., "add feature", not "added feature").
2. Performance Reporting#
For PRs that impact latency, throughput, or memory usage, you should provide a performance comparison report.
How to Generate a Report#
Baseline: run the benchmark (for a single generation task)
$ sglang generate --model-path <model> --prompt "A benchmark prompt" --perf-dump-path baseline.json
New: with your change applied, run the same benchmark, without modifying any server_args or sampling_params
$ sglang generate --model-path <model> --prompt "A benchmark prompt" --perf-dump-path new.json
Compare: run the compare script, which will print a Markdown table to the console
$ python python/sglang/multimodal_gen/benchmarks/compare_perf.py baseline.json new.json [new2.json ...]
# The script prints a Markdown report to the console, starting with "### Performance Comparison Report"
Paste: paste the table into the PR description
3. CI-Based Change Protection#
Consider adding tests to the pr-test or nightly-test suites to safeguard your changes, especially for PRs that:
support a new model
support or fix important features
significantly improve performance
See the test directory for examples.
How to Support New Diffusion Models#
SGLang diffusion uses a modular pipeline architecture built around two key concepts:
- ComposedPipeline: Orchestrates PipelineStages to define the complete generation process
- PipelineStage: Modular components (prompt encoding, denoising loop, VAE decoding, etc.)
To add a new model, you’ll need to define:
- PipelineConfig: Static model configurations (paths, precision settings)
- SamplingParams: Runtime generation parameters (prompt, guidance_scale, steps)
- ComposedPipeline: Chains together pipeline stages
- Modules: Model components (text_encoder, transformer, vae, scheduler)
For the complete implementation guide with examples, see: How to Support New Diffusion Models