# Attention Backends
This document describes the attention backends available in sglang diffusion (`sglang.multimodal_gen`) and how to select them.
## Overview
Attention backends are defined by `AttentionBackendEnum` (`sglang.multimodal_gen.runtime.platforms.interface.AttentionBackendEnum`) and selected via the CLI flag `--attention-backend`.

Backend selection is performed by the shared attention layers (e.g., `LocalAttention` / `USPAttention` / `UlyssesAttention` in `sglang.multimodal_gen.runtime.layers.attention.layer`) and therefore applies to any model component using these layers (e.g., the diffusion transformer / DiT and encoders).
When using the diffusers backend, `--attention-backend` is passed through to diffusers’ `set_attention_backend` (e.g., `flash`, `_flash_3_hub`, `sage`, `xformers`, `native`).
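For illustration, a hypothetical run that forwards a diffusers-style backend name. This assumes your run is already using the diffusers backend, in which case the value is passed through rather than matched against the table in the next section:

```bash
# Hypothetical: with the diffusers backend, the value below is handed to
# diffusers' set_attention_backend as-is.
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend _flash_3_hub
```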
Default (automatic) backend selection by platform:

- CUDA: prefers FlashAttention (FA3/FA4) when supported; otherwise falls back to PyTorch SDPA.
- ROCm: uses FlashAttention when available; otherwise falls back to PyTorch SDPA.
- MPS: always uses PyTorch SDPA.
## Backend options
For SGLang-native pipelines, the CLI accepts the lowercase names of `AttentionBackendEnum`. The table below lists the backends implemented by the built-in platforms. `fa3`/`fa4` are accepted as aliases for `fa` (see the example after the table).
| CLI value | Enum value | Notes |
|---|---|---|
| `fa` | `FA` | FlashAttention. |
| `torch_sdpa` | `TORCH_SDPA` | PyTorch SDPA (`scaled_dot_product_attention`). |
| `sliding_tile_attn` | `SLIDING_TILE_ATTN` | Sliding Tile Attention (STA). Requires … |
| … | … | Requires … |
| … | … | Requires SageAttention3 installed per upstream instructions. |
| … | … | Requires … |
| … | … | Requires … |
| … | … | Requires … |
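For example, the FlashAttention aliases all select the same backend:

```bash
# fa3 and fa4 are accepted as aliases for fa; all three select FlashAttention.
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend fa3
```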
## Selection priority

The selection order in `runtime/layers/attention/selector.py` is:

1. `global_force_attn_backend(...)` / `global_force_attn_backend_context_manager(...)`
2. CLI `--attention-backend` (`ServerArgs.attention_backend`)
3. Auto selection (platform capability, dtype, and installed packages)
## Configuration

Some backends require additional configuration. You can pass these parameters via `--attention-backend-config`. This argument accepts:

- A path to a JSON or YAML configuration file.
- A JSON string (e.g., `'{"sparsity": 0.5}'`).
- Key-value pairs (e.g., `"sparsity=0.5,enable_x=true"`).
### Supported Configuration Parameters

#### Sliding Tile Attention (`sliding_tile_attn`)
| Parameter | Type | Description | Default |
|---|---|---|---|
| `mask_strategy_file_path` | `str` | Required. Path to the mask strategy JSON file. | - |
| … | … | Mode of STA. | … |
| … | … | Number of steps to use full attention before switching to sparse attention. | … |
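The options can also be supplied as a configuration file. A sketch using a hypothetical `sta_config.yaml` that sets the one parameter named above:

```bash
# Write the STA options to a YAML file, then pass the file path to the CLI.
cat > sta_config.yaml <<'EOF'
mask_strategy_file_path: /abs/path/to/mask_strategy.json
EOF

sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend sliding_tile_attn \
  --attention-backend-config sta_config.yaml
```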
#### Video Sparse Attention (`video_sparse_attn`)
| Parameter | Type | Description | Default |
|---|---|---|---|
| `sparsity` | `float` | Validation sparsity (0.0 - 1.0). | … |
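A hypothetical invocation, assuming `sparsity` is the key used in the configuration examples above:

```bash
# Run with video sparse attention; the key name is assumed from the
# --attention-backend-config examples in this document.
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend video_sparse_attn \
  --attention-backend-config "sparsity=0.5"
```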
#### V-MoBA (`vmoba_attn`)
| Parameter | Type | Description | Default |
|---|---|---|---|
| … | … | Chunk size for temporal dimension. | - |
| … | … | Top-K tokens to select in temporal dimension. | - |
| … | … | Chunk size for spatial dimension (H, W). | - |
| … | … | Top-K tokens to select in spatial dimension. | - |
| … | … | Chunk size for spatiotemporal dimension (T, H, W). | - |
| … | … | Top-K tokens to select in spatiotemporal dimension. | - |
| … | … | Selection mode (e.g., …). | … |
| … | … | Threshold value for selection. | … |
| … | … | Type of thresholding (e.g., …). | … |
| … | … | Number of initial steps to use full attention. | … |
| … | … | Number of initial layers to use full attention. | … |
| … | … | Number of temporal layers. | … |
| … | … | Number of spatial layers. | … |
| … | … | Number of spatiotemporal layers. | … |
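Because V-MoBA exposes many options, a configuration file is often the easiest way to pass them. A sketch with a hypothetical file path; fill the file with the parameters from the table above, using the names defined by your installation:

```bash
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend vmoba_attn \
  --attention-backend-config /path/to/vmoba_config.yaml
```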
## Platform support matrix
| Backend | CUDA | ROCm | MPS | Notes |
|---|---|---|---|---|
| `fa` | ✅ | ✅ | ❌ | CUDA requires SM80+ and fp16/bf16. FlashAttention is only used when the required runtime is installed; otherwise it falls back to `torch_sdpa`. |
| `torch_sdpa` | ✅ | ✅ | ✅ | Most compatible option across platforms. |
| … | ✅ | ❌ | ❌ | CUDA-only. Requires … |
| … | ✅ | ❌ | ❌ | CUDA-only (optional dependency). |
| … | ✅ | ❌ | ❌ | CUDA-only (optional dependency). |
| … | ✅ | ❌ | ❌ | CUDA-only. Requires … |
| … | ✅ | ❌ | ❌ | CUDA-only. Requires … |
| … | ✅ | ❌ | ❌ | Requires … |
## Usage

### Select a backend via CLI
```bash
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend fa
```

```bash
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend torch_sdpa
```
### Using Sliding Tile Attention (STA)

```bash
# Pass the mask strategy file path via config
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend sliding_tile_attn \
  --attention-backend-config "mask_strategy_file_path=/abs/path/to/mask_strategy.json"
```
### Notes for ROCm / MPS

- ROCm: use `--attention-backend torch_sdpa` or `fa`, depending on what is available in your environment.
- MPS: the platform implementation always uses `torch_sdpa`.