# Attention Backends

This document describes the attention backends available in SGLang diffusion (`sglang.multimodal_gen`) and how to select them.

## Overview

Attention backends are defined by `AttentionBackendEnum` (`sglang.multimodal_gen.runtime.platforms.interface.AttentionBackendEnum`) and selected via the CLI flag `--attention-backend`.

Backend selection is performed by the shared attention layers (e.g. `LocalAttention` / `USPAttention` / `UlyssesAttention` in `sglang.multimodal_gen.runtime.layers.attention.layer`) and therefore applies to any model component that uses these layers (e.g. the diffusion transformer / DiT and the encoders).

When using the diffusers backend, `--attention-backend` is passed through to diffusers' `set_attention_backend`, so it takes diffusers backend names (e.g. `flash`, `_flash_3_hub`, `sage`, `xformers`, `native`) rather than the enum names below.
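For example, assuming a run that already uses the diffusers backend, a diffusers backend name goes directly in the flag (`sage` below is just one of the accepted names):

```bash
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend sage
```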

For SGLang-native pipelines, auto selection picks a platform default when no backend is specified:

- CUDA: prefers FlashAttention (FA3/FA4) when supported; otherwise falls back to PyTorch SDPA.
- ROCm: uses FlashAttention when available; otherwise falls back to PyTorch SDPA.
- MPS: always uses PyTorch SDPA.

## Backend options

For SGLang-native pipelines, the CLI accepts the lowercase names of `AttentionBackendEnum`. The table below lists the backends implemented by the built-in platforms; `fa3` and `fa4` are accepted as aliases for `fa`.

| CLI value | Enum value | Notes |
| --- | --- | --- |
| `fa` / `fa3` / `fa4` | `FA` | FlashAttention. `fa3`/`fa4` are normalized to `fa` during argument parsing (`ServerArgs.__post_init__`). |
| `torch_sdpa` | `TORCH_SDPA` | PyTorch `scaled_dot_product_attention`. |
| `sliding_tile_attn` | `SLIDING_TILE_ATTN` | Sliding Tile Attention (STA). Requires `st_attn`. Configure via `--attention-backend-config`. |
| `sage_attn` | `SAGE_ATTN` | Requires `sageattention`. Upstream SageAttention CUDA extensions target SM80/SM86/SM89/SM90/SM120 (compute capability 8.0/8.6/8.9/9.0/12.0); see the upstream [setup.py](https://github.com/thu-ml/SageAttention/blob/main/setup.py). |
| `sage_attn_3` | `SAGE_ATTN_3` | Requires SageAttention3, installed per upstream instructions. |
| `video_sparse_attn` | `VIDEO_SPARSE_ATTN` | Requires `vsa`. Configure sparsity via `--attention-backend-config`. |
| `vmoba_attn` | `VMOBA_ATTN` | Requires `kernel.attn.vmoba_attn.vmoba`. Configure via `--attention-backend-config`. |
| `aiter` | `AITER` | Requires `aiter`. |

## Selection priority

The selection order in `runtime/layers/attention/selector.py` is:

1. `global_force_attn_backend(...)` / `global_force_attn_backend_context_manager(...)`
2. The CLI flag `--attention-backend` (`ServerArgs.attention_backend`)
3. Auto selection (based on platform capability, dtype, and installed packages)
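For example, passing the flag explicitly (priority 2) overrides auto selection (priority 3), forcing SDPA even on a machine where FlashAttention would otherwise be chosen:

```bash
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend torch_sdpa
```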

## Configuration

Some backends require additional configuration. You can pass these parameters via `--attention-backend-config`, which accepts any of the following (equivalent example invocations are shown after the list):

- A path to a JSON or YAML configuration file.
- A JSON string (e.g. `'{"sparsity": 0.5}'`).
- Comma-separated key-value pairs (e.g. `"sparsity=0.5,enable_x=true"`).
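For example, the following three invocations pass the same setting in each accepted form (the config file path is a placeholder, and the `sparsity` value is illustrative):

```bash
# 1) Path to a JSON/YAML config file whose contents are: {"sparsity": 0.5}
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend video_sparse_attn \
  --attention-backend-config /abs/path/to/attn_config.json

# 2) Inline JSON string
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend video_sparse_attn \
  --attention-backend-config '{"sparsity": 0.5}'

# 3) Key-value pairs
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend video_sparse_attn \
  --attention-backend-config "sparsity=0.5"
```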

### Supported configuration parameters

#### Sliding Tile Attention (`sliding_tile_attn`)

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `mask_strategy_file_path` | `str` | **Required.** Path to the mask strategy JSON file. | - |
| `sta_mode` | `str` | Mode of STA. | `STA_inference` |
| `skip_time_steps` | `int` | Number of steps to use full attention before switching to sparse attention. | `15` |
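As a sketch, these parameters can be combined in the JSON-string form (the mask file path is a placeholder; `sta_mode` and `skip_time_steps` are shown at their defaults):

```bash
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend sliding_tile_attn \
  --attention-backend-config '{"mask_strategy_file_path": "/abs/path/to/mask_strategy.json", "sta_mode": "STA_inference", "skip_time_steps": 15}'
```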

#### Video Sparse Attention (`video_sparse_attn`)

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `sparsity` | `float` | Validation sparsity (0.0–1.0). | `0.0` |

#### V-MoBA (`vmoba_attn`)

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `temporal_chunk_size` | `int` | Chunk size for the temporal dimension. | - |
| `temporal_topk` | `int` | Top-k tokens to select in the temporal dimension. | - |
| `spatial_chunk_size` | `list[int]` | Chunk size for the spatial dimensions (H, W). | - |
| `spatial_topk` | `int` | Top-k tokens to select in the spatial dimensions. | - |
| `st_chunk_size` | `list[int]` | Chunk size for the spatiotemporal dimensions (T, H, W). | - |
| `st_topk` | `int` | Top-k tokens to select in the spatiotemporal dimensions. | - |
| `moba_select_mode` | `str` | Selection mode (e.g. `threshold`). | `threshold` |
| `moba_threshold` | `float` | Threshold value for selection. | `0.25` |
| `moba_threshold_type` | `str` | Type of thresholding (e.g. `query_head`). | `query_head` |
| `first_full_step` | `int` | Number of initial steps to use full attention. | `12` |
| `first_full_layer` | `int` | Number of initial layers to use full attention. | `0` |
| `temporal_layer` | `int` | Number of temporal layers. | `1` |
| `spatial_layer` | `int` | Number of spatial layers. | `1` |
| `st_layer` | `int` | Number of spatiotemporal layers. | `1` |
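A sketch using the JSON-string form (every value below is illustrative, not a recommendation; appropriate chunk sizes and top-k values depend on the model and resolution):

```bash
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend vmoba_attn \
  --attention-backend-config '{"temporal_chunk_size": 4, "temporal_topk": 2, "spatial_chunk_size": [8, 8], "spatial_topk": 2, "st_chunk_size": [4, 8, 8], "st_topk": 2}'
```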

## Platform support matrix

| Backend | CUDA | ROCm | MPS | Notes |
| --- | --- | --- | --- | --- |
| `fa` | ✅ | ✅ | ❌ | CUDA requires SM80+ and fp16/bf16. FlashAttention is only used when the required runtime is installed; otherwise it falls back to `torch_sdpa`. |
| `torch_sdpa` | ✅ | ✅ | ✅ | Most compatible option across platforms. |
| `sliding_tile_attn` | ✅ | ❌ | ❌ | CUDA-only. Requires `st_attn`. Configure via `--attention-backend-config`. |
| `sage_attn` | ✅ | ❌ | ❌ | CUDA-only (optional dependency). |
| `sage_attn_3` | ✅ | ❌ | ❌ | CUDA-only (optional dependency). |
| `video_sparse_attn` | ✅ | ❌ | ❌ | CUDA-only. Requires `vsa`. Configure sparsity via `--attention-backend-config`. |
| `vmoba_attn` | ✅ | ❌ | ❌ | CUDA-only. Requires `kernel.attn.vmoba_attn.vmoba`. Configure via `--attention-backend-config`. |
| `aiter` | ❌ | ✅ | ❌ | ROCm-only. Requires `aiter`. |

## Usage

### Select a backend via CLI

```bash
# FlashAttention
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend fa

# PyTorch SDPA
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend torch_sdpa
```

### Using Sliding Tile Attention (STA)

```bash
# Pass the mask strategy file path via config
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend sliding_tile_attn \
  --attention-backend-config "mask_strategy_file_path=/abs/path/to/mask_strategy.json"
```

### Notes for ROCm / MPS

- ROCm: use `--attention-backend torch_sdpa` or `fa`, depending on what is available in your environment.
- MPS: the platform implementation always uses `torch_sdpa`.