CUDA Graph for Multi-Modal Encoder in SGLang#

Motivation#

In multimodal reasoning services, the visual encoder (ViT / Vision Transformer) typically has a few characteristic traits:

Many layers, fragmented operators: Each layer includes LN, QKV projections, attention, MLP, residual connections, etc., resulting in extremely frequent kernel launches.

Server-side “small batch / low latency” is common: The batch size is very small (sometimes it looks like 1 after “flattening” the batch), so kernel launch overhead accounts for a large portion of end-to-end latency.

Input token count (number of patches) varies frequently: Different image/video resolutions and different batch compositions lead to different sequence lengths S, and this is precisely the biggest obstacle for CUDA Graph (unstable shapes).

The value of CUDA Graph: It captures a long sequence of GPU kernels with fixed shapes and fixed memory addresses into a graph; later, for the same shapes, it can replay the graph directly, dramatically reducing launch overhead and making GPU scheduling more compact.

This led us to add a CUDA Graph enabled feature for ViT in order to improve its performance.

Design and Restrictions#

The new CUDA Graph enabled ViT logic is built on ViTCudaGraphRunner. This runner captures the “blocks + merger + deepstack merger (optional)” part of a vision transformer into a CUDA graph and replays it for identical shapes. See the following design considerations and restrictions for more details.

Dynamic inputs to fit static constraints of CUDA Graph#

Variable sequence lengths S are very common in ViT, while CUDA Graph requires fixed shapes. The solution is to build a graph cache keyed by S (e.g., graph_key = S): the first time a new S appears, capture a graph for it; afterwards, replay that graph.

If there are many distinct S values, VRAM usage grows, since each captured graph keeps its own graph-private memory pool.
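The caching pattern can be sketched as follows. This is not the ViTCudaGraphRunner code itself; it is a minimal illustration of capture-on-miss / replay-on-hit keyed by S, using PyTorch's torch.cuda.CUDAGraph API, with a hypothetical run_blocks callable standing in for the captured ViT region.

import torch

_graph_cache = {}  # graph_key = S  ->  (graph, static_input, static_output)

def run_blocks_with_graph(x, run_blocks):
    """Capture a graph the first time a sequence length S is seen; replay afterwards."""
    s = x.shape[0]
    if s not in _graph_cache:
        # First time this S appears: pin a static input buffer and capture.
        static_input = x.clone()
        # Real code would warm up run_blocks on a side stream before capturing.
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_output = run_blocks(static_input)
        _graph_cache[s] = (graph, static_input, static_output)

    graph, static_input, static_output = _graph_cache[s]
    static_input.copy_(x)  # only buffer contents change; addresses stay fixed
    graph.replay()
    return static_output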

Stable addresses#

Everything “parameter-like” becomes a static buffer:

  • block_input / block_ws / block_output

  • cu_full_len / cu_window_len and their kk variants

  • sin_cos_ws

This satisfies the underlying requirement of graph replay: tensors must not be swapped for new ones during replay; only the contents of the captured tensors may be modified.
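As a hedged illustration only (the buffer names follow the list above, but the snippet is not the runner's actual code), feeding a replay means copying new values into the captured buffers in place, never rebinding the names to fresh tensors:

import torch

def update_and_replay(graph, block_input, cu_full_len, sin_cos_ws,
                      hidden_states, new_cu_full_len, new_sin_cos, seq_len):
    # Correct: write new values into the captured static buffers in place.
    block_input.copy_(hidden_states)
    cu_full_len.copy_(new_cu_full_len)
    sin_cos_ws[:seq_len].copy_(new_sin_cos)
    graph.replay()
    # Incorrect alternative: `block_input = hidden_states` would rebind the
    # name to a tensor at a different address, which the captured graph
    # never sees, so the replay would silently use stale data.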

Attention backend arguments#

Attention backend arguments are fixed inside the graph:

  • TritonAttn expects [cu_seqlens, cu_seqlens_kk, max_len]

  • FA3 expects [cu_seqlens, max_len]

max_len is frozen as an int constant. cu_seqlens is cached into a dict during create_graph(), and its contents are not updated during subsequent replays.

For the same graph_key = S, it is not enough for the input shape to match; the segmentation pattern in cu_seqlens (and the window seqlens) must also be identical. Otherwise, attention will segment the sequence incorrectly.
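The following guard is purely illustrative (it is not part of the feature); it only spells out the constraint that, before replaying the graph cached under graph_key = S, the incoming segmentation must match the one frozen at capture time.

import torch

def can_replay(cached_cu_seqlens: torch.Tensor, new_cu_seqlens: torch.Tensor) -> bool:
    # Same number of segments and identical boundaries as at capture time;
    # otherwise the frozen cu_seqlens inside the graph would split the
    # sequence differently from what the new batch actually contains.
    return (cached_cu_seqlens.shape == new_cu_seqlens.shape
            and torch.equal(cached_cu_seqlens, new_cu_seqlens))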

Rotary buffer management#

The feature reallocates a larger sin_cos_ws buffer when seq_len increases. max_content_len bounds the maximum size of the allocated rotary buffer.
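A minimal sketch of this grow-only policy, assuming a workspace laid out as [seq_len, head_dim] (the real layout and any names beyond sin_cos_ws and max_content_len may differ):

import torch

def ensure_sin_cos_ws(sin_cos_ws, seq_len, head_dim, max_content_len,
                      device="cuda", dtype=torch.float32):
    # max_content_len bounds how large the rotary workspace may ever grow.
    assert seq_len <= max_content_len, "seq_len exceeds max_content_len"
    if sin_cos_ws is None or sin_cos_ws.shape[0] < seq_len:
        # Grow-only reallocation: a larger workspace replaces the old one
        # when a longer sequence arrives.
        sin_cos_ws = torch.empty(seq_len, head_dim, device=device, dtype=dtype)
    return sin_cos_ws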

Command Example#

You can enable CUDA Graph for ViT by setting the environment variable SGLANG_VIT_ENABLE_CUDA_GRAPH=1, for example:

SGLANG_VIT_ENABLE_CUDA_GRAPH=1 \
python3 -m sglang.launch_server \
  --model Qwen/Qwen3-VL-8B-Instruct

Or you can run CUDA Graph for ViT together with the Piecewise CUDA Graph feature by setting both the environment variable SGLANG_VIT_ENABLE_CUDA_GRAPH=1 and the --enable-piecewise-cuda-graph flag, for example:

SGLANG_VIT_ENABLE_CUDA_GRAPH=1 \
python3 -m sglang.launch_server \
  --model Qwen/Qwen3-VL-8B-Instruct \
  --piecewise-cuda-graph-max-tokens 4096 \
  --enable-piecewise-cuda-graph \
  --piecewise-cuda-graph-compiler eager

Known supported models#

  • Qwen2.5-VL (https://github.com/sgl-project/sglang/pull/14422)

  • Qwen3-VL (https://github.com/sgl-project/sglang/pull/15320)