Motivation
Standard CUDA graphs capture an entire forward pass as a single, opaque graph. This is great for performance, but it creates two problems:

- Debugging is hard. When something goes wrong inside a captured graph (wrong outputs, numerical mismatches, crashes), there is no way to step through the operations or insert print statements, because the graph replays as a monolithic unit.
- Some ops are incompatible. Certain operations — dynamic control flow, host-device synchronization, JIT compilation, or ops that change behavior across iterations — cannot be captured into a CUDA graph at all. Today, the only workaround is to disable CUDA graphs entirely, which sacrifices the kernel launch overhead savings for the rest of the model.
Usage
Debug Mode: Run Everything Eagerly
The simplest use case is debugging. The `--debug-cuda-graph` flag wraps the entire decode forward pass in a graph break, so every operation runs eagerly while still going through the full CUDA graph capture/replay code path. This lets you debug CUDA graph issues without changing model code.
Selective Graph Breaks in Model Code
For production use, you can mark specific functions as “non-graphable” using the `@eager_on_graph` decorator. During CUDA graph capture, these functions run eagerly between captured graph segments. Outside of capture, they behave normally.
As an alternative to the decorator, the `break_graph()` helper inserts a graph break inline at the call site:
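The real `eager_on_graph` and `break_graph` live in SGLang's breakable CUDA graph module and interact with the capture machinery; their exact signatures are not shown here. The following is a minimal, self-contained sketch with stand-in definitions, only to illustrate where the two APIs would sit in model code:

```python
# Stand-in definitions for illustration only -- the real implementations
# end/begin CUDA capture segments around the eager region.
_capturing = False  # would be managed by the capture machinery

def eager_on_graph(fn):
    """Mark fn as non-graphable: during capture it runs eagerly between segments."""
    def wrapper(*args, **kwargs):
        if _capturing:
            pass  # real code: end current segment, run fn eagerly, begin next segment
        return fn(*args, **kwargs)
    return wrapper

def break_graph():
    """Insert an explicit graph break at the call site during capture."""
    if _capturing:
        pass  # real code: end segment / begin next segment here

@eager_on_graph
def dynamic_routing(x):
    # e.g. data-dependent control flow that cannot be captured into a CUDA graph
    return x * 2

def forward(x):
    y = dynamic_routing(x)   # runs eagerly during capture
    break_graph()            # explicit break between captured segments
    return y + 1
```

Outside of capture (as here), both APIs are no-ops and the functions behave normally.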
Server Args
| Argument | Default | Description |
|---|---|---|
| `--debug-cuda-graph` | `False` | Enable debug/eager mode. Wraps the entire forward pass in a graph break so every op runs eagerly through the capture/replay path. |
| `SGLANG_USE_BREAKABLE_CUDA_GRAPH` | `0` | Environment variable. Enables breakable CUDA graph without debug mode. Required for `@eager_on_graph` decorators to take effect. |
How It Works
Capture
Breakable CUDA graph extends PyTorch's `torch.cuda.CUDAGraph` by splitting a single capture into multiple segments separated by graph breaks.
During capture, the flow is:
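The segment structure this produces can be modeled with a small, CUDA-free sketch (`BreakableGraph` and its methods are illustrative names, not the real classes): capture records ops into the current segment, a break closes that segment and schedules an eager function, and replay runs everything in order.

```python
# Conceptual model of capture/replay with graph breaks (illustrative only).
class BreakableGraph:
    def __init__(self):
        self.segments = []   # alternating ("graph", ops) / ("eager", fn) entries
        self._current = []   # ops being recorded into the current segment

    def record(self, op):
        """During capture: record an op into the current captured segment."""
        self._current.append(op)

    def break_graph(self, eager_fn):
        """End the current segment; eager_fn will run between segments on replay."""
        self.segments.append(("graph", list(self._current)))
        self.segments.append(("eager", eager_fn))
        self._current = []

    def end_capture(self):
        self.segments.append(("graph", list(self._current)))
        self._current = []

    def replay(self, x):
        """Replay graph segments in order, running eager functions in between."""
        for kind, payload in self.segments:
            if kind == "graph":
                for op in payload:
                    x = op(x)
            else:
                x = payload(x)  # eager function runs outside any captured graph
        return x

g = BreakableGraph()
g.record(lambda x: x + 1)        # captured into segment 0
g.break_graph(lambda x: x * 10)  # eager call between segments
g.record(lambda x: x - 2)        # captured into segment 1
g.end_capture()
```

In the real implementation each "graph" entry is a captured CUDA graph replayed with `cudaGraphLaunch`, and each break costs one extra `cudaStreamBeginCapture`/`cudaStreamEndCapture` pair at capture time (see Performance below).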
Replay
During replay:

Output Writeback
When a non-graph function produces output during replay, the result must be written back into the same tensor buffers that downstream graph segments reference. The mechanism handles:

- Plain tensors: in-place `copy_()` into the original buffer.
- Structured outputs (dataclasses, objects with tensor attributes): tensor fields are copied in-place; non-tensor fields are replaced.
- Dicts of tensors: tensor values are copied in-place; non-tensor values are replaced.
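The real writeback operates on `torch.Tensor` buffers; as a dependency-free sketch of just the dispatch logic, using a stand-in `FakeTensor` that exposes only the in-place `copy_()` relied on here, it might look like:

```python
from dataclasses import is_dataclass, fields

class FakeTensor:
    """Stand-in for torch.Tensor, exposing only in-place copy_()."""
    def __init__(self, data):
        self.data = list(data)
    def copy_(self, other):
        self.data[:] = other.data

def write_back(dst, src):
    """Copy replay-time outputs back into the buffers graph segments reference."""
    if isinstance(dst, FakeTensor):
        dst.copy_(src)                       # plain tensor: in-place copy
        return dst
    if isinstance(dst, dict):
        for k, v in src.items():             # dict: tensor values copied in place,
            if isinstance(dst.get(k), FakeTensor):
                dst[k].copy_(v)
            else:
                dst[k] = v                   # non-tensor values replaced
        return dst
    if is_dataclass(dst):
        for f in fields(dst):                # dataclass: tensor fields in place
            cur = getattr(dst, f.name)
            new = getattr(src, f.name)
            if isinstance(cur, FakeTensor):
                cur.copy_(new)
            else:
                setattr(dst, f.name, new)    # non-tensor fields replaced
        return dst
    return src                               # anything else: replace outright
```

The key invariant is that tensor buffers keep their identity (and thus their device addresses), which is what lets already-captured downstream segments keep pointing at them.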
Stream Fork/Join Tracking
Some models fork work onto secondary CUDA streams (e.g., for overlapped computation). Breakable CUDA graph hooks `torch.cuda.Stream.wait_stream` to track which streams are forked from the capture stream. When a graph break occurs, all forked streams are automatically joined back before ending the capture segment, and re-forked after beginning the next segment.
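The hook pattern itself is simple to sketch without CUDA. In this illustrative version, `Stream` is a stub standing in for `torch.cuda.Stream`, and a secondary stream calling `wait_stream(capture_stream)` is recorded as a fork:

```python
# Illustrative sketch of hooking wait_stream to track forked streams.
class Stream:
    def wait_stream(self, other):
        pass  # real version inserts a cross-stream dependency

capture_stream = Stream()
forked_streams = set()

_orig_wait_stream = Stream.wait_stream

def _tracking_wait_stream(self, other):
    # A secondary stream waiting on the capture stream is treated as a fork.
    if other is capture_stream and self is not capture_stream:
        forked_streams.add(self)
    return _orig_wait_stream(self, other)

Stream.wait_stream = _tracking_wait_stream  # install the hook

def join_forked_streams():
    """At a graph break: make the capture stream wait on every forked stream."""
    for s in forked_streams:
        capture_stream.wait_stream(s)
    forked_streams.clear()

side = Stream()
side.wait_stream(capture_stream)  # recorded as a fork
```

Joining before ending a capture segment matters because a segment that ends while a forked stream still has outstanding dependent work would capture an incomplete dependency graph.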
Compatibility
- NVIDIA CUDA only. Breakable CUDA graph is not supported on ROCm/HIP or other non-CUDA platforms. On unsupported platforms, `--debug-cuda-graph` is automatically disabled with a warning.
- Requires `cuda-python`. The `cuda.bindings` package must be installed (`pip install cuda-python`).
- Not compatible with memory saver mode. Cannot be used together with `SGLANG_MEMORY_SAVER_CUDA_GRAPH`.
Performance
When no graph breaks are inserted, breakable CUDA graph has minimal overhead compared to standard CUDA graph: the capture/replay path is nearly identical. Each graph break adds:

- One `cudaGraphLaunch` call (to replay the segment before the break)
- One eager Python function call
- One `cudaStreamBeginCapture`/`cudaStreamEndCapture` pair during capture
Code Reference
| File | Description |
|---|---|
| `python/sglang/srt/model_executor/breakable_cuda_graph/breakable_cuda_graph.py` | Core implementation: `eager_on_graph`, `BreakableCUDAGraph`, `BreakableCUDAGraphCapture` |
| `python/sglang/srt/model_executor/breakable_cuda_graph/cuda_utils.py` | CUDA runtime binding utilities |
| `python/sglang/srt/model_executor/cuda_graph_runner.py` | Integration with the main CUDA graph runner |
| `python/sglang/srt/server_args.py` | `--debug-cuda-graph` flag and environment variable handling |
| `python/sglang/srt/environ.py` | `SGLANG_USE_BREAKABLE_CUDA_GRAPH` environment variable definition |
