SGLang for RL Systems#
This document is a practical guide for infrastructure teams integrating SGLang into RL and post-training systems. It focuses on the operational pain points in the loop (rollout, evaluation, training, weight sync) and maps them to concrete SGLang APIs, flags, and integration patterns, with the goal of maximizing rollout efficiency, accuracy, and stability while keeping rollout and serving behavior aligned in production environments.
Why SGLang for RL Lifecycle?#
Let’s embrace a guiding principle from DeepMind’s early RL engineering:
Be a library, not a framework.
This philosophy empowers innovation by providing SGLang as flexible tools, not rigid structures. Here are five reasons to use SGLang for your RL lifecycle:
Fine-Grained Engine Sleep and Wake Up: give rollout and training full GPU memory when each phase needs it
Open-To-Use Refit Functionality: diverse methods for co-location or disaggregation
Easy To Postpone Generation: enable partial rollout and dedicated rollout control
Deterministic Inference: reproducible rollout probabilities that enable zero training-inference mismatch
Load Balancing Router: cache-aware load-balancing for high-throughput rollout
The following sections cover these aspects in detail.
Fine-Grained Engine Sleep and Wake Up#
Rollout and training are both memory-intensive, and co-locating them on the same GPUs often leads to memory pressure and slow handoffs. SGLang provides a memory-aware sleep/wake mechanism that releases KV cache and weights while keeping the server process alive, then resumes them for rollout without a full restart. This avoids repeated disk I/O and CUDA graph recapture during each RL step.
Under the hood, this relies on CUDA-graph-aware weight offload via torch_memory_saver, which preserves virtual memory addresses so captured CUDA graphs remain valid for replay. For details, see: Efficient RL Training - Optimizing Memory Usage in verl.
Server flag#
Enable memory saver support when launching the server:
--enable-memory-saver
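For co-located RL setups that drive SGLang through Python rather than HTTP, the same flag maps to an engine argument. A minimal sketch, assuming the offline Engine accepts server flags as keyword arguments and using a placeholder model path:

```python
# Minimal sketch: offline SGLang engine with memory-saver support, the Python
# equivalent of launching the HTTP server with --enable-memory-saver.
import sglang as sgl

engine = sgl.Engine(
    model_path="Qwen/Qwen2.5-7B-Instruct",  # placeholder checkpoint
    enable_memory_saver=True,
)
```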
Release Memory#
Endpoint: POST /release_memory_occupation
Request body:
| Field | Description | Defaults | Options |
|---|---|---|---|
| `tags` | Which memory regions to release. If omitted, all are released. | `None` (release all) | Type: `list[str]`; values: `"weights"`, `"kv_cache"` |
Behavior notes:
This call asserts there are no ongoing requests. Ensure the engine is idle before calling it.
If `kv_cache` is released, SGLang flushes the cache; subsequent requests will rebuild the KV cache as needed.
Resume Memory#
Endpoint: POST /resume_memory_occupation
Request body:
| Field | Description | Defaults | Options |
|---|---|---|---|
| `tags` | Which memory regions to resume. If omitted, all are resumed. | `None` (resume all) | Type: `list[str]`; values: `"weights"`, `"kv_cache"` |
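A minimal sketch of one sleep/wake cycle over HTTP, assuming a server launched with --enable-memory-saver at a placeholder address and the `tags` field as reconstructed in the tables above:

```python
import requests

BASE = "http://localhost:30000"  # placeholder server address

# Engine must be idle (no in-flight requests) before releasing memory.
requests.post(f"{BASE}/release_memory_occupation",
              json={"tags": ["kv_cache", "weights"]}).raise_for_status()

# ... run the training step on the freed GPU memory ...

# Resume weights first so new parameters can be loaded before the KV cache
# is re-allocated, then bring the KV cache back for the next rollout phase.
requests.post(f"{BASE}/resume_memory_occupation",
              json={"tags": ["weights"]}).raise_for_status()
# (refit weights here, e.g. via one of the update-weights endpoints below)
requests.post(f"{BASE}/resume_memory_occupation",
              json={"tags": ["kv_cache"]}).raise_for_status()
```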
Open-To-Use Refit Functionality#
After training completes each step, rollout engines must be refit with new weights. SGLang supports three refit strategies so you can match your infrastructure style (co-located vs disaggregated) and scaling needs. Each strategy maps to a concrete API with clear request schemas. For a deeper dive into SGLang’s weight update utilities, see RL System Deep Thinking: Weight Update Mechanisms.
How to choose:
From disk is simplest and best for elastic rollout scaling and checkpointing.
From tensor is best for co-located training/rollout when you can pass in-memory tensors.
From distributed is best for disaggregated training/rollout with dedicated communication groups (NCCL/IB).
Update Weights from Disk#
When to use:
Save checkpoint to disk and update weights from disk
Dynamic scaling (new rollout instances can load from the same checkpoint)
Why it works well:
This path trades some I/O overhead for simplicity and flexibility. It integrates naturally with checkpointing and makes it trivial to add new rollout engines: point them at the same checkpoint and call the API. It is also the safest option for high availability because the checkpoint itself is the source of truth.
Endpoint: POST /update_weights_from_disk
Request body:
| Field | Description | Defaults | Options |
|---|---|---|---|
| `model_path` | The model path with the new weights. | Required | Type: `str` |
| `load_format` | The format to load the weights. | `None` | Type: `str` |
| `abort_all_requests` | Abort all running requests before update. | `False` | Type: `bool` |
| `weight_version` | Optional weight version label tracked by the server. | `None` | Type: `str` |
| | Perform weight load asynchronously. | | Type: `bool` |
| | Empty torch cache. | | Type: `bool` |
| | Keep scheduler paused after update. | | Type: `bool` |
| | Recapture CUDA graphs after update. | | Type: `bool` |
| | Trainer step id for rollout bookkeeping. | | Type: `int` |
| `flush_cache` | Flush KV cache after update. | | Type: `bool` |
Response body:
| Field | Description | Defaults | Options |
|---|---|---|---|
| `success` | Whether the update succeeded. | - | Type: `bool` |
| `message` | Status / error message. | - | Type: `str` |
| `num_paused_requests` | Number of paused requests during update. | - | Type: `int` |
Python Engine API: engine.update_weights_from_disk(model_path, load_format=None)
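A minimal sketch of a disk-based refit over HTTP, assuming a placeholder server address and checkpoint path; field names follow the request table above:

```python
import requests

resp = requests.post(
    "http://localhost:30000/update_weights_from_disk",
    json={
        "model_path": "/checkpoints/step_0042",  # checkpoint written by the trainer
        "abort_all_requests": True,              # avoid mixing old- and new-policy tokens
    },
    timeout=600,
)
result = resp.json()
assert result["success"], result["message"]
```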
Update Weights from Tensor#
When to use:
Co-located training and rollout, where training can provide tensors directly
Fast in-memory updates
Important constraints:
This strategy requires the training process and rollout engine to share access to the tensors. Co-located setups must keep the model on GPU; moving tensors to CPU will break the update path. For high-performance MoE or specialized attention kernels, co-location may limit some optimizations compared to disaggregated rollouts.
Endpoint: POST /update_weights_from_tensor
Request body:
| Field | Description | Defaults | Options |
|---|---|---|---|
| `serialized_named_tensors` | Per-TP serialized tensor payloads. | Required | Type: `list[str \| bytes]` |
| `load_format` | Optional load format selector. | `None` | Type: `str` |
| `flush_cache` | Flush KV cache after update. | `True` | Type: `bool` |
| `abort_all_requests` | Abort all running requests before update. | `False` | Type: `bool` |
| `weight_version` | Optional version label tracked by the server. | `None` | Type: `str` |
Note: The serialized tensor payloads must be created with MultiprocessingSerializer.serialize(...) and should be base64-safe strings.
Python Engine API: engine.update_weights_from_tensor(named_tensors, load_format=None, flush_cache=True)
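A minimal sketch of a co-located refit through the Python Engine API above. The helper name is illustrative; it assumes `named_tensors` is a list of `(name, tensor)` pairs that remain on GPU:

```python
import torch

def refit_from_trainer(engine, train_model: torch.nn.Module) -> None:
    # Collect (name, tensor) pairs straight from the training model; they must
    # stay on GPU, since moving them to CPU breaks the co-located update path.
    named_tensors = [(name, p.detach()) for name, p in train_model.named_parameters()]
    engine.update_weights_from_tensor(
        named_tensors,
        flush_cache=True,  # drop KV cache computed under the old policy
    )
```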
Update Weights from Distributed Group#
When to use:
Disaggregated training and rollout
NCCL or IB-backed weight broadcast from training workers to rollout workers
How it works:
Training workers gather weights (typically on TP rank 0), broadcast them to the rollout group, and each rollout TP shard loads the parameters it needs. This avoids disk I/O and keeps training and rollout decoupled, at the cost of managing a dedicated communication group.
Initialize weight update group
Endpoint: POST /init_weights_update_group
Request body:
| Field | Description | Defaults | Options |
|---|---|---|---|
| `master_address` | Group master address. | Required | Type: `str` |
| `master_port` | Group master port. | Required | Type: `int` |
| `rank_offset` | Offset for local rank mapping. | Required | Type: `int` |
| `world_size` | Total world size. | Required | Type: `int` |
| `group_name` | Group name. | | Type: `str` |
| `backend` | Communication backend. | `nccl` | Type: `str` |
Update weights
Endpoint: POST /update_weights_from_distributed
Request body:
| Field | Description | Defaults | Options |
|---|---|---|---|
| `names` | Parameter names to update. | Required | Type: `list[str]` |
| `dtypes` | Dtype strings for each parameter. | Required | Type: `list[str]` |
| `shapes` | Tensor shapes. | Required | Type: `list[list[int]]` |
| `group_name` | Group name. | | Type: `str` |
| `flush_cache` | Flush KV cache after update. | | Type: `bool` |
| `abort_all_requests` | Abort all running requests before update. | | Type: `bool` |
| `weight_version` | Optional version label. | | Type: `str` |
| `load_format` | Optional format selector. | | Type: `str` |
Destroy weights update group
Endpoint: POST /destroy_weights_update_group
Request body:
| Field | Description | Defaults | Options |
|---|---|---|---|
| `group_name` | Group name. | | Type: `str` |
Python Engine APIs:
engine.init_weights_update_group(...)
engine.update_weights_from_distributed(names, dtypes, shapes, ...)
engine.destroy_weights_update_group(group_name)
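A minimal sketch of the trainer-driven flow, assuming placeholder addresses and group name, a rollout server at `ROLLOUT_URL`, and that the trainer joins the same group as rank 0 with a matching master address, port, and world size:

```python
from concurrent.futures import ThreadPoolExecutor

import requests
import torch
import torch.distributed as dist

ROLLOUT_URL = "http://rollout-0:30000"         # placeholder rollout server
MASTER_ADDR, MASTER_PORT = "trainer-0", 29500  # placeholder rendezvous point
GROUP_NAME = "weight_update_group"             # placeholder group name


def init_group(rollout_tp_size: int) -> None:
    # Rendezvous completes only once the trainer also creates its side of the
    # group (rank 0, same master/port/world_size), so do both concurrently.
    requests.post(f"{ROLLOUT_URL}/init_weights_update_group", json={
        "master_address": MASTER_ADDR,
        "master_port": MASTER_PORT,
        "rank_offset": 1,                 # rollout TP ranks follow the trainer rank
        "world_size": 1 + rollout_tp_size,
        "group_name": GROUP_NAME,
        "backend": "nccl",
    }).raise_for_status()


def broadcast_weights(train_model: torch.nn.Module, group) -> None:
    # Each call makes the rollout workers post a matching receive, so issue the
    # HTTP request and the trainer-side broadcast concurrently to avoid a deadlock.
    with ThreadPoolExecutor(max_workers=1) as pool:
        for name, param in train_model.named_parameters():
            fut = pool.submit(
                requests.post,
                f"{ROLLOUT_URL}/update_weights_from_distributed",
                json={
                    "names": [name],
                    "dtypes": [str(param.dtype).removeprefix("torch.")],
                    "shapes": [list(param.shape)],
                    "group_name": GROUP_NAME,
                },
            )
            dist.broadcast(param.data, src=0, group=group)
            fut.result().raise_for_status()


def destroy_group() -> None:
    requests.post(f"{ROLLOUT_URL}/destroy_weights_update_group",
                  json={"group_name": GROUP_NAME}).raise_for_status()
```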
Easy To Postpone Generation#
Multi-turn RL rollouts often suffer from long-tail requests that block the entire batch. A small number of slow interactions can stall all GPUs, and the long-tail behavior makes profiling and monitoring difficult.
SGLang exposes explicit pause/resume APIs so you can pause slow requests and continue them later. This pattern matches systems like APRIL: terminate the rollout once enough responses are collected, and recycle the incomplete responses in the next step. The result is higher GPU utilization without discarding partial work.
The correct execution flow when updating weights from training is pause_generation → update weights → continue_generation: an update can only happen while SGLang is not actively processing inference tasks.
Pause Generation#
Endpoint: POST /pause_generation
Request body:
| Field | Description | Defaults | Options |
|---|---|---|---|
| `mode` | Pause mode. | `abort` | Values: `abort`, `retract`, `in_place` |
Modes:
`abort`: Default behavior, identical to the `abort` endpoint with `abort_all` set. Pending requests from the `waiting_queue` and `running_queue` will be returned immediately to the caller.
`retract`: Put the engine in a “paused” state and move running requests back to the waiting queue. The KV cache can be flushed and recomputed later.
`in_place`: Put the engine in a “paused” state without changing the states of the requests. Running requests rely on the availability of their KV cache to continue, so any subsequent `flush_cache` call will be unsuccessful.
Continue Generation#
Endpoint: POST /continue_generation
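A minimal sketch of the pause → refit → continue flow over HTTP, using the request fields as reconstructed above and a placeholder server address and checkpoint path:

```python
import requests

BASE = "http://localhost:30000"  # placeholder server address

# Retract in-flight multi-turn rollouts to the waiting queue instead of aborting
# them, so their work can resume (with recomputed KV cache) after the update.
requests.post(f"{BASE}/pause_generation", json={"mode": "retract"}).raise_for_status()

# Refit while the scheduler is idle; any of the three refit strategies works.
requests.post(f"{BASE}/update_weights_from_disk",
              json={"model_path": "/checkpoints/step_0042"}).raise_for_status()

# Resume scheduling; retracted requests continue from where they left off.
requests.post(f"{BASE}/continue_generation").raise_for_status()
```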
Deterministic Inference#
In many RL stacks, rollout and training are implemented with different kernels or batching behavior. Even when weights are identical, token probabilities can drift, silently breaking the on-policy assumption. This is the training–inference mismatch problem.
SGLang supports a deterministic inference mode that reduces non-determinism across batch shapes. This mitigates variance introduced by runtime batching and kernel selection. To go further and achieve true on-policy training, you also need to modify the training engine to use the same deterministic kernels. For implementation details, see these Miles examples: True On-Policy and True On-Policy for VLM. For additional context, see the blog post Let Speed Be With Stability: All-In-One Solution to Training-Inference Mismatch with Miles.
Server flag:
--enable-deterministic-inference
For more details, see Deterministic Inference.
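As a minimal mismatch check (a sketch, not part of SGLang itself): send the same greedy prompt under different batch loads to a server launched with --enable-deterministic-inference and compare the returned log-probs, assuming the native `/generate` API with `return_logprob` and its `meta_info.output_token_logprobs` field:

```python
import requests

BASE = "http://localhost:30000"  # placeholder server address

def output_logprobs(prompt: str) -> list:
    resp = requests.post(f"{BASE}/generate", json={
        "text": prompt,
        "sampling_params": {"temperature": 0.0, "max_new_tokens": 32},
        "return_logprob": True,
    }).json()
    return resp["meta_info"]["output_token_logprobs"]

solo = output_logprobs("Prove that the square root of 2 is irrational.")
# ... re-issue the same prompt while other rollout traffic is in flight ...
busy = output_logprobs("Prove that the square root of 2 is irrational.")
assert solo == busy, "log-probs drifted across batch compositions"
```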
Load Balancing Router#
SGLang Model Gateway is the recommended control plane for large-scale RL rollouts. It provides async, non-blocking request handling, cache-aware load balancing, and fault-tolerant routing across rollout and reward servers. This lets you keep GPUs saturated while avoiding long-tail stalls and brittle, engine-local concurrency logic. It has been deployed in the training of GLM 4.5+ models and has proven highly efficient in production-scale RL workloads.
Key benefits for RL infrastructure:
Async non-blocking efficiency: SGLang’s native async server/router architecture (HTTP/gRPC) manages concurrency automatically. This guarantees maximum GPU saturation and effective continuous batching without requiring complex, manual implementation by engineers.
Elasticity and fault tolerance: By encapsulating the reward model and rollout as independent servers, SGLang decouples them logically and physically. This architecture provides robust disaster recovery for large-scale distributed training; if a server fails, the router automatically redirects traffic to healthy nodes, ensuring the training process continues without interruption.
Training–Inference alignment: Using the SGLang Model Gateway for both training and inference ensures “What You See Is What You Get.” This eliminates score discrepancies and the painful backend alignment issues often caused by using different engines for training versus deployment.
Dynamic load balancing and long-tail mitigation: Unlike static partitioning, the SGLang Model Gateway enables request-level dynamic dispatching for multi-turn RL. It can distribute different turns of a conversation across different servers to balance workloads and eliminate long-tail latency caused by varying sequence lengths.
For deployment and configuration, see: SGLang Model Gateway