# Checkpoint Engine Integration

The SGLang checkpoint engine integration provides an efficient way to load model weights using a distributed checkpoint loading system. This feature significantly reduces model loading time, especially for large models and multi-node setups, by parallelizing the weight loading process across multiple processes and nodes.

## Overview

The checkpoint engine integration allows SGLang to:

* Load model weights in parallel using multiple processes
* Distribute weight loading across multiple nodes to increase effective disk bandwidth
* Overlap weight loading with other initialization tasks like CUDA graph capture
* Support both single-node and multi-node deployments

## Installation

First, install the checkpoint engine package:

```bash Command theme={null}
pip install 'checkpoint-engine[p2p]'
```

## Architecture

The system consists of two main components:

1. **SGLang Server**: Runs with the `--wait-for-initial-weights` flag, so it waits to receive weights before becoming ready
2. **Checkpoint Engine Workers**: Separate processes (managed by torchrun) that load and distribute model weights

The checkpoint engine uses a parameter server architecture with support for:

* **Broadcast mode**: Weights are broadcast from loading processes to inference processes
* **P2P mode**: Direct peer-to-peer weight transfer between processes
* **All mode**: Combination of both broadcast and P2P methods

## Usage Examples

### Single Node Setup

**Terminal 1 - Launch SGLang Server:**

```bash Command theme={null}
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights
```

**Terminal 2 - Run Checkpoint Engine:**

Using sglang entrypoint (recommended):

```bash Command theme={null}
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```

Using torchrun directly:

```bash Command theme={null}
torchrun --nproc-per-node 8 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
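The two terminals above can also be driven from a single script, which needs some way to wait until a step has succeeded. The sketch below is a minimal example under stated assumptions: `retry_until` is a hypothetical helper, not part of SGLang, and the `/health` route and `SERVER_URL` variable in the trailing comment are assumptions to verify against your SGLang version.

```bash
#!/usr/bin/env bash
# retry_until ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Hypothetical helper (not part of SGLang): re-runs CMD until it exits 0
# or ATTEMPTS runs have failed. CMD's output is discarded.
retry_until() {
    local attempts="$1" delay="$2"
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}

# Intended flow, left as comments so sourcing this file has no side effects:
#   1. start the Terminal 1 server command in the background with `&`
#   2. run the Terminal 2 checkpoint engine command
#   3. retry_until 60 5 curl -sf "$SERVER_URL/health"   # route is an assumption
```

Passing the probe as a command keeps the helper independent of any particular endpoint, so the same function can wait for a port, a file, or an HTTP route.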

### Multi-Node Setup (2 Nodes)

**Node 0:**

Launch SGLang server:

```bash Command theme={null}
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP]
```

Run checkpoint engine:

Using sglang entrypoint (recommended):

```bash Command theme={null}
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```

Using torchrun directly:

```bash Command theme={null}
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 0 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```

**Node 1:**

Launch SGLang server:

```bash Command theme={null}
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP]
```

Run checkpoint engine:

Using sglang entrypoint (recommended):

```bash Command theme={null}
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```

Using torchrun directly:

```bash Command theme={null}
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 1 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
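The torchrun commands for the two nodes differ only in `--node-rank`, so one parameterized script can serve both. The following is a sketch, not an SGLang utility: `build_torchrun_cmd` and its argument order are hypothetical, and it only assembles the flags already shown above.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: prints the torchrun command for one node so the same
# script can be used on every node. Argument names are illustrative.
build_torchrun_cmd() {
    local node_rank="$1" nnodes="$2" master_addr="$3" checkpoint_path="$4"
    local parallel_size="${5:-8}"   # defaults to 8, as in the examples above
    local cmd=(
        torchrun --nproc-per-node 8
        --nnodes "$nnodes"
        --node-rank "$node_rank"
        --master-addr "$master_addr"
        --master-port 29500
        examples/checkpoint_engine/update.py
        --update-method broadcast
        --checkpoint-path "$checkpoint_path"
        --inference-parallel-size "$parallel_size"
    )
    # %q quotes each argument so the printed line is safe to paste into a shell.
    printf '%q ' "${cmd[@]}"
    printf '\n'
}
```

For example, `build_torchrun_cmd 1 2 "$MASTER_ADDR" /path/to/Qwen/Qwen3-8B/` prints the Node 1 command; printing rather than executing makes it easy to inspect the command before running it.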

### Multi-Node Setup with Tensor Parallelism (TP=16)

**Node 0:**

Launch SGLang server:

```bash Command theme={null}
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 16 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP] \
    --dist-init-addr [IP]:9120 \
    --nnodes 2 \
    --node-rank 0
```

Run checkpoint engine:

Using sglang entrypoint (recommended):

```bash Command theme={null}
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```

Using torchrun directly:

```bash Command theme={null}
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 0 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```

**Node 1:**

Launch SGLang server:

```bash Command theme={null}
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 16 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP] \
    --dist-init-addr [IP]:9120 \
    --nnodes 2 \
    --node-rank 1
```

Run checkpoint engine:

Using sglang entrypoint (recommended):

```bash Command theme={null}
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```

Using torchrun directly:

```bash Command theme={null}
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 1 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```
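In the examples on this page, `--inference-parallel-size` equals the number of nodes spanned by one server multiplied by the processes per node: 1 × 8 = 8 for the independent servers, and 2 × 8 = 16 for the TP=16 deployment. A small check like the one below can catch a mismatch before launch. The function name is hypothetical, and the relationship is inferred from the examples rather than stated by the checkpoint engine.

```bash
#!/usr/bin/env bash
# Hypothetical sanity check: in the examples, --inference-parallel-size equals
# (nodes spanned by one server) x (processes per node).
expected_parallel_size() {
    local nodes_per_server="$1" procs_per_node="$2"
    echo $(( nodes_per_server * procs_per_node ))
}
```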

## Configuration Options

### SGLang Server Options

* `--load-format dummy`: Initialize the model with dummy weights instead of reading them from disk, so the real weight loading can overlap with other initialization tasks
* `--wait-for-initial-weights`: Wait for checkpoint engine to provide weights before becoming ready
* `--host`: Host address for multi-node setups
* `--dist-init-addr`: Distributed initialization address for tensor parallelism

### Checkpoint Engine Options

* `--update-method`: Weight update method (`broadcast`, `p2p`, or `all`)
* `--checkpoint-path`: Path to model checkpoint directory
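* `--inference-parallel-size`: Number of parallel processes in one inference instance (for example, 8 for a single-node TP=8 server, 16 for a TP=16 server spanning two nodes)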
* `--endpoint`: SGLang server endpoint (default: `http://localhost:19730`)
* `--checkpoint-name`: Name for the checkpoint (default: `my-checkpoint-iter-0`)
* `--save-metas-file`: File to save checkpoint metadata
* `--load-metas-file`: File to load checkpoint metadata from
* `--uds`: Unix domain socket path for communication
* `--weight-version`: Version identifier for weights

## Performance Benefits

The checkpoint engine saves time in two main ways:

1. **Multi-node Loading**: Each node loads only a portion of the weights from disk, which increases the effective disk bandwidth. The more nodes participate, the greater the speedup. Preliminary tests show a 20-second reduction in loading time for DeepSeek-R1 on H20-3e with two nodes.

2. **Single Process Optimization**: Using the dummy load format lets the disk-to-CPU transfer overlap with CUDA graph capture and other initialization tasks, providing additional time savings.
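The multi-node effect can be approximated with simple arithmetic. The sketch below assumes an idealized model in which each of N nodes reads 1/N of the checkpoint from its own disk at the same bandwidth; the function name and any numbers passed to it are hypothetical, not measurements.

```bash
#!/usr/bin/env bash
# Idealized model (an assumption, not a measurement): each of N nodes reads
# 1/N of the checkpoint from its own disk at the same bandwidth.
estimate_read_seconds() {
    local checkpoint_gib="$1" nnodes="$2" gib_per_sec="$3"
    awk -v size="$checkpoint_gib" -v n="$nnodes" -v bw="$gib_per_sec" \
        'BEGIN { printf "%.1f\n", size / n / bw }'
}
```

Under this model, doubling the node count halves the disk read time. Real gains depend on storage layout and on the cost of distributing weights between nodes.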

## Troubleshooting

* Ensure the checkpoint engine package is installed: `pip install 'checkpoint-engine[p2p]'`
* Verify network connectivity between nodes in multi-node setups
* Check that the checkpoint path contains valid model files
* Monitor logs for connection errors between SGLang server and checkpoint engine
* Use the `--sleep-time` parameter to add delays when debugging
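
Some of these checks can be scripted before launching anything. The sketch below covers the checkpoint path check only. `has_weight_files` is a hypothetical helper, and the `*.safetensors` pattern is an assumption about the checkpoint layout that may not match every model directory.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: succeeds if DIR contains at least one
# *.safetensors file (the file pattern is an assumption).
has_weight_files() {
    local dir="$1"
    # compgen -G succeeds only if the glob matches at least one path.
    compgen -G "${dir%/}/*.safetensors" >/dev/null
}

# Example use, left as a comment so sourcing this file has no side effects:
#   has_weight_files /path/to/Qwen/Qwen3-8B/ || echo "no weight files found" >&2
```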

## References

* [Checkpoint Engine Repository](https://github.com/MoonshotAI/checkpoint-engine)
