
# Disaggregated Diffusion Pipeline

Split a monolithic text-to-video/image pipeline into independent **Encoder**, **Denoiser**, and **Decoder** roles, each running on its own GPU(s). A central **DiffusionServer** routes requests through the pipeline.

## Quick Start

Disaggregation is controlled by a single flag: `--disagg-role`. Each component is launched independently, just like LLM PD disaggregation.

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}>
  <colgroup>
    <col style={{width: "50%"}} />

    <col style={{width: "50%"}} />
  </colgroup>

  <thead>
    <tr>
      <th><code>--disagg-role</code></th>
      <th>What it runs</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td><code>monolithic</code></td>
      <td>(Default) Standard single-server mode</td>
    </tr>

    <tr>
      <td><code>encoder</code></td>
      <td>All stages with the default <code>RoleType.ENCODER</code> affinity: <code>InputValidationStage</code>, <code>TextEncodingStage</code> (plus <code>ImageEncodingStage</code> / <code>ImageVAEEncodingStage</code> for image-conditioned pipelines), <code>LatentPreparationStage</code>, <code>TimestepPreparationStage</code>, and any model-specific "before denoising" stage (e.g. <code>QwenImageLayeredBeforeDenoisingStage</code>, <code>GlmImageBeforeDenoisingStage</code>).</td>
    </tr>

    <tr>
      <td><code>denoiser</code></td>
      <td><code>DenoisingStage</code> (and its subclasses: <code>CausalDMDDenoisingStage</code>, <code>DmdDenoisingStage</code>, <code>LTX2AVDenoisingStage</code>, <code>LTX2RefinementStage</code>, <code>Hunyuan3DShapeDenoisingStage</code>, ...) — the DiT forward loop plus the scheduler stepping it drives.</td>
    </tr>

    <tr>
      <td><code>decoder</code></td>
      <td><code>DecodingStage</code> (VAE decode) and its subclasses (<code>LTX2AVDecodingStage</code>, <code>HeliosDecodingStage</code>, ...).</td>
    </tr>

    <tr>
      <td><code>server</code></td>
      <td>DiffusionServer head node + HTTP server (no GPU)</td>
    </tr>
  </tbody>
</table>

> Each stage declares its role via the `role_affinity` property on `PipelineStage` (default `ENCODER`). When `--disagg-role` is not `monolithic`, the pipeline only instantiates stages whose affinity matches, so the above table is the source of truth for what actually runs in each process.
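
The filtering rule can be illustrated with a minimal, self-contained sketch. The `RoleType` enum, the `Stage` record, and `select_stages()` below are hypothetical stand-ins written for this example; they are not SGLang's actual classes, and the real `PipelineStage` may be structured differently.

```python
from dataclasses import dataclass
from enum import Enum


class RoleType(Enum):
    # Illustrative stand-in for the role enum named in the text.
    ENCODER = "encoder"
    DENOISER = "denoiser"
    DECODER = "decoder"


@dataclass(frozen=True)
class Stage:
    # Hypothetical stage record; ENCODER is the default affinity, as in the text.
    name: str
    role_affinity: RoleType = RoleType.ENCODER


def select_stages(stages, disagg_role):
    """Return the stages a process with the given --disagg-role would keep."""
    if disagg_role == "monolithic":
        return list(stages)  # single-server mode keeps every stage
    # Any other role keeps only stages whose affinity matches it.
    # "server" matches no stage, so the head node runs none of them.
    return [s for s in stages if s.role_affinity.value == disagg_role]


PIPELINE = [
    Stage("InputValidationStage"),
    Stage("TextEncodingStage"),
    Stage("LatentPreparationStage"),
    Stage("TimestepPreparationStage"),
    Stage("DenoisingStage", RoleType.DENOISER),
    Stage("DecodingStage", RoleType.DECODER),
]

for role in ("encoder", "denoiser", "decoder", "server"):
    print(role, [s.name for s in select_stages(PIPELINE, role)])
```

A practical consequence of the default: in this model, a custom stage that does not override its affinity ends up in the encoder process.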

### Single-Machine Example (Verified)

The following commands have been tested end-to-end on an 8×H200 machine with
`Wan-AI/Wan2.1-T2V-1.3B-Diffusers`. Each role runs on a separate GPU via
`--base-gpu-id`; the `server` head node requires no GPU.

```bash theme={null}
# Terminal 1: Encoder (GPU 0)
sglang serve --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
    --disagg-role encoder \
    --disagg-server-addr tcp://127.0.0.1:19655 \
    --scheduler-port 19000 \
    --num-gpus 1 --base-gpu-id 0

# Terminal 2: Denoiser (GPU 1)
sglang serve --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
    --disagg-role denoiser \
    --disagg-server-addr tcp://127.0.0.1:19655 \
    --scheduler-port 19001 \
    --num-gpus 1 --base-gpu-id 1

# Terminal 3: Decoder (GPU 2)
sglang serve --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
    --disagg-role decoder \
    --disagg-server-addr tcp://127.0.0.1:19655 \
    --scheduler-port 19002 \
    --num-gpus 1 --base-gpu-id 2

# Terminal 4: DiffusionServer head (no GPU, receives HTTP requests)
sglang serve --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
    --disagg-role server \
    --encoder-urls  "tcp://127.0.0.1:19000" \
    --denoiser-urls "tcp://127.0.0.1:19001" \
    --decoder-urls  "tcp://127.0.0.1:19002" \
    --host 0.0.0.0 --port 22000 \
    --scheduler-port 19655

# Send request (video generation)
curl http://127.0.0.1:22000/v1/videos \
    -H "Content-Type: application/json" \
    -d '{"model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", "prompt": "A curious raccoon exploring a garden, cinematic", "size": "832x480"}'
```
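
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the `curl` call above. The helper names `build_video_request()` and `submit()` are made up for this example, the 600-second client timeout is an arbitrary choice, and the code assumes the endpoint replies with a JSON body (the response schema is not described here).

```python
import json
import urllib.request


def build_video_request(prompt, size="832x480", host="127.0.0.1", port=22000,
                        model="Wan-AI/Wan2.1-T2V-1.3B-Diffusers"):
    """Build (but do not send) the POST request used in the curl example."""
    payload = {"model": model, "prompt": prompt, "size": size}
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/videos",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(request, timeout=600):
    """Send the request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_video_request("A curious raccoon exploring a garden, cinematic")
print(req.get_method(), req.full_url)
# Call submit(req) once all four processes are up and listening.
```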

> **Tested result (8×H200):**
> Encoder 2.3 s (TextEncoding) → Denoiser 312.8 s (50 steps, layerwise offload) → Decoder 7.1 s (VAE decode).
> Total \~322 s for 81-frame 1024×1024 video.

> **Tip:** `--base-gpu-id` controls which physical GPU the role uses.
> Encoder and Decoder can share a GPU (e.g. both `--base-gpu-id 0`) to save resources,
> provided that GPU has enough memory to hold both roles.

### Multi-Machine Example

The CLI pattern is the same as on a single machine: replace `127.0.0.1` with each
machine's actual IP and add the RDMA flags for direct transfer:

```bash theme={null}
# Machine A (10.0.0.1): Encoder
sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers \
    --disagg-role encoder \
    --disagg-server-addr tcp://10.0.0.4:19655 \
    --scheduler-port 19000 \
    --num-gpus 1 \
    --disagg-p2p-hostname 10.0.0.1 --disagg-ib-device mlx5_0

# Machine B (10.0.0.2): Denoiser (4 GPUs with SP)
sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers \
    --disagg-role denoiser \
    --disagg-server-addr tcp://10.0.0.4:19655 \
    --scheduler-port 19001 \
    --num-gpus 4 --denoiser-sp 4 --denoiser-ulysses 2 --denoiser-ring 2 \
    --disagg-p2p-hostname 10.0.0.2 --disagg-ib-device mlx5_0

# Machine C (10.0.0.3): Decoder
sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers \
    --disagg-role decoder \
    --disagg-server-addr tcp://10.0.0.4:19655 \
    --scheduler-port 19002 \
    --num-gpus 1 \
    --disagg-p2p-hostname 10.0.0.3 --disagg-ib-device mlx5_0

# Machine D (10.0.0.4): DiffusionServer head
sglang serve --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers \
    --disagg-role server \
    --encoder-urls  "tcp://10.0.0.1:19000" \
    --denoiser-urls "tcp://10.0.0.2:19001" \
    --decoder-urls  "tcp://10.0.0.3:19002" \
    --host 0.0.0.0 --port 30000 \
    --scheduler-port 19655 \
    --disagg-dispatch-policy max_free_slots
```

> ZMQ handles startup order gracefully: role instances and the head node can start in any order.

## Multiple Instances per Role

Use semicolons in `--*-urls` to register multiple instances:

```bash theme={null}
# 2 encoders + 2 denoisers (4-GPU SP each) + 1 decoder
sglang serve --model-path ... --disagg-role server \
    --encoder-urls  "tcp://10.0.0.1:35000;tcp://10.0.0.2:35000" \
    --denoiser-urls "tcp://10.0.0.3:35000;tcp://10.0.0.4:35000" \
    --decoder-urls  "tcp://10.0.0.5:35000"
```
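
When these flags are generated from a script, it can help to check the value before launching. The following is a minimal sketch of the semicolon-separated format only; `split_instance_urls()` is a hypothetical helper, not SGLang's own parser, and the `tcp://host:port` check is an assumption based on the examples on this page.

```python
from urllib.parse import urlparse


def split_instance_urls(value):
    """Split a semicolon-separated URL list and check each tcp://host:port entry."""
    urls = [part.strip() for part in value.split(";") if part.strip()]
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme != "tcp" or not parsed.hostname or parsed.port is None:
            raise ValueError(f"expected tcp://host:port, got {url!r}")
    return urls


encoders = split_instance_urls("tcp://10.0.0.1:35000;tcp://10.0.0.2:35000")
print(len(encoders), "encoder instances:", encoders)
```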

## Port Convention

Result endpoints are derived deterministically from the head node's `--scheduler-port` (default: 5555):

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}>
  <colgroup>
    <col style={{width: "50%"}} />

    <col style={{width: "50%"}} />
  </colgroup>

  <thead>
    <tr>
      <th>Socket</th>
      <th>Port</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>DS frontend (ROUTER)</td>
      <td><code>scheduler\_port</code></td>
    </tr>

    <tr>
      <td>Encoder result (PULL)</td>
      <td><code>scheduler\_port + 1</code></td>
    </tr>

    <tr>
      <td>Denoiser result (PULL)</td>
      <td><code>scheduler\_port + 2</code></td>
    </tr>

    <tr>
      <td>Decoder result (PULL)</td>
      <td><code>scheduler\_port + 3</code></td>
    </tr>
  </tbody>
</table>

Role instances derive their result endpoint automatically from `--disagg-server-addr`, so no manual endpoint configuration is needed.
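
The port arithmetic from the table can be written out as a short sketch. `result_endpoint()` is a hypothetical helper for illustration, not an SGLang function; only the `+1`, `+2`, and `+3` offsets come from the table.

```python
from urllib.parse import urlparse

# Offsets relative to the head node's scheduler port, as listed in the table.
RESULT_PORT_OFFSET = {"encoder": 1, "denoiser": 2, "decoder": 3}


def result_endpoint(disagg_server_addr, role):
    """Derive the result endpoint for a role from the head node's address."""
    parsed = urlparse(disagg_server_addr)
    if parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected tcp://host:port, got {disagg_server_addr!r}")
    return f"tcp://{parsed.hostname}:{parsed.port + RESULT_PORT_OFFSET[role]}"


for role in RESULT_PORT_OFFSET:
    print(role, result_endpoint("tcp://127.0.0.1:19655", role))
```

With `--scheduler-port 19655`, as in the examples on this page, this gives ports 19656 through 19658 on the head node, so those ports must be free and reachable from the role instances.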

## Transfer Mechanism

Tensor data between roles (encoder→denoiser, denoiser→decoder) is transferred via a P2P transfer engine. The DiffusionServer only routes lightweight control messages (alloc/push/ready); actual tensor data flows directly between instances.

**mooncake-transfer-engine** is required for disaggregated diffusion. It provides RDMA for direct GPU-to-GPU data movement.

```bash theme={null}
pip install mooncake-transfer-engine
```

### Transfer Flow

1. **Sender** (encoder/denoiser) stages tensors: async copy to transfer buffer (GPU or CPU pinned, depending on GPUDirect support), overlapped with metadata JSON serialization.
2. **Sender** sends `transfer_staged` control message to DiffusionServer (metadata only, no tensor data).
3. **DiffusionServer** sends `transfer_alloc` to receiver → receiver allocates buffer slot → replies `transfer_allocated`.
4. **DiffusionServer** sends `transfer_push` to receiver with sender's address info.
5. **Receiver** pulls data via transfer engine (Mooncake RDMA or mock), sends `transfer_ready`.
6. **Receiver** loads tensors async on a dedicated transfer stream, overlapped with the previous request's compute.

Decoder results (final output) flow back through DiffusionServer as raw ZMQ frames to the HTTP client.
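
The control-plane part of this flow (steps 2 to 5) can be sketched as an ordered list of messages. The message names are taken from the steps above; the `ControlMessage` record and `handoff_messages()` are hypothetical, and sending `transfer_ready` to the DiffusionServer is an assumption, since the steps do not name its recipient.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlMessage:
    src: str
    dst: str
    kind: str


def handoff_messages(sender, receiver, server="DiffusionServer"):
    """Control messages for one tensor handoff, in the order of steps 2-5."""
    return [
        ControlMessage(sender, server, "transfer_staged"),       # step 2
        ControlMessage(server, receiver, "transfer_alloc"),      # step 3
        ControlMessage(receiver, server, "transfer_allocated"),  # step 3 reply
        ControlMessage(server, receiver, "transfer_push"),       # step 4
        # Step 5: recipient assumed to be the DiffusionServer.
        ControlMessage(receiver, server, "transfer_ready"),
    ]


for hop in (("encoder", "denoiser"), ("denoiser", "decoder")):
    for msg in handoff_messages(*hop):
        print(f"{msg.src:>15} -> {msg.dst:<15} {msg.kind}")
```

Every message in this sketch has the DiffusionServer on one side, which reflects the design described above: control traffic goes through the head node, while tensor bytes move directly between instances.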

### RDMA Flags

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}>
  <colgroup>
    <col style={{width: "33.33%"}} />

    <col style={{width: "33.33%"}} />

    <col style={{width: "33.33%"}} />
  </colgroup>

  <thead>
    <tr>
      <th>Flag</th>
      <th>Default</th>
      <th>Description</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td><code>--disagg-p2p-hostname</code></td>
      <td><code>127.0.0.1</code></td>
      <td>RDMA-reachable hostname/IP of this instance</td>
    </tr>

    <tr>
      <td><code>--disagg-ib-device</code></td>
      <td><code>None</code></td>
      <td>InfiniBand device (e.g., <code>mlx5\_0</code>, <code>mlx5\_roce0</code>)</td>
    </tr>

    <tr>
      <td><code>--disagg-transfer-pool-size</code></td>
      <td>256 MiB</td>
      <td>Pinned memory pool per instance</td>
    </tr>
  </tbody>
</table>

Set `--disagg-p2p-hostname` to the machine's actual IP on every instance. For multi-machine deployments, `--disagg-ib-device` selects the RDMA NIC to use.

## Per-Role Parallelism

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}>
  <colgroup>
    <col style={{width: "50%"}} />

    <col style={{width: "50%"}} />
  </colgroup>

  <thead>
    <tr>
      <th>Flag</th>
      <th>Description</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td><code>--encoder-tp</code></td>
      <td>Encoder tensor parallelism</td>
    </tr>

    <tr>
      <td><code>--denoiser-tp</code> / <code>--denoiser-sp</code> / <code>--denoiser-ulysses</code> / <code>--denoiser-ring</code></td>
      <td>Denoiser parallelism</td>
    </tr>

    <tr>
      <td><code>--decoder-tp</code></td>
      <td>Decoder tensor parallelism</td>
    </tr>
  </tbody>
</table>

If not specified, parallelism is auto-derived from `--num-gpus`.
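
When setting the denoiser flags by hand, a quick consistency check can catch mismatched degrees before launch. The sketch below rests on two assumptions that this page does not state explicitly: that the sequence-parallel degree equals `ulysses × ring` (as in the multi-machine example, where `4 = 2 × 2`), and that it cannot exceed `--num-gpus`. `check_denoiser_parallelism()` is a hypothetical helper, not part of SGLang.

```python
def check_denoiser_parallelism(num_gpus, sp, ulysses, ring):
    """Return a list of problems; an empty list means the layout looks consistent."""
    if min(num_gpus, sp, ulysses, ring) < 1:
        return ["all degrees must be positive integers"]
    problems = []
    # Assumption: sequence parallelism factors into ulysses x ring.
    if ulysses * ring != sp:
        problems.append(f"ulysses ({ulysses}) x ring ({ring}) != sp ({sp})")
    # Assumption: the SP group cannot be larger than the instance's GPU count.
    if sp > num_gpus:
        problems.append(f"sp ({sp}) exceeds num_gpus ({num_gpus})")
    return problems


# Values from the multi-machine denoiser command on this page.
print(check_denoiser_parallelism(num_gpus=4, sp=4, ulysses=2, ring=2))
```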

## Other Options

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}>
  <colgroup>
    <col style={{width: "33.33%"}} />

    <col style={{width: "33.33%"}} />

    <col style={{width: "33.33%"}} />
  </colgroup>

  <thead>
    <tr>
      <th>Flag</th>
      <th>Default</th>
      <th>Description</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td><code>--disagg-timeout</code></td>
      <td><code>600</code></td>
      <td>Timeout (seconds) for pending requests</td>
    </tr>

    <tr>
      <td><code>--disagg-dispatch-policy</code></td>
      <td><code>round\_robin</code></td>
      <td><code>round\_robin</code> or <code>max\_free\_slots</code></td>
    </tr>
  </tbody>
</table>
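
To show how the two dispatch policies differ, here is a toy sketch. The policy names come from the table; the classes, the `pick()` interface, and the idea that each instance reports a free-slot count are assumptions inferred from those names, not SGLang's implementation.

```python
from itertools import count


class RoundRobin:
    """Cycle through instances in order, ignoring their load."""

    def __init__(self):
        self._next = count()

    def pick(self, free_slots):
        # free_slots holds one free-slot count per instance.
        return next(self._next) % len(free_slots)


class MaxFreeSlots:
    """Pick the instance that currently reports the most free slots."""

    def pick(self, free_slots):
        # max() returns the first maximum, so ties go to the lowest index.
        return max(range(len(free_slots)), key=lambda i: free_slots[i])


free = [1, 3, 2]  # hypothetical free-slot counts for three denoisers
rr = RoundRobin()
print("round_robin:", [rr.pick(free) for _ in range(4)])
print("max_free_slots:", MaxFreeSlots().pick(free))
```

Under these assumptions, `round_robin` spreads requests evenly when instances are identical, while `max_free_slots` is the better fit when instances differ in capacity or requests vary widely in duration.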

## Python API

For programmatic single-machine deployment, `launch_pool_disagg_server()` is available:

```python theme={null}
from sglang.multimodal_gen.runtime.server_args import ServerArgs
from sglang.multimodal_gen.runtime.launch_server import launch_pool_disagg_server

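# Denoiser sequence parallelism: sp=4, split as ulysses=2 x ring=2.
server_args = ServerArgs.from_kwargs(
    model_path="Wan-AI/Wan2.1-T2V-14B-Diffusers",
    denoiser_sp=4, denoiser_ulysses=2, denoiser_ring=2,
    disagg_ib_device="mlx5_0",
)

# Each inner list holds the GPU IDs for one instance of that role.
launch_pool_disagg_server(
    server_args,
    encoder_gpus=[[0]],                          # one encoder on GPU 0
    denoiser_gpus=[[1, 2, 3, 4], [5, 6, 7, 8]],  # two 4-GPU denoisers (IDs up to 8 need 9 GPUs)
    decoder_gpus=[[0]],                          # one decoder, sharing GPU 0
)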
```

## Architecture

```
Client ─── HTTP (port 30000) ──► FastAPI Server
                                      │
                                      ▼
                              DiffusionServer (ROUTER, scheduler_port)
                              ┌───────┼───────┐
                   PUSH work  │       │       │  PUSH work
                              ▼       │       ▼
                    Encoder[0..N]     │    Decoder[0..K]
                              │       │       ▲
                   P2P tensor │       │       │ P2P tensor
                   transfer   ▼       │       │ transfer
                          Denoiser[0..M] ─────┘
                                      │
                    PULL results ◄────┘  (decoder → DS → client)
```

### Request State Machine

```
PENDING → ENCODER_WAITING → ENCODER_RUNNING → ENCODER_DONE
                                                    │
                        DENOISING_WAITING → DENOISING_RUNNING → DENOISING_DONE
                                                                       │
                                    DECODER_WAITING → DECODER_RUNNING → DONE
```

Any state can transition to `FAILED` or `TIMED_OUT`.
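
The diagram can be expressed as a small transition check. This is a sketch of the state machine as drawn, with names chosen for illustration. It additionally assumes that `DONE`, `FAILED`, and `TIMED_OUT` are terminal, which the diagram implies but does not state.

```python
from enum import Enum, auto


class RequestState(Enum):
    PENDING = auto()
    ENCODER_WAITING = auto()
    ENCODER_RUNNING = auto()
    ENCODER_DONE = auto()
    DENOISING_WAITING = auto()
    DENOISING_RUNNING = auto()
    DENOISING_DONE = auto()
    DECODER_WAITING = auto()
    DECODER_RUNNING = auto()
    DONE = auto()
    FAILED = auto()
    TIMED_OUT = auto()


S = RequestState
# Normal progression, in the order shown in the diagram.
HAPPY_PATH = [
    S.PENDING, S.ENCODER_WAITING, S.ENCODER_RUNNING, S.ENCODER_DONE,
    S.DENOISING_WAITING, S.DENOISING_RUNNING, S.DENOISING_DONE,
    S.DECODER_WAITING, S.DECODER_RUNNING, S.DONE,
]
ERROR_STATES = {S.FAILED, S.TIMED_OUT}
TERMINAL = ERROR_STATES | {S.DONE}  # assumption: no transitions leave these


def can_transition(src, dst):
    """True if src -> dst is the next happy-path step or a move to an error state."""
    if src in TERMINAL:
        return False
    if dst in ERROR_STATES:
        return True
    return HAPPY_PATH[HAPPY_PATH.index(src) + 1] is dst


print(can_transition(S.ENCODER_DONE, S.DENOISING_WAITING))
print(can_transition(S.PENDING, S.DONE))
```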
