# SANA-WM

## 1. Model Introduction

[SANA-WM](https://huggingface.co/Efficient-Large-Model/SANA-WM_bidirectional) is an efficient open-source **world model** from NVlabs, trained natively for one-minute video generation. It is a **2.6B-parameter text+image-to-video (TI2V) diffusion transformer** that synthesizes **720p, minute-scale videos with precise 6-DoF camera control**, paired with an **LTX-2 refiner** for high-fidelity decoding. It builds on the [SANA](https://github.com/NVlabs/Sana) family of efficient high-resolution synthesis models based on a linear diffusion transformer.

SANA-WM ships in two checkpoints: a **bidirectional** checkpoint (dense, one-shot) and a **streaming** checkpoint (chunk-causal and autoregressive: it generates chunk-by-chunk and reuses causal DiT state across chunks, so memory stays bounded and clips can run long or even open-ended). Every mode takes a single first frame, a text prompt, and a camera trajectory as input. This cookbook covers **all three serving modes** SGLang exposes:

* **(A) Dense bidirectional** (§4) — the `SANA-WM_bidirectional` checkpoint generated in one shot (no chunking) via **`SanaWMTwoStagePipeline`** over the standard **`/v1/videos`** HTTP API. Highest single-clip quality (full bidirectional attention + dense LTX-2 refiner); matches the NVlabs dense reference.
* **(B) Batch streaming** (§5) — the `SANA-WM_streaming` checkpoint generated chunk-by-chunk in one request via the same **`SanaWMTwoStagePipeline`** + `--streaming` over **`/v1/videos`**. This is SGLang's offline chunk-causal streaming path: the whole clip is produced chunk-by-chunk internally, then returned.
* **(C) Live realtime** (§6–7) — the streaming pipeline exposed as **`SanaWMRealtimePipeline`** over a **WebSocket API** at `/v1/realtime_video/generate`, so a browser/client streams camera-action events frame-by-frame and receives video chunks back in real time. Realtime uses the same streaming checkpoint, but the incremental session path is not bit-identical to offline batch streaming.

All three modes share the camera action DSL (§8) and the configuration knobs (§9). Modes (B) and (C) share the streaming checkpoint and the chunk-causal pipeline.

**Key features** (per the official model):

* **Hybrid Linear Attention** — frame-wise Gated DeltaNet (GDN) recurrent blocks combined with softmax attention (every 4th layer, block indices {3,7,11,15,19}) for memory-efficient long-context modeling.
* **Dual-Branch Camera Control** — independent main and camera branches (UCPE + PRoPE) for precise per-frame 6-DoF trajectory adherence.
* **Two-Stage Pipeline** — an LTX-2 long-video refiner on top of Stage-1 latents for quality and temporal consistency.

In the **streaming / realtime** configuration this becomes a low-latency, interactive pipeline:

* **Stage-1 chunk-causal DiT** — the streaming path carries a **per-block KV cache** (recurrent GDN state + a softmax K/V window) across chunks; bounded memory means it scales to long / endless sequences. Stage-1 is intentionally coarse.
* **LTX-2 streaming refiner** — refines each Stage-1 latent chunk block-by-block with a **sink + sliding-history KV cache** (required for sharp output).
* **Causal LTX-2 VAE** — decodes latents chunk-by-chunk with a carried conv-cache for seam-free frames.
* **Camera control** — drive the camera with a compact **WASD/IJKL** action DSL (move with WASD, look with IJKL; see §8) — supplied at request time on the `/v1/videos` paths, or pushed over the WebSocket at init / as live per-chunk events on the realtime path (see §7).

**Architecture & components**

| Component   | Value                                                               |
| ----------- | ------------------------------------------------------------------- |
| Stage-1 DiT | 2.6B; 20 layers, hidden 2240, 20 heads (head\_dim 112); \~10 GB     |
| Attention   | frame-wise Gated DeltaNet + softmax every 4th block (hybrid linear) |
| Camera      | dual-branch, UCPE + PRoPE (raymap + Plücker), 6-DoF                 |
| VAE         | LTX-2 causal, strides (T, H, W) = (8, 32, 32); \~2 GB               |
| Refiner     | LTX-2 Stage-2 distilled; \~41 GB                                    |
| Output      | up to 720p (704×1280) @ 16 fps, minute-scale                        |
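The layer layout and head size quoted in the table can be checked with a few lines of arithmetic. This is only a sketch: it assumes 0-based block indices, and the variable names are illustrative rather than SGLang identifiers.

```python
NUM_LAYERS = 20   # Stage-1 DiT depth, from the table above
HIDDEN = 2240     # hidden size
NUM_HEADS = 20    # attention heads

# Every 4th block (0-based indices 3, 7, ...) uses softmax attention;
# the remaining blocks are frame-wise Gated DeltaNet (GDN).
softmax_blocks = [i for i in range(NUM_LAYERS) if i % 4 == 3]
gdn_blocks = [i for i in range(NUM_LAYERS) if i % 4 != 3]
head_dim = HIDDEN // NUM_HEADS

print(softmax_blocks)   # [3, 7, 11, 15, 19]
print(len(gdn_blocks))  # 15
print(head_dim)         # 112
```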

For more details, see the [SANA-WM paper (arXiv)](https://arxiv.org/abs/2605.15178), the [SANA project page](https://nvlabs.github.io/Sana/), the [NVlabs/Sana GitHub](https://github.com/NVlabs/Sana), and the [SANA-WM\_bidirectional model card](https://huggingface.co/Efficient-Large-Model/SANA-WM_bidirectional) (Apache-2.0).

## 2. Installation

SGLang-diffusion offers multiple installation methods depending on your hardware platform. Please refer to the [SGLang Diffusion installation guide](../../../docs/sglang-diffusion/installation).

SANA-WM adds the `SanaWMTransformer3DModel` + GDN kernels, the `SanaWMTwoStagePipeline` (dense bidirectional + chunk-causal streaming), and the `SanaWMRealtimePipeline` with the `/v1/realtime_video` WebSocket router. Use `sglang serve` to launch the diffusion server.

## 3. Model Setup

Both SANA-WM checkpoints are **public** (Apache-2.0, no gating, no token) and load **directly** — there is no manual assembly step. Pass the HuggingFace repo id to `--model-path` and SGLang downloads, materializes, validates, and loads it:

| Mode                                 | `--model-path`                                |
| ------------------------------------ | --------------------------------------------- |
| Dense bidirectional (§4)             | `Efficient-Large-Model/SANA-WM_bidirectional` |
| Batch streaming (§5) / realtime (§6) | `Efficient-Large-Model/SANA-WM_streaming`     |

Both repo ids are registered in SGLang's **built-in model-overlay registry**, so on first load the overlay transparently materializes the official release into a runnable Diffusers directory — for the streaming checkpoint this converts the DMD self-forcing checkpoint (`sana_dit/model.pt`) into a Diffusers `transformer/` and wires the LTX-2 causal VAE, the LTX-2 refiner, and the Gemma encoders. No environment variable or `build_model_dir.sh` step is needed. (You may also pass a local, already-materialized Diffusers directory.)

The materialized checkpoint is a Diffusers directory whose `model_index.json` declares the loadable components:

| Component (`model_index.json`) | Class                                       |
| ------------------------------ | ------------------------------------------- |
| `transformer` (Stage-1 DiT)    | `diffusers.SanaWMTransformer3DModel`        |
| `vae`                          | `diffusers.AutoencoderKLCausalLTX2Video`    |
| `text_encoder`                 | `transformers.Gemma2Model`                  |
| `tokenizer`                    | `transformers.GemmaTokenizer`               |
| `scheduler`                    | `diffusers.FlowMatchEulerDiscreteScheduler` |

How loading works:

* The server resolves the checkpoint via `maybe_download_model(model_path, force_diffusers_model=True)` and verifies it contains a `model_index.json` plus the required component subdirectories (`transformer/`, `vae/`).
* If `text_encoder` / `tokenizer` are not provided as component paths, the pipeline falls back to the default Stage-1 text encoder **`Efficient-Large-Model/gemma-2-2b-it`** (`DEFAULT_SANA_WM_TEXT_ENCODER`).
* **Pick the path with `--pipeline-class-name`.** The checkpoint's `model_index.json` `_class_name` selects the default pipeline (`SanaWMTwoStagePipeline`). Pin it explicitly to choose: `--pipeline-class-name SanaWMTwoStagePipeline` for the `/v1/videos` paths (§4–5) or `--pipeline-class-name SanaWMRealtimePipeline` for live realtime (§6). Pinning is also required if you point `--model-path` at a bare safetensors file instead of a Diffusers directory.
* **The Stage-2 LTX-2 refiner** lives under `refiner/` in the checkpoint: `refiner/transformer` (`transformer_2`), `refiner/connectors` (`connectors`), and `refiner/text_encoder` (the Gemma-3 encoder for `text_encoder_2`, whose tokenizer also serves as `tokenizer_2`). The refiner is **optional**: it is skipped (Stage-1-only output) when the env flag `SGLANG_SANA_WM_SKIP_REFINER` (or a `skip_refiner` request extra) is set, or when no `refiner/` is present (`transformer_2` unloaded). On the batch path it runs chunk-wise with `--refiner-chunked` (the official streaming path, default on) or whole-clip without it; on the realtime path the pipeline builds a `SanaWMChunkedRefinerChainStage` only when a refiner is available, and otherwise streams Stage-1 frames.
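When you pass a local, already-materialized directory, a quick pre-flight check can catch a missing component before launching the server. The sketch below is a hypothetical client-side helper, not SGLang's own validation code; it only mirrors the requirements listed above (`model_index.json`, `transformer/`, `vae/`, and the optional `refiner/`).

```python
import json
from pathlib import Path

REQUIRED_SUBDIRS = ("transformer", "vae")  # per the loading notes above


def check_checkpoint_dir(path):
    """Return a list of problems found in a local Diffusers-style directory."""
    root = Path(path)
    problems = []
    index_file = root / "model_index.json"
    if not index_file.is_file():
        problems.append("missing model_index.json")
    else:
        index = json.loads(index_file.read_text())
        if "_class_name" not in index:
            problems.append("model_index.json has no _class_name")
    for name in REQUIRED_SUBDIRS:
        if not (root / name).is_dir():
            problems.append(f"missing {name}/")
    return problems


def has_refiner(path):
    """True when the optional Stage-2 refiner/transformer directory is present."""
    return (Path(path) / "refiner" / "transformer").is_dir()
```

An empty list from `check_checkpoint_dir` means the basic layout looks right; `has_refiner` returning `False` suggests the server would produce Stage-1-only output.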

<Note>
  Throughout this cookbook, `<checkpoint>` stands for the appropriate SANA-WM repo id from the table above (or a local materialized Diffusers directory).
</Note>

## 4. Dense bidirectional (offline `/v1/videos`)

The **bidirectional** checkpoint generates the whole clip in **one shot** (full bidirectional attention, not chunked) followed by a dense LTX-2 refiner — the highest single-clip quality, matching the NVlabs dense reference.

Launch with the two-stage pipeline and **no** `--streaming` flag (dense is the default — `streaming` defaults to `False`):

```bash Command theme={null}
sglang serve \
  --model-path Efficient-Large-Model/SANA-WM_bidirectional \
  --pipeline-class-name SanaWMTwoStagePipeline \
  --host 127.0.0.1 --port 30000
```

Then POST to **`/v1/videos`** exactly as in §5, but pass the NVlabs dense sampling defaults for closest parity. The dense path runs a full multi-step CFG schedule rather than the distilled few-step schedule used for streaming:

```bash Command theme={null}
curl -s http://127.0.0.1:30000/v1/videos \
  -H 'content-type: application/json' -d '{
    "prompt": "a camera moving forward and turning left",
    "input_reference": "/path/to/first_frame.png",
    "num_frames": 321,
    "seed": 42,
    "fps": 16,
    "num_inference_steps": 60,
    "guidance_scale": 5.0,
    "diffusers_kwargs": {
      "action": "w-80,wl-80,l-80,wj-80",
      "intrinsics": "/path/to/intrinsics.npy"
    }
  }'
```

* `num_inference_steps` / `guidance_scale` — the dense path uses CFG; NVlabs' reference defaults to **60 steps, guidance 5.0** (the `SanaWMSamplingParams` defaults are the lighter 20 / 4.5 — pass 60 / 5.0 explicitly for dense parity).
* The dense refiner drops the leading sink frame, so a `num_frames=321` request yields 320 output frames.
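The same request can be issued from Python. The sketch below uses only the standard library and the fields shown in the `curl` example; the helper names `build_video_request` and `post_video` are hypothetical, and the shape of the JSON response is not assumed beyond it being JSON.

```python
import json
import urllib.request


def build_video_request(prompt, first_frame, action, intrinsics,
                        num_frames=321, seed=42, dense=False):
    """Build a /v1/videos JSON body; dense=True adds the 60-step / 5.0 CFG settings."""
    body = {
        "prompt": prompt,
        "input_reference": first_frame,
        "num_frames": num_frames,
        "seed": seed,
        "fps": 16,  # SANA-WM's native frame rate
        "diffusers_kwargs": {"action": action, "intrinsics": intrinsics},
    }
    if dense:
        # NVlabs dense reference defaults, passed explicitly for parity.
        body["num_inference_steps"] = 60
        body["guidance_scale"] = 5.0
    return body


def post_video(body, base_url="http://127.0.0.1:30000"):
    """POST the body to /v1/videos and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/videos",
        data=json.dumps(body).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires a running server):
# body = build_video_request("a camera moving forward and turning left",
#                            "/path/to/first_frame.png",
#                            "w-80,wl-80,l-80,wj-80",
#                            "/path/to/intrinsics.npy", dense=True)
# print(post_video(body))
```

Leaving `dense=False` omits the two sampling fields, so the server's own defaults apply.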

## 5. Batch streaming (offline `/v1/videos`)

The **streaming** checkpoint generates a **full camera-controlled clip in one request**, with no WebSocket involved. This is SGLang's offline streaming path: the whole clip is generated chunk-by-chunk internally, refined, decoded, and returned as one video.

Launch with the two-stage pipeline + the streaming flags:

```bash Command theme={null}
sglang serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMTwoStagePipeline \
  --streaming --refiner-chunked \
  --host 127.0.0.1 --port 30000
```

* `--streaming` — chunk-causal `forward_long` Stage-1 (vs the dense one-shot path of §4).
* `--refiner-chunked` — chunk-wise streaming LTX-2 refiner (**on by default**). To use the whole-clip dense refiner instead (also valid, higher peak memory), pass `--refiner-chunked false` — simply omitting the flag keeps the default chunked refiner.
* `--num-frame-per-block N` — latent frames per chunk (default `3`).

Then POST to **`/v1/videos`** (JSON body shown below; multipart/form-data with an uploaded `input_reference` file also works). Camera control goes in `diffusers_kwargs` — the action-DSL string (§8) and the intrinsics:

```bash Command theme={null}
curl -s http://127.0.0.1:30000/v1/videos \
  -H 'content-type: application/json' -d '{
    "prompt": "a camera moving forward and turning left",
    "input_reference": "/path/to/first_frame.png",
    "num_frames": 321,
    "seed": 42,
    "fps": 16,
    "diffusers_kwargs": {
      "action": "w-80,wl-80,l-80,wj-80",
      "intrinsics": "/path/to/intrinsics.npy"
    }
  }'
```

| Field                         | Notes                                                                                                                                                                                                                 |
| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `prompt`                      | text prompt                                                                                                                                                                                                           |
| `input_reference`             | first-frame image — a server-side path, or (multipart) an uploaded file. For an `http(s)://` URL in a JSON body, use the separate `reference_url` field (the server downloads it and assigns it to `input_reference`) |
| `num_frames`                  | total pixel frames (e.g. `321` → 41 latent frames, 13 chunks; output 704×1280)                                                                                                                                        |
| `seed`                        | RNG seed (default `42`)                                                                                                                                                                                               |
| `fps`                         | output frame rate — **pass `16`** (SANA-WM's native rate). The generic `/v1/videos` default is `24`, which would encode the same frames at 24 fps and make the clip play \~33% shorter (16/24 of the duration)        |
| `diffusers_kwargs.action`     | camera action-DSL string (§8)                                                                                                                                                                                         |
| `diffusers_kwargs.intrinsics` | path to a camera-intrinsics `.npy` (per-frame `(T,3,3)`) or an inline 3×3 / `(T,3,3)` list                                                                                                                            |
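The `num_frames` and `fps` notes in the table reduce to simple arithmetic. The sketch below assumes the usual causal-VAE convention in which the first pixel frame maps to its own latent frame and every following group of 8 frames maps to one more, which reproduces the `321` → 41 figure; the helper names are illustrative.

```python
VAE_STRIDES = (8, 32, 32)  # (T, H, W), from the components table in §1


def latent_shape(num_frames, height=704, width=1280):
    """Latent (T, H, W) for a clip, assuming the first frame gets its own latent frame."""
    t, h, w = VAE_STRIDES
    if (num_frames - 1) % t or height % h or width % w:
        # This sketch only handles sizes that divide evenly.
        raise ValueError("size is not aligned to the VAE strides")
    return ((num_frames - 1) // t + 1, height // h, width // w)


def clip_seconds(num_frames, fps):
    """Playback duration of the encoded clip."""
    return num_frames / fps


print(latent_shape(321))       # (41, 22, 40)
print(clip_seconds(321, 16))   # 20.0625
print(clip_seconds(321, 24))   # 13.375
```

The last two lines show why `fps` matters: the same 321 frames play for about 20 s at 16 fps but only about 13.4 s at the generic 24 fps default.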

The response is a `VideoResponse`; fetch the rendered MP4 via the returned reference or `GET /v1/videos/{id}/content`. The streaming hyperparameters (`num_frame_per_block`, `denoising_step_list`, `sink_size`, `num_cached_blocks`, `streaming_cfg_scale`) are **pipeline-config** defaults on `SanaWMPipelineConfig`, not request fields — see §9.

## 6. Launch the Realtime Server

Launch with the realtime pipeline **pinned** — the checkpoint defaults to `SanaWMTwoStagePipeline`, so realtime must be selected explicitly (see §3). The `/v1/realtime_video` router is always mounted and becomes functional once the realtime config is active, because `SanaWMRealtimeConfig` has a registered realtime adapter (`SanaWMRealtimeAdapter`).

```bash Command theme={null}
sglang serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --host 127.0.0.1 --port 30000
```

Common launch variants:

```bash Command theme={null}
# recommended multi-GPU realtime profile
sglang serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --num-gpus 8 --sp-degree 8 \
  --host 127.0.0.1 --port 30000

# single GPU
sglang serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --num-gpus 1 --host 127.0.0.1 --port 30000

# offload DiT + text encoder to CPU (tight VRAM)
sglang serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --host 127.0.0.1 --port 30000 \
  --dit-cpu-offload --text-encoder-cpu-offload
```

Notes on launch behavior:

* **Default endpoint** is `127.0.0.1:30000` (`--host` / `--port` override).
* **CPU offload flags are optional.** `--dit-cpu-offload`, `--text-encoder-cpu-offload`, and `--image-encoder-cpu-offload` are available; defaults are auto-adjusted from GPU memory (GPUs under 30 GB get more aggressive offloading).
* **Multi-GPU realtime.** Prefer explicit sequence parallelism (`--sp-degree` equal to the number of GPUs for a single session). Do not enable CFG parallel for the realtime profile: the default realtime request uses `guidance_scale=1.0`, while CFG parallel requires active cond/uncond branches.
* **FSDP.** Use `--use-fsdp-inference` only when you specifically need weight sharding for memory. For the low-latency realtime profile, prefer keeping components resident and using SP first.
* **Warmup.** Server warmup is **automatically skipped** for the realtime pipeline — a synthetic warmup request has no WebSocket session, so the server detects the registered realtime adapter and skips it. No `--warmup` flag is needed.

Once up, the realtime WebSocket endpoint lives at `ws://127.0.0.1:30000/v1/realtime_video/generate`. Connect with the Python client in §7; the plain `curl` commands used for `/v1/videos` do not perform the WebSocket upgrade.

## 7. Realtime WebSocket API

The realtime API is a single WebSocket at **`/v1/realtime_video/generate`**. All messages — client → server and server → client — are **msgpack** (`msgspec.msgpack.encode` / `decode`), not JSON.

The lifecycle is:

<Steps>
  <Step title="Connect & send INIT">
    The client opens the WebSocket and sends exactly one **init** message (`type: "init"`), carrying the prompt, the required `first_frame`, output/sampling options, and optional camera conditions in `condition_inputs`.
  </Step>

  <Step title="Stream live EVENTs (optional)">
    While generation runs, the client may push **event** messages (`type: "event"`) to steer the camera — either `kind: "camera_actions"` (frame-by-frame lists or state transitions) or `kind: "action"` (an action-DSL string).
  </Step>

  <Step title="Receive frame batches">
    The server streams **frame batches** back. Each chunk arrives as one or more `frame_batch` messages (header fields + payload bytes); `is_final_frame_batch: true` marks the end of a chunk. The server also emits `chunk_stats` timing messages.
  </Step>
</Steps>

### INIT message

`RealtimeVideoGenerationsRequest` (`type` is the literal `"init"`). Key fields:

| Field                                 | Type                            | Notes                                                                                                                                                                                                                                                                                                                                |
| ------------------------------------- | ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `type`                                | `"init"`                        | Required literal                                                                                                                                                                                                                                                                                                                     |
| `prompt`                              | str                             | Text prompt                                                                                                                                                                                                                                                                                                                          |
| `first_frame`                         | bytes \| str                    | **Required by the SANA-WM adapter** (`on_init` raises if absent), though the generic request schema defines it as optional. Raw image bytes, a server-side path, or an `http(s)://` URL (downloaded & cached)                                                                                                                        |
| `condition_inputs`                    | dict                            | Camera/conditioning inputs (see below)                                                                                                                                                                                                                                                                                               |
| `num_frames`                          | int                             | Total frames to generate. **Omit it for an open-ended, continuous session** — the adapter leaves `num_frames` unset and flags an open-ended run (`condition_inputs["sana_wm_open_ended"] = True`), generating uniform chunks indefinitely (until `max_chunks` or the client disconnects). Provide an integer for a fixed-length clip |
| `seed`                                | int                             | RNG seed (default `42`)                                                                                                                                                                                                                                                                                                              |
| `size`                                | str                             | `"WIDTHxHEIGHT"`; realtime requests default to `"832x480"` for latency. Pass `"1280x704"` for the native landscape resolution                                                                                                                                                                                                        |
| `max_chunks`                          | int                             | Optional cap on total chunks generated                                                                                                                                                                                                                                                                                               |
| `num_inference_steps`                 | int                             | Default `4` for SANA-WM (realtime adapter)                                                                                                                                                                                                                                                                                           |
| `guidance_scale`                      | float                           | Default `1.0`                                                                                                                                                                                                                                                                                                                        |
| `realtime_output_format`              | `"raw"` \| `"webp"` \| `"jpeg"` | Frame encoding for output (see below)                                                                                                                                                                                                                                                                                                |
| `realtime_causal_sink_size`           | int                             | Optional override                                                                                                                                                                                                                                                                                                                    |
| `realtime_causal_kv_cache_num_frames` | int                             | Optional override                                                                                                                                                                                                                                                                                                                    |

`condition_inputs` accepts (all optional; pass **only one** of `action` / `camera_actions`):

| Key               | Type                                                       | Meaning                                                                                                              |
| ----------------- | ---------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| `camera_actions`  | `list[list[str]]` or `{mode: "state", transitions: [...]}` | Frame-by-frame camera actions, or state-based transitions                                                            |
| `action`          | str                                                        | Action-DSL string, e.g. `"w-10,none-5,a-8"` (see §8)                                                                 |
| `intrinsics_path` | str                                                        | Server-side path to a camera-intrinsics **`.npy`** file (loaded via `np.load`; shapes `(4,)`, `(3,3)`, or `(F,3,3)`) |
| `intrinsics`      | list                                                       | Inline intrinsics with shape `(4,)`, `(3,3)`, `(F,4)`, or `(F,3,3)`                                                  |

If you omit both `intrinsics_path` and `intrinsics`, SGLang uses a centered heuristic intrinsic matrix derived from the first-frame size. Pass explicit intrinsics when you need closer camera parity with a prepared trajectory.
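The four inline shapes can be normalized to a per-frame list of 3×3 matrices on the client before sending. This is a hypothetical helper, not SGLang's loader, and it assumes that a 4-vector is ordered `(fx, fy, cx, cy)`, which the text above does not state.

```python
def _depth(x):
    """Nesting depth of a list/tuple (0 for a scalar or an empty list)."""
    d = 0
    while isinstance(x, (list, tuple)) and x:
        d += 1
        x = x[0]
    return d


def _from_vec4(v):
    # Assumption: a 4-vector is ordered (fx, fy, cx, cy).
    fx, fy, cx, cy = v
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


def normalize_intrinsics(intrinsics, num_frames):
    """Return num_frames 3x3 matrices from (4,), (3,3), (F,4), or (F,3,3) input."""
    depth = _depth(intrinsics)
    if depth == 1 and len(intrinsics) == 4:  # (4,): broadcast to all frames
        return [_from_vec4(intrinsics) for _ in range(num_frames)]
    if depth == 2 and len(intrinsics) == 3 and all(len(r) == 3 for r in intrinsics):
        return [[list(r) for r in intrinsics] for _ in range(num_frames)]  # (3,3)
    if depth == 2 and all(len(r) == 4 for r in intrinsics):  # (F,4)
        per_frame = [_from_vec4(r) for r in intrinsics]
    elif depth == 3 and all(
        len(m) == 3 and all(len(r) == 3 for r in m) for m in intrinsics
    ):  # (F,3,3)
        per_frame = [[list(r) for r in m] for m in intrinsics]
    else:
        raise ValueError("unsupported intrinsics shape")
    if len(per_frame) != num_frames:
        raise ValueError("per-frame intrinsics length does not match num_frames")
    return per_frame
```

For example, `normalize_intrinsics([500.0, 500.0, 640.0, 352.0], 2)` yields two identical 3×3 matrices under the stated ordering assumption.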

```json INIT (msgpack dict) — open-ended (omit num_frames) theme={null}
{
  "type": "init",
  "prompt": "beautiful landscape video",
  "first_frame": "<bytes or url>",
  "size": "832x480",
  "seed": 42,
  "max_chunks": 10,
  "realtime_output_format": "raw",
  "num_inference_steps": 4,
  "guidance_scale": 1.0,
  "condition_inputs": {
    "camera_actions": [["w"], [], ["a", "s"]],
    "intrinsics_path": "/path/to/intrinsics.npy"
  }
}
```
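A small builder can enforce the two constraints described above before anything is sent: `first_frame` must be present, and only one of `action` / `camera_actions` may be set. The function `build_init` is a hypothetical client-side helper; it returns a plain dict that the caller would then encode with msgpack.

```python
def build_init(prompt, first_frame, *, action=None, camera_actions=None,
               intrinsics_path=None, num_frames=None, max_chunks=None,
               size="832x480", seed=42, output_format="raw"):
    """Build an INIT dict; leave num_frames as None for an open-ended session."""
    if first_frame is None:
        raise ValueError("first_frame is required by the SANA-WM adapter")
    if action is not None and camera_actions is not None:
        raise ValueError("pass only one of action / camera_actions")

    condition_inputs = {}
    if action is not None:
        condition_inputs["action"] = action
    if camera_actions is not None:
        condition_inputs["camera_actions"] = camera_actions
    if intrinsics_path is not None:
        condition_inputs["intrinsics_path"] = intrinsics_path

    msg = {
        "type": "init",
        "prompt": prompt,
        "first_frame": first_frame,
        "size": size,
        "seed": seed,
        "realtime_output_format": output_format,
        "condition_inputs": condition_inputs,
    }
    # Optional fields are omitted entirely rather than sent as null.
    if num_frames is not None:
        msg["num_frames"] = num_frames
    if max_chunks is not None:
        msg["max_chunks"] = max_chunks
    return msg
```

Calling `build_init("beautiful landscape video", frame_bytes, camera_actions=[["w"], [], ["a", "s"]], max_chunks=10)` produces a dict with the same open-ended shape as the example above, minus the sampling overrides.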

### Live EVENT messages

`RealtimeEvent` (`type: "event"`). Use `kind` + `payload` (optional `event_id` correlates the response back to this event).

```json EVENT - camera_actions (frame-by-frame list[list[str]]) theme={null}
{
  "type": "event",
  "kind": "camera_actions",
  "event_id": 1,
  "payload": [["w"], ["w"], ["a"], []]
}
```

```json EVENT - camera_actions (state-based transitions) theme={null}
{
  "type": "event",
  "kind": "camera_actions",
  "event_id": 2,
  "payload": {
    "mode": "state",
    "transitions": [
      {"actions": ["w"], "client_ts_ms": 1000},
      {"actions": ["a", "w"], "client_ts_ms": 1500}
    ]
  }
}
```

```json EVENT - action (DSL string) theme={null}
{
  "type": "event",
  "kind": "action",
  "event_id": 3,
  "payload": "w-10,none-5,a-8,d-10"
}
```

### Server frame output

The server streams **frame batches**. Every batch arrives as a **single** msgpack message with `type: "frame_batch"` — the header fields below plus an inline `payload` bytes field (the wire `type` is always `"frame_batch"`; there is no separate header-then-bytes message).

Header fields:

| Field                                    | Meaning                                                                                    |
| ---------------------------------------- | ------------------------------------------------------------------------------------------ |
| `type`                                   | `"frame_batch"` (always)                                                                   |
| `request_id`                             | Generation id                                                                              |
| `chunk_index`                            | Chunk index                                                                                |
| `content_type`                           | `application/x-raw-rgb`, `application/x-raw-rgb-delta-gzip`, `image/webp`, or `image/jpeg` |
| `num_frames`                             | Frames in this batch                                                                       |
| `total_size`                             | Payload size in bytes (`len(payload)` — the compressed size for delta-gzip)                |
| `width`, `height`, `channels`            | Frame geometry (`channels: 3`)                                                             |
| `bytes_per_frame`                        | Bytes per uncompressed frame (`width*height*3`)                                            |
| `format`                                 | `rgb24` for raw                                                                            |
| `encoding`                               | `raw`, `delta-gzip`, `webp`, or `jpeg`                                                     |
| `delta_reference`                        | `previous-frame` (present for delta-gzip)                                                  |
| `event_id`                               | Echoes the steering event id; **omitted** from the header for INIT-only chunks             |
| `frame_batch_index`, `num_frame_batches` | Sequence multiple batches within a chunk                                                   |
| `is_final_frame_batch`                   | `true` ends the chunk                                                                      |

```json Server output - frame_batch (msgpack dict) theme={null}
{
  "type": "frame_batch",
  "request_id": "uuid-string",
  "chunk_index": 0,
  "content_type": "application/x-raw-rgb-delta-gzip",
  "num_frames": 3,
  "total_size": 1048576,
  "width": 1280,
  "height": 704,
  "channels": 3,
  "bytes_per_frame": 2703360,
  "format": "rgb24",
  "encoding": "delta-gzip",
  "delta_reference": "previous-frame",
  "event_id": 1,
  "frame_batch_index": 0,
  "num_frame_batches": 1,
  "is_final_frame_batch": true,
  "payload": "<gzip-compressed bytes>"
}
```
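Because a chunk may span several `frame_batch` messages, a client usually buffers batches until `is_final_frame_batch` arrives. The sketch below is one way to do that with the header fields listed above. `ChunkAssembler` and `raw_size_matches` are hypothetical names, and the code assumes the final batch of a chunk is the last one delivered for it.

```python
class ChunkAssembler:
    """Collect frame_batch messages until a chunk's final batch arrives."""

    def __init__(self):
        self._pending = {}  # chunk_index -> list of (frame_batch_index, payload)

    def add(self, msg):
        """Return (chunk_index, ordered payloads) when a chunk completes, else None."""
        if msg.get("type") != "frame_batch":
            return None  # e.g. chunk_stats
        chunk = msg["chunk_index"]
        batches = self._pending.setdefault(chunk, [])
        batches.append((msg.get("frame_batch_index", 0), msg["payload"]))
        if not msg.get("is_final_frame_batch"):
            return None
        del self._pending[chunk]
        batches.sort(key=lambda item: item[0])
        expected = msg.get("num_frame_batches", len(batches))
        if len(batches) != expected:
            raise ValueError(f"chunk {chunk}: got {len(batches)} of {expected} batches")
        return chunk, [payload for _, payload in batches]


def raw_size_matches(msg):
    """For uncompressed RGB24, total_size should equal num_frames * bytes_per_frame."""
    return msg["total_size"] == msg["num_frames"] * msg["bytes_per_frame"]
```

`raw_size_matches` only applies to `encoding: "raw"`; for delta-gzip, `total_size` is the compressed length and will normally be smaller.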

**Encodings.** `application/x-raw-rgb` is uncompressed RGB24 (3 × uint8, `bytes_per_frame = width*height*3`). `application/x-raw-rgb-delta-gzip` is the zlib-compressed **per-frame XOR delta** against the preceding frame (each frame in the batch is XOR'd against the previous one; sent by default). `realtime_output_format: "raw"` forces uncompressed RGB; `"webp"` / `"jpeg"` send preview-encoded frames.

<Note>
  delta-gzip must be restored **frame-by-frame**: decompress the payload, then for each frame XOR it against the already-restored previous frame (the first frame of a batch references the last frame of the previous batch). See `restore_delta_gzip_raw_rgb_payload` in `runtime/utils/realtime_video.py`. The `"raw"` format below avoids this.
</Note>

### Minimal client example

```python Python theme={null}
import asyncio  # needed for asyncio.run(run()) at the bottom

import msgspec
import numpy as np
import websockets  # pip install websockets

WS_URL = "ws://127.0.0.1:30000/v1/realtime_video/generate"

async def run():
    async with websockets.connect(WS_URL, max_size=None) as ws:
        # 1) INIT — omit num_frames for an open-ended session; "raw" = uncompressed RGB24
        with open("first_frame.png", "rb") as f:
            first_frame = f.read()
        await ws.send(msgspec.msgpack.encode({
            "type": "init",
            "prompt": "a camera moving forward and turning right",
            "first_frame": first_frame,
            "size": "832x480",
            "seed": 42,
            "max_chunks": 10,
            "realtime_output_format": "raw",
            "num_inference_steps": 4,
            "guidance_scale": 1.0,
            "condition_inputs": {
                "action": "w-100,wd-50,d-30",
                "intrinsics_path": "/path/to/intrinsics.npy",  # optional; centered heuristic if omitted
            },
        }))

        # 2) optional: steer mid-stream
        await ws.send(msgspec.msgpack.encode({
            "type": "event",
            "kind": "camera_actions",
            "event_id": 1,
            "payload": [["w"], ["w"], ["a"], []],
        }))

        # 3) receive frame batches (raw RGB24)
        async for message in ws:
            msg = msgspec.msgpack.decode(message)
            if msg.get("type") != "frame_batch":
                continue  # skip chunk_stats etc.
            n, h, w, c = msg["num_frames"], msg["height"], msg["width"], msg["channels"]
            frames = np.frombuffer(msg["payload"], dtype=np.uint8).reshape(n, h, w, c)
            # ... display/save frames ...
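            # Stop after the final batch of the last requested chunk
            # (max_chunks=10 above, chunk_index counted from 0).
            if msg.get("is_final_frame_batch") and msg.get("chunk_index", 0) >= 9:
                break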

# asyncio.run(run())
```

## 8. Camera Action DSL

Camera trajectories are described by a compact string of comma-separated `<keys>-<frames>` segments, e.g. `"w-100,wd-50,d-30,none-10"`. This is the format accepted by `condition_inputs.action` at init and by `kind: "action"` events.

Parsing rules (`parse_action_string`):

* Each segment is `<keys>-<frames>`; `<frames>` must be a positive integer.
* `none` means no motion for that span: `none-10` = 10 static frames.
* Keys are case-insensitive; combined keys apply simultaneously (`wd` = forward + right strafe). Allowed keys are exactly `wasdijkl`.

| Key       | Motion                  |
| --------- | ----------------------- |
| `w` / `s` | move forward / backward |
| `a` / `d` | strafe left / right     |
| `i` / `k` | look (pitch) up / down  |
| `j` / `l` | look (yaw) left / right |
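
The parsing rules above can be illustrated with a small standalone parser. This is a sketch under stated assumptions, not the actual `parse_action_string`: the name `parse_action_dsl`, the per-frame `frozenset` output, and the exact error behavior are all hypothetical.

```python
import re

ALLOWED_KEYS = set("wasdijkl")
_SEGMENT = re.compile(r"^([a-z]+)-(\d+)$")


def parse_action_dsl(action):
    """Hypothetical parser: expand "w-100,wd-50" into one key set per frame."""
    frames = []
    for segment in action.split(","):
        # Keys are case-insensitive, so normalize before matching.
        match = _SEGMENT.match(segment.strip().lower())
        if match is None:
            raise ValueError(f"malformed segment: {segment!r}")
        keys, count = match.group(1), int(match.group(2))
        if count <= 0:
            raise ValueError(f"frame count must be positive: {segment!r}")
        if keys == "none":
            active = frozenset()  # static span: no keys held
        else:
            unknown = set(keys) - ALLOWED_KEYS
            if unknown:
                raise ValueError(f"unknown keys {sorted(unknown)} in {segment!r}")
            active = frozenset(keys)  # combined keys apply simultaneously
        frames.extend([active] * count)
    return frames
```

For example, `parse_action_dsl("w-100,wd-50,d-30,none-10")` returns 190 entries: 100 with `w` held, 50 with `w` and `d` held together, 30 with `d`, and 10 empty sets.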

Pose generation (`action_string_to_c2w`):

* **Translation** (`w`/`s`/`a`/`d`) moves at `translation_speed` (default `0.04` world-units/frame).
* **Rotation** (`i`/`k` pitch, `j`/`l` yaw) turns at `rotation_speed_deg` (default `1.2`°/frame); pitch is clamped to ±85°.
* **Strafe-yaw coupling** (coefficient `0.4`): a `d` (right) strafe also nudges yaw right and `a` (left) nudges yaw left, so `wd` traces a curving arc rather than a pure sidestep.
* Produces `(F+1, 4, 4)` camera-to-world matrices; the realtime stage pads the trajectory to the requested frame count.

Example: `"w-100,wd-50,d-30,none-10"` = 100 frames forward → 50 frames forward + sweep right → 30 frames right strafe → 10 frames static.
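
To make the pose-generation rules concrete, the sketch below integrates per-frame key sets into camera-to-world matrices. It is illustrative only and is not `action_string_to_c2w`. The function names, the axis convention (x right, z forward, yaw about y, pitch about local x), the sign choices, the reading of the coupling coefficient as a fraction of `rotation_speed_deg`, and the choice to translate along the fully rotated camera axes are all assumptions.

```python
import numpy as np


def rotation_yaw_pitch(yaw_deg, pitch_deg):
    """Rotation matrix for yaw about y followed by pitch about local x (assumed convention)."""
    y, p = np.radians(yaw_deg), np.radians(pitch_deg)
    ry = np.array([[np.cos(y), 0.0, np.sin(y)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(y), 0.0, np.cos(y)]])
    rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(p), -np.sin(p)],
                   [0.0, np.sin(p), np.cos(p)]])
    return ry @ rx


def keys_to_c2w(frames, translation_speed=0.04, rotation_speed_deg=1.2,
                pitch_limit_deg=85.0, strafe_yaw_coupling=0.4):
    """Hypothetical integrator: F per-frame key sets -> (F+1, 4, 4) c2w matrices."""
    yaw = pitch = 0.0
    pos = np.zeros(3)
    poses = [np.eye(4)]  # pose 0 is the starting camera
    for keys in frames:
        strafe = ("d" in keys) - ("a" in keys)
        # Strafing also nudges yaw in the same direction (coupling, assumed scaling).
        yaw += rotation_speed_deg * (("l" in keys) - ("j" in keys) + strafe_yaw_coupling * strafe)
        pitch += rotation_speed_deg * (("i" in keys) - ("k" in keys))
        pitch = float(np.clip(pitch, -pitch_limit_deg, pitch_limit_deg))
        rot = rotation_yaw_pitch(yaw, pitch)
        local_move = np.array([strafe, 0.0, ("w" in keys) - ("s" in keys)], dtype=float)
        pos = pos + translation_speed * (rot @ local_move)
        pose = np.eye(4)
        pose[:3, :3] = rot
        pose[:3, 3] = pos
        poses.append(pose)
    return np.stack(poses)
```

Under these assumptions, ten frames of `w` move the camera 0.4 world-units along its forward axis, and holding `wd` accumulates yaw on every frame, which is what bends the path into an arc.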

## 9. Configuration Reference

SANA-WM's defaults live in three places: **request-time** sampling params, the **pipeline config** (streaming/refiner knobs), and the **realtime adapter** (init-time overrides).

### Request-time — `SanaWMSamplingParams` (`configs/sample/sana_wm.py`)

| Field                 | Default | Purpose                                                               |
| --------------------- | ------- | --------------------------------------------------------------------- |
| `height`              | `704`   | Output height                                                         |
| `width`               | `1280`  | Output width                                                          |
| `num_frames`          | `49`    | Total pixel frames (must satisfy `(num_frames - 1) % 8 == 0`)         |
| `fps`                 | `16`    | Output frame rate (overrides the base default of 24)                  |
| `num_inference_steps` | `20`    | Stage-1 step count                                                    |
| `guidance_scale`      | `4.5`   | Dense-path CFG scale                                                  |
| `negative_prompt`     | `""`    | Negative prompt                                                       |
| `camera_to_world`     | `None`  | In-memory `(T,4,4)` c2w extrinsics (mutually exclusive with `action`) |
| `intrinsics`          | `None`  | In-memory `(T,3,3)` pinhole intrinsics                                |
| `action`              | `None`  | Action-DSL string (see §8)                                            |
| `translation_speed`   | `0.04`  | World-units/frame for W/S/A/D                                         |
| `rotation_speed_deg`  | `1.2`   | Degrees/frame for I/K/J/L                                             |
| `pitch_limit_deg`     | `85.0`  | Pitch clamp                                                           |

`generator_device` is inherited from the base `SamplingParams` (default `None` = use the pipeline/model default). On the `/v1/videos` HTTP API the camera fields are passed inside `diffusers_kwargs` (`action` / `intrinsics`, as in §4–5).
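
The `num_frames` constraint in the table is easy to violate when computing a frame count from a target duration. The helper below is a hypothetical client-side convenience (it is not part of SGLang) that rounds a requested count up to the next value satisfying `(num_frames - 1) % 8 == 0`.

```python
def snap_num_frames(requested, stride=8):
    """Hypothetical helper: round up so that (num_frames - 1) % stride == 0."""
    if requested < 1:
        raise ValueError("num_frames must be at least 1")
    # Ceiling division on (requested - 1), then add the leading frame back.
    blocks = -(-(requested - 1) // stride)
    return blocks * stride + 1


def clip_seconds(num_frames, fps=16):
    """Clip duration at the given frame rate."""
    return num_frames / fps
```

With the defaults above, `snap_num_frames(49)` leaves the value at 49 and `clip_seconds(49)` is 3.0625 seconds, while a request for 50 frames would be raised to 57.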

### Pipeline config — `SanaWMPipelineConfig` (`configs/pipeline_configs/sana_wm.py`)

These are server-launch knobs (set via the `--streaming` / `--refiner-chunked` / `--num-frame-per-block` CLI flags or a pipeline-config override), **not** request fields:

| Field                   | Default                    | Purpose                                                  |
| ----------------------- | -------------------------- | -------------------------------------------------------- |
| `streaming`             | `False`                    | Chunk-causal `forward_long` (§5) vs dense one-shot (§4)  |
| `refiner_chunked`       | `True`                     | Chunk-wise streaming refiner vs whole-clip dense refiner |
| `num_frame_per_block`   | `3`                        | Latent frames per Stage-1 / refiner chunk                |
| `num_cached_blocks`     | `2`                        | Rolling KV-cache history window                          |
| `denoising_step_list`   | `(1000, 960, 889, 727, 0)` | 4-step streaming self-forcing timesteps (must end in 0)  |
| `streaming_cfg_scale`   | `1.0`                      | CFG scale for the distilled streaming path (1.0 = off)   |
| `sink_size`             | `1`                        | Sink (unrefined context) frames                          |
| `refiner_block_size`    | `3`                        | Refiner block size                                       |
| `refiner_kv_max_frames` | `11`                       | Refiner sliding KV window                                |

### Realtime adapter init overrides — `SanaWMRealtimeAdapter`

At WebSocket `init` the realtime adapter fills SANA-WM defaults that differ from the request/sampling defaults above:

| Field                 | Realtime default | Note                                                                  |
| --------------------- | ---------------- | --------------------------------------------------------------------- |
| `size`                | `832x480`        | Realtime request default; pass `1280x704` for native landscape output |
| `num_frames`          | *(unset)*        | Omitting → open-ended continuous session (§7)                         |
| `num_inference_steps` | `4`              | Distilled few-step                                                    |
| `guidance_scale`      | `1.0`            | CFG off                                                               |
| `fps`                 | `16`             | Native rate                                                           |

<Note>
  `guidance_scale` applies only to the dense path (§4). The distilled streaming path reads `streaming_cfg_scale` instead (default `1.0`, i.e. no CFG), so a `guidance_scale` override never accidentally enables CFG on the streaming stage. `denoising_step_list = (1000, 960, 889, 727, 0)` is the official 4-step streaming schedule and must end in 0.
</Note>
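
One way to read the five-entry schedule as "4 steps" is that each step denoises from one timestep to the next, with the final step landing on 0. The sketch below encodes that reading. `schedule_steps` is a hypothetical helper, and the strictly-decreasing check is an assumption beyond the documented "must end in 0" rule.

```python
def schedule_steps(denoising_step_list):
    """Hypothetical helper: turn a timestep schedule into (from_t, to_t) pairs."""
    ts = tuple(denoising_step_list)
    if len(ts) < 2 or ts[-1] != 0:
        raise ValueError("schedule must contain at least two timesteps and end in 0")
    if any(a <= b for a, b in zip(ts, ts[1:])):
        # Assumption: timesteps only ever decrease toward 0.
        raise ValueError("schedule must be strictly decreasing")
    return list(zip(ts, ts[1:]))


steps = schedule_steps((1000, 960, 889, 727, 0))
# steps holds four pairs: (1000, 960), (960, 889), (889, 727), (727, 0)
```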