# LingBot World

## 1. Model Introduction

[LingBot World](https://huggingface.co/robbyant/lingbot-world-fast-diffusers) is a realtime camera-controlled video world model. In SGLang-diffusion, it belongs to the realtime causal path: the server keeps a live session, samples control signals per chunk, reuses causal DiT state, and decodes video frames incrementally.

<div style={{border: "1px solid #dbe3ef", borderRadius: "16px", overflow: "hidden", background: "linear-gradient(135deg, #ffffff 0%, #f8fafc 48%, #ecfeff 100%)", boxShadow: "0 14px 36px rgba(15, 23, 42, 0.08)", margin: "22px 0 28px 0"}}>
  <div style={{display: "flex", alignItems: "center", justifyContent: "space-between", gap: "16px", padding: "18px 22px", borderBottom: "1px solid #e2e8f0"}}>
    <div>
      <div style={{fontSize: "20px", fontWeight: 750, color: "#0f172a", lineHeight: 1.2}}>LingBot World</div>
      <div style={{fontSize: "13px", color: "#64748b", marginTop: "5px"}}>Realtime diffusion world model in SGLang-diffusion</div>
    </div>

    <div style={{display: "inline-flex", alignItems: "center", whiteSpace: "nowrap", border: "1px solid #bae6fd", background: "#e0f2fe", color: "#0369a1", borderRadius: "999px", padding: "8px 12px", fontSize: "13px", fontWeight: 800}}>
      prompt + image + control → streaming video
    </div>
  </div>

  <div style={{display: "grid", gridTemplateColumns: "170px 1fr"}}>
    <div style={{padding: "12px 22px", borderBottom: "1px solid #edf2f7", color: "#64748b", fontWeight: 700, background: "rgba(248, 250, 252, 0.72)"}}>Category</div>

    <div style={{padding: "12px 22px", borderBottom: "1px solid #edf2f7", color: "#1e293b"}}>
      <span style={{display: "inline-block", marginRight: "6px", padding: "4px 9px", borderRadius: "999px", background: "#ecfeff", color: "#155e75", fontSize: "12px", fontWeight: 750}}>realtime</span>
      <span style={{display: "inline-block", marginRight: "6px", padding: "4px 9px", borderRadius: "999px", background: "#ecfeff", color: "#155e75", fontSize: "12px", fontWeight: 750}}>world model</span>
      <span style={{display: "inline-block", marginRight: "6px", padding: "4px 9px", borderRadius: "999px", background: "#ecfeff", color: "#155e75", fontSize: "12px", fontWeight: 750}}>causal DiT</span>
    </div>

    <div style={{padding: "12px 22px", borderBottom: "1px solid #edf2f7", color: "#64748b", fontWeight: 700, background: "rgba(248, 250, 252, 0.72)"}}>Inputs</div>
    <div style={{padding: "12px 22px", borderBottom: "1px solid #edf2f7", color: "#1e293b"}}>Prompt, first frame, and per-chunk camera control signals</div>
    <div style={{padding: "12px 22px", borderBottom: "1px solid #edf2f7", color: "#64748b", fontWeight: 700, background: "rgba(248, 250, 252, 0.72)"}}>Outputs</div>
    <div style={{padding: "12px 22px", borderBottom: "1px solid #edf2f7", color: "#1e293b"}}>Streaming video frame chunks over <code>/v1/realtime_video/generate</code></div>
    <div style={{padding: "12px 22px", color: "#64748b", fontWeight: 700, background: "rgba(248, 250, 252, 0.72)"}}>Core runtime</div>
    <div style={{padding: "12px 22px", color: "#1e293b"}}>Condition queue, causal DiT KV cache, causal VAE decode cache, and realtime session state</div>
  </div>
</div>

This is different from offline diffusion video models such as Wan or LTX. Offline models denoise a bounded latent sequence for one request. Realtime world models generate a continuing stream, so the runtime must manage session state, control events, causal attention cache, and VAE decode cache.
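To make the causal attention cache concrete, here is a minimal sketch of a sink-plus-sliding-window retention rule of the kind suggested by the `realtime_causal_sink_size` and `realtime_causal_kv_cache_num_frames` parameters. It is an illustration only, not SGLang's implementation: the function name `retained_frames` is hypothetical, and it assumes the cache is tracked per frame and that the recent window is counted separately from the sink frames.

```python
def retained_frames(num_generated: int, sink_size: int, window: int) -> list[int]:
    """Return indices of frames whose cache entries remain visible to attention.

    Assumption: the first `sink_size` frames are always kept, plus the most
    recent `window` frames. Everything in between is evicted.
    """
    if min(num_generated, sink_size, window) < 0:
        raise ValueError("all arguments must be non-negative")
    # Sink frames: the earliest frames of the session, kept for stability.
    sink = list(range(min(sink_size, num_generated)))
    # Recent frames: the tail of the stream, skipping frames already in the sink.
    start = max(len(sink), num_generated - window)
    recent = list(range(start, num_generated))
    return sink + recent


if __name__ == "__main__":
    # After 10 frames with 2 sink frames and a 3-frame recent window.
    print(retained_frames(10, sink_size=2, window=3))
```

Under these assumptions the cache size stays bounded at `sink_size + window` frames no matter how long the session runs, which is what lets a continuing stream avoid unbounded memory growth.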

## 2. Deployment

```bash Command theme={null}
sglang serve \
  --model-path robbyant/lingbot-world-fast-diffusers \
  --pipeline-class-name LingBotWorldCausalDMDPipeline \
  --num-gpus 4 \
  --ulysses-degree 4 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false
```

## 3. Realtime WebUI

The lightweight local WebUI is useful for validating latency, frame transport, and camera control behavior.

```bash Command theme={null}
python -m http.server 18080 -d python/sglang/multimodal_gen/apps/realtime_webui
```

Open `http://127.0.0.1:18080` and use:

```text Example theme={null}
ws://127.0.0.1:30000/v1/realtime_video/generate
```

## 4. HTTP and WebSocket API

LingBot World uses the realtime video WebSocket endpoint. The server keeps one live session, generates one chunk at a time, and accepts runtime control events while generation is running.

### Endpoints

| API                           | Method      | Purpose                                                                | Notes                                                                               |
| ----------------------------- | ----------- | ---------------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
| `/v1/models`                  | `GET`       | Query the served model id before opening a session.                    | The WebUI uses this to fill the model field when the server exposes model metadata. |
| `/v1/realtime_video/generate` | `WebSocket` | Create one realtime LingBot session and stream generated video chunks. | The first client message must be an `init` message encoded with MessagePack.        |
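The `/v1/models` lookup can be sketched with the standard library alone. The response shape is not specified on this page, so the sketch assumes an OpenAI-style body of the form `{"data": [{"id": ...}]}`; the helper names `models_url`, `extract_model_ids`, and `fetch_model_ids` are hypothetical.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from a server base URL."""
    return base_url.rstrip("/") + "/v1/models"


def extract_model_ids(body: dict) -> list[str]:
    """Pull model ids out of a response body.

    Assumption: the body looks like {"data": [{"id": "..."}, ...]}.
    Entries without an "id" key are skipped rather than raising.
    """
    return [m["id"] for m in body.get("data", []) if isinstance(m, dict) and "id" in m]


def fetch_model_ids(base_url: str, timeout: float = 5.0) -> list[str]:
    """GET /v1/models and return the served model ids."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

With a server running locally, `fetch_model_ids("http://127.0.0.1:30000")` would return the ids to place in the `model` field of the `init` message, if that assumed response shape holds.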

### `init` message

Send this MessagePack map immediately after the WebSocket opens.

| Parameter                             | Type                        | Required | Meaning                                                                                                                         |
| ------------------------------------- | --------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------- |
| `type`                                | string                      | Yes      | Must be `"init"`.                                                                                                               |
| `model`                               | string                      | No       | Model id. Leave empty to use the served model.                                                                                  |
| `prompt`                              | string                      | Yes      | Text prompt for the initial scene and motion style.                                                                             |
| `first_frame`                         | bytes or string             | Yes      | Initial reference image. Send bytes from the WebUI/client, or a server-readable image path/string.                              |
| `size`                                | string                      | Yes      | Generation size as `WIDTHxHEIGHT`, for example `832x480`.                                                                       |
| `fps`                                 | number                      | Yes      | Target playback FPS for the generated stream.                                                                                   |
| `num_frames`                          | integer                     | Yes      | Frames per generated chunk. LingBot uses chunked causal generation, so this controls per-chunk latency and queue size.          |
| `seed`                                | integer                     | No       | Random seed for deterministic sampling.                                                                                         |
| `num_inference_steps`                 | integer                     | No       | Denoising steps per chunk. LingBot defaults to `4` when omitted.                                                                |
| `guidance_scale`                      | number                      | No       | Classifier-free guidance scale. Realtime LingBot commonly uses `1`.                                                             |
| `negative_prompt`                     | string                      | No       | Negative prompt passed to the diffusion pipeline.                                                                               |
| `max_chunks`                          | integer                     | No       | Stop after this many chunks. Omit for a continuous session.                                                                     |
| `realtime_causal_sink_size`           | integer                     | No       | Number of sink frames/tokens retained in the causal attention window.                                                           |
| `realtime_causal_kv_cache_num_frames` | integer                     | No       | Number of recent frames retained in the causal KV cache window.                                                                 |
| `realtime_output_format`              | `"webp"`, `"jpeg"`, `"raw"` | No       | Preview/output transport. `webp` and `jpeg` send encoded preview frames; `raw` sends raw RGB; omit for lossless delta-gzip RGB. |
| `output_compression`                  | integer                     | No       | Preview quality for `webp` or `jpeg`, from `1` to `100`.                                                                        |
| `enable_upscaling`                    | boolean                     | No       | Enable server-side super resolution after frame decode.                                                                         |
| `upscaling_scale`                     | integer                     | No       | Super-resolution scale. Current default is `4` when upscaling is enabled.                                                       |
| `upscaling_model_path`                | string                      | No       | Optional Real-ESRGAN model path.                                                                                                |
| `enable_frame_interpolation`          | boolean                     | No       | Enable frame interpolation. Keep this disabled when measuring true generated FPS.                                               |
| `frame_interpolation_exp`             | integer                     | No       | Interpolation multiplier exponent. `1` means 2x frames.                                                                         |
| `frame_interpolation_scale`           | number                      | No       | RIFE internal scale for interpolation.                                                                                          |
| `frame_interpolation_model_path`      | string                      | No       | Optional RIFE model path.                                                                                                       |
| `condition_inputs.camera_actions`     | `list[list[string]]`        | No       | Initial scripted camera actions, one action list per frame.                                                                     |
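A small builder can catch malformed `init` messages before they reach the server. The sketch below is a client-side convenience, not part of SGLang: `build_init` is a hypothetical helper, and its checks only mirror the constraints stated in the table above (the server performs its own validation).

```python
import re


def build_init(prompt: str, first_frame, size: str, fps: float, num_frames: int, **optional) -> dict:
    """Assemble an `init` map from the required fields plus any optional ones."""
    if not prompt:
        raise ValueError("prompt must be a non-empty string")
    if not isinstance(first_frame, (bytes, str)) or not first_frame:
        raise ValueError("first_frame must be non-empty bytes or a path string")
    if not re.fullmatch(r"[1-9]\d*x[1-9]\d*", size):
        raise ValueError("size must look like WIDTHxHEIGHT, for example 832x480")
    if fps <= 0 or num_frames <= 0:
        raise ValueError("fps and num_frames must be positive")
    quality = optional.get("output_compression")
    if quality is not None and not 1 <= quality <= 100:
        raise ValueError("output_compression must be between 1 and 100")

    message = {
        "type": "init",
        "prompt": prompt,
        "first_frame": first_frame,
        "size": size,
        "fps": fps,
        "num_frames": num_frames,
    }
    # Leave unset optionals out entirely so the server applies its defaults.
    message.update({key: value for key, value in optional.items() if value is not None})
    return message
```

The returned dict is plain data, so it can be passed to any MessagePack encoder before being sent as the first WebSocket message.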

### Runtime `event` messages

After `init`, send MessagePack event maps to update the live session.

| Parameter  | Type                             | Required | Meaning                                                                                                             |
| ---------- | -------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------- |
| `type`     | string                           | Yes      | Must be `"event"`.                                                                                                  |
| `kind`     | `"prompt"` or `"camera_actions"` | Yes      | Runtime event kind.                                                                                                 |
| `payload`  | string or object/list            | Yes      | For `prompt`, a non-empty string. For `camera_actions`, either scripted `list[list[string]]` or state-mode payload. |
| `event_id` | integer                          | No       | Client sequence id. The server echoes it in chunk/frame metadata after the event is sampled.                        |

`camera_actions` supports two payload modes:

| Mode   | Payload shape                                                                   | Meaning                                                            |
| ------ | ------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| Script | `list[list[string]]`                                                            | A fixed sequence of per-frame actions consumed by upcoming chunks. |
| State  | `{ "mode": "state", "transitions": [{"actions": [...], "client_ts_ms": ...}] }` | Live control state transitions from keyboard or UI controls.       |

Supported LingBot action tokens include `w`, `a`, `s`, `d` for camera movement and `i`, `j`, `k`, `l` for look controls.

### Server messages

| Message       | Payload                              | Meaning                                                                                                                                                   |
| ------------- | ------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `frame_batch` | MessagePack map with `payload` bytes | One batch of frames. The map includes `chunk_index`, `num_frames`, `content_type`, `encoding`, `width`, `height`, and frame-batch metadata.               |
| `chunk_stats` | MessagePack map                      | Per-chunk timing and transport metrics, including `scheduler_forward_ms`, `raw_payload_build_ms`, `chunk_total_ms`, `num_frames`, and `ws_payload_bytes`. |
| `error`       | MessagePack map                      | Server-side validation or generation error.                                                                                                               |
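The `chunk_stats` fields are enough to estimate whether generation keeps up with playback. The sketch below assumes `chunk_total_ms` is the wall-clock time spent producing one chunk, which this page does not state explicitly; `summarize_chunk` is a hypothetical helper that takes an already-decoded stats map.

```python
def summarize_chunk(stats: dict, target_fps: float) -> dict:
    """Derive throughput figures from one decoded `chunk_stats` map."""
    frames = stats["num_frames"]
    total_s = stats["chunk_total_ms"] / 1000.0
    # Frames produced per second of wall-clock time (assumed meaning of chunk_total_ms).
    generated_fps = frames / total_s if total_s > 0 else float("inf")
    return {
        "generated_fps": generated_fps,
        # True when chunks are produced at least as fast as they are played back.
        "keeps_up": generated_fps >= target_fps,
        "bytes_per_frame": stats["ws_payload_bytes"] / frames if frames else 0.0,
    }


if __name__ == "__main__":
    sample = {"chunk_total_ms": 250.0, "num_frames": 10, "ws_payload_bytes": 100000}
    print(summarize_chunk(sample, target_fps=25))
```

Comparing `bytes_per_frame` across `webp`, `jpeg`, and `raw` sessions gives a rough view of the transport cost of each output format.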

### Minimal client sketch

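```python Python theme={null}
# Requires the `msgspec` and `websocket-client` packages.
import msgspec.msgpack
import websocket

# Read the reference image up front so the file handle is closed promptly.
with open("reference.jpg", "rb") as f:
    first_frame = f.read()

ws = websocket.create_connection("ws://127.0.0.1:30000/v1/realtime_video/generate")

# The first client message must be a MessagePack-encoded `init` map.
ws.send_binary(msgspec.msgpack.encode({
    "type": "init",
    "prompt": "A quiet rainy London alley, stable camera motion.",
    "first_frame": first_frame,
    "size": "832x480",
    "fps": 25,
    "num_frames": 9,
    "num_inference_steps": 4,
    "guidance_scale": 1,
    "realtime_output_format": "webp",
    "output_compression": 95,
}))

# Runtime event: a state-mode transition that applies the `w` movement token.
ws.send_binary(msgspec.msgpack.encode({
    "type": "event",
    "kind": "camera_actions",
    "event_id": 1,
    "payload": {"mode": "state", "transitions": [{"actions": ["w"], "client_ts_ms": 0}]},
}))

# Server messages are MessagePack maps; decode the first one, then close.
message = msgspec.msgpack.decode(ws.recv())
print(list(message.keys()))
ws.close()
```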

## 5. Consistency

LingBot World's consistency checks compare raw-frame WebSocket output against ground-truth (GT) frames and apply per-chunk latency guards.
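A check of this kind can be sketched as follows. This is not the SGLang test harness, which this page does not show: the helper names, the `tolerance` parameter, and the `budget_ms` threshold are all hypothetical, and frames are assumed to be packed 8-bit RGB with three bytes per pixel.

```python
def max_abs_diff(actual: bytes, reference: bytes) -> int:
    """Largest per-byte difference between two equally sized raw frames."""
    if len(actual) != len(reference):
        raise ValueError("frames differ in size")
    return max((abs(a - b) for a, b in zip(actual, reference)), default=0)


def frame_matches(actual: bytes, reference: bytes, width: int, height: int, tolerance: int = 0) -> bool:
    """Compare one raw RGB frame against its ground-truth counterpart."""
    expected_len = width * height * 3  # assumption: packed 8-bit RGB
    if len(actual) != expected_len or len(reference) != expected_len:
        raise ValueError("frame size does not match width x height x 3")
    return max_abs_diff(actual, reference) <= tolerance


def chunk_within_budget(chunk_total_ms: float, budget_ms: float) -> bool:
    """Latency guard: fail the check when a chunk takes longer than the budget."""
    return chunk_total_ms <= budget_ms
```

A tolerance of `0` demands bit-exact output, which is only meaningful with a fixed `seed` and the raw transport; a small non-zero tolerance is one way to absorb numerical noise across hardware, if the test policy allows it.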

## 6. Notes

* Use the realtime endpoint for interactive sessions: `/v1/realtime_video/generate`.
* Prefer WebP preview transport for interactive testing; use raw-frame transport for consistency checks.
* Long-running sessions should be validated with raw-frame consistency before changing causal cache, condition sampling, or VAE decode behavior.
