# LingBot World

## 1. Model Introduction

[LingBot World](https://huggingface.co/robbyant/lingbot-world-fast-diffusers) is a realtime camera-controlled video world model. In SGLang-diffusion, it belongs to the realtime causal path: the server keeps a live session, samples control signals per chunk, reuses causal DiT state, and decodes video frames incrementally.

**LingBot World**: realtime diffusion world model in SGLang-diffusion (prompt + image + control → streaming video)

| Field | Value |
| --- | --- |
| Category | realtime world model, causal DiT |
| Inputs | Prompt, first frame, and per-chunk camera control signals |
| Outputs | Streaming video frame chunks over `/v1/realtime_video/generate` |
| Core runtime | Condition queue, causal DiT KV cache, causal VAE decode cache, and realtime session state |

This is different from offline diffusion video models such as Wan or LTX. Offline models denoise a bounded latent sequence for one request. Realtime world models generate a continuing stream, so the runtime must manage session state, control events, causal attention cache, and VAE decode cache.

## 2. Deployment

```bash
sglang serve \
  --model-path robbyant/lingbot-world-fast-diffusers \
  --pipeline-class-name LingBotWorldCausalDMDPipeline \
  --num-gpus 4 \
  --ulysses-degree 4 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false
```

## 3. Realtime WebUI

The lightweight local WebUI is useful for validating latency, frame transport, and camera control behavior.

```bash
python -m http.server 18080 -d python/sglang/multimodal_gen/apps/realtime_webui
```

Open `http://127.0.0.1:18080` and use:

```text
ws://127.0.0.1:30000/v1/realtime_video/generate
```

## 4. HTTP and WebSocket API

LingBot World uses the realtime video WebSocket endpoint. The server keeps one live session, generates one chunk at a time, and accepts runtime control events while generation is running.

### Endpoints

| API | Method | Purpose | Notes |
| --- | --- | --- | --- |
| `/v1/models` | `GET` | Query the served model id before opening a session. | The WebUI uses this to fill the model field when the server exposes model metadata. |
| `/v1/realtime_video/generate` | `WebSocket` | Create one realtime LingBot session and stream generated video chunks. | The first client message must be an `init` message encoded with MessagePack. |

### `init` message

Send this MessagePack map immediately after the WebSocket opens.

| Parameter | Type | Required | Meaning |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `"init"`. |
| `model` | string | No | Model id. Leave empty to use the served model. |
| `prompt` | string | Yes | Text prompt for the initial scene and motion style. |
| `first_frame` | bytes or string | Yes | Initial reference image. Send bytes from the WebUI/client, or a server-readable image path/string. |
| `size` | string | Yes | Generation size as `WIDTHxHEIGHT`, for example `832x480`. |
| `fps` | number | Yes | Target playback FPS for the generated stream. |
| `num_frames` | integer | Yes | Frames per generated chunk. LingBot uses chunked causal generation, so this controls per-chunk latency and queue size. |
| `seed` | integer | No | Random seed for deterministic sampling. |
| `num_inference_steps` | integer | No | Denoising steps per chunk. LingBot defaults to `4` when omitted. |
| `guidance_scale` | number | No | Classifier-free guidance scale. Realtime LingBot commonly uses `1`. |
| `negative_prompt` | string | No | Negative prompt passed to the diffusion pipeline. |
| `max_chunks` | integer | No | Stop after this many chunks. Omit for a continuous session. |
| `realtime_causal_sink_size` | integer | No | Number of sink frames/tokens retained in the causal attention window. |
| `realtime_causal_kv_cache_num_frames` | integer | No | Number of recent frames retained in the causal KV cache window. |
| `realtime_output_format` | `"webp"`, `"jpeg"`, `"raw"` | No | Preview/output transport. `webp` and `jpeg` send encoded preview frames; `raw` sends raw RGB; omit for lossless delta-gzip RGB. |
| `output_compression` | integer | No | Preview quality for `webp` or `jpeg`, from `1` to `100`. |
| `enable_upscaling` | boolean | No | Enable server-side super resolution after frame decode. |
| `upscaling_scale` | integer | No | Super-resolution scale. Current default is `4` when upscaling is enabled. |
| `upscaling_model_path` | string | No | Optional Real-ESRGAN model path. |
| `enable_frame_interpolation` | boolean | No | Enable frame interpolation. Keep this disabled when measuring true generated FPS. |
| `frame_interpolation_exp` | integer | No | Interpolation multiplier exponent. `1` means 2x frames. |
| `frame_interpolation_scale` | number | No | RIFE internal scale for interpolation. |
| `frame_interpolation_model_path` | string | No | Optional RIFE model path. |
| `condition_inputs.camera_actions` | `list[list[string]]` | No | Initial scripted camera actions, one action list per frame. |

### Runtime `event` messages

After `init`, send MessagePack event maps to update the live session.

| Parameter | Type | Required | Meaning |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `"event"`. |
| `kind` | `"prompt"` or `"camera_actions"` | Yes | Runtime event kind. |
| `payload` | string or object/list | Yes | For `prompt`, a non-empty string. For `camera_actions`, either scripted `list[list[string]]` or state-mode payload. |
| `event_id` | integer | No | Client sequence id. The server echoes it in chunk/frame metadata after the event is sampled. |

`camera_actions` supports two payload modes:

| Mode | Payload shape | Meaning |
| --- | --- | --- |
| Script | `list[list[string]]` | A fixed sequence of per-frame actions consumed by upcoming chunks. |
| State | `{ "mode": "state", "transitions": [{"actions": [...], "client_ts_ms": ...}] }` | Live control state transitions from keyboard or UI controls. |

Supported LingBot action tokens include `w`, `a`, `s`, `d` for camera movement and `i`, `j`, `k`, `l` for look controls.

### Server messages

| Message | Payload | Meaning |
| --- | --- | --- |
| `frame_batch` | MessagePack map with `payload` bytes | One batch of frames. The map includes `chunk_index`, `num_frames`, `content_type`, `encoding`, `width`, `height`, and frame-batch metadata. |
| `chunk_stats` | MessagePack map | Per-chunk timing and transport metrics, including `scheduler_forward_ms`, `raw_payload_build_ms`, `chunk_total_ms`, `num_frames`, and `ws_payload_bytes`. |
| `error` | MessagePack map | Server-side validation or generation error. |

### Minimal client sketch

```python
from pathlib import Path

import msgspec.msgpack
import websocket  # provided by the websocket-client package

ws = websocket.create_connection("ws://127.0.0.1:30000/v1/realtime_video/generate")

# The first message must be a MessagePack-encoded init map.
ws.send_binary(msgspec.msgpack.encode({
    "type": "init",
    "prompt": "A quiet rainy London alley, stable camera motion.",
    "first_frame": Path("reference.jpg").read_bytes(),
    "size": "832x480",
    "fps": 25,
    "num_frames": 9,
    "num_inference_steps": 4,
    "guidance_scale": 1,
    "realtime_output_format": "webp",
    "output_compression": 95,
}))

# Runtime event: a state-mode camera update carrying the "w" movement token.
ws.send_binary(msgspec.msgpack.encode({
    "type": "event",
    "kind": "camera_actions",
    "event_id": 1,
    "payload": {"mode": "state", "transitions": [{"actions": ["w"], "client_ts_ms": 0}]},
}))
```

## 5. Consistency

LingBot World consistency checks use raw-frame WebSocket output as ground truth (GT), plus per-chunk latency guards.

## 6. Notes

* Use the realtime endpoint for interactive sessions: `/v1/realtime_video/generate`.
* Prefer WebP preview transport for interactive testing; use raw-frame transport for consistency checks.
* Long-running sessions should be validated with raw-frame consistency before changing causal cache, condition sampling, or VAE decode behavior.
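
The runtime `event` messages from section 4 can be assembled with a few small helpers. The sketch below is illustrative only: the function names (`prompt_event`, `scripted_camera_event`, `state_camera_event`) and the `KNOWN_ACTIONS` check are hypothetical and not part of SGLang. The check assumes the eight tokens listed on this page are the only ones a client wants to send, although the server may accept others. The helpers return plain dicts, which a client would still need to encode with MessagePack before sending.

```python
from __future__ import annotations

from typing import Any, Optional

# Action tokens listed on this page. Treating them as the complete set is an
# assumption made for client-side validation only.
KNOWN_ACTIONS = frozenset("wasdijkl")


def _check_actions(actions: list[str]) -> list[str]:
    """Return a copy of the action list, rejecting tokens not listed above."""
    unknown = [a for a in actions if a not in KNOWN_ACTIONS]
    if unknown:
        raise ValueError(f"unknown action tokens: {unknown}")
    return list(actions)


def _event(kind: str, payload: Any, event_id: Optional[int]) -> dict[str, Any]:
    """Build the common event envelope; event_id is included only when given."""
    message: dict[str, Any] = {"type": "event", "kind": kind, "payload": payload}
    if event_id is not None:
        message["event_id"] = event_id
    return message


def prompt_event(text: str, event_id: Optional[int] = None) -> dict[str, Any]:
    """Prompt update; the payload must be a non-empty string."""
    if not text:
        raise ValueError("prompt payload must be a non-empty string")
    return _event("prompt", text, event_id)


def scripted_camera_event(
    frames: list[list[str]], event_id: Optional[int] = None
) -> dict[str, Any]:
    """Script mode: one action list per frame."""
    return _event("camera_actions", [_check_actions(f) for f in frames], event_id)


def state_camera_event(
    actions: list[str], client_ts_ms: int, event_id: Optional[int] = None
) -> dict[str, Any]:
    """State mode: a single live control-state transition."""
    transition = {"actions": _check_actions(actions), "client_ts_ms": client_ts_ms}
    payload = {"mode": "state", "transitions": [transition]}
    return _event("camera_actions", payload, event_id)
```

With the connection from the minimal client sketch, a call such as `ws.send_binary(msgspec.msgpack.encode(state_camera_event(["w", "j"], client_ts_ms=0, event_id=2)))` would send one state transition. Building the dict separately from encoding keeps the payload shape easy to inspect and unit test.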
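
When validating latency, one way to read `chunk_stats` is to compare each chunk's `chunk_total_ms` against the playback time that chunk covers. The arithmetic below is a sketch under two assumptions that this page does not state: that a chunk of `num_frames` frames plays back at the `fps` given in `init`, and that `chunk_total_ms` is the wall-clock time spent producing the chunk. The function names are hypothetical, and the stats values in the usage note are made up for illustration.

```python
def chunk_playback_ms(num_frames: int, fps: float) -> float:
    """Playback time covered by one chunk, in milliseconds."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return 1000.0 * num_frames / fps


def generated_fps(num_frames: int, chunk_total_ms: float) -> float:
    """Frames produced per second of wall-clock time for one chunk."""
    if num_frames <= 0 or chunk_total_ms <= 0:
        raise ValueError("num_frames and chunk_total_ms must be positive")
    return 1000.0 * num_frames / chunk_total_ms


def keeps_up(stats: dict, fps: float) -> bool:
    """True if the chunk was produced at least as fast as it plays back.

    `stats` is a decoded chunk_stats map; only `num_frames` and
    `chunk_total_ms` are read here.
    """
    return stats["chunk_total_ms"] <= chunk_playback_ms(stats["num_frames"], fps)
```

With the values from the minimal client sketch (`num_frames` of 9 and `fps` of 25), each chunk covers 360 ms of playback. A chunk reporting a `chunk_total_ms` of 300 would keep up at 30 generated FPS, while one reporting 450 would fall behind at 20 generated FPS.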