1. Model Introduction
LingBot World is a realtime camera-controlled video world model. In SGLang-diffusion, it belongs to the realtime causal path: the server keeps a live session, samples control signals per chunk, reuses causal DiT state, and decodes video frames incrementally. This is different from offline diffusion video models such as Wan or LTX. Offline models denoise a bounded latent sequence for one request. Realtime world models generate a continuing stream, so the runtime must manage session state, control events, causal attention cache, and VAE decode cache.
2. Deployment
Command
3. Realtime WebUI
The lightweight local WebUI is useful for validating latency, frame transport, and camera control behavior.
Command
Open http://127.0.0.1:18080 and use:
Example
4. HTTP and WebSocket API
LingBot World uses the realtime video WebSocket endpoint. The server keeps one live session, generates one chunk at a time, and accepts runtime control events while generation is running.
Endpoints
| API | Method | Purpose | Notes |
|---|---|---|---|
| /v1/models | GET | Query the served model id before opening a session. | The WebUI uses this to fill the model field when the server exposes model metadata. |
| /v1/realtime_video/generate | WebSocket | Create one realtime LingBot session and stream generated video chunks. | The first client message must be an init message encoded with MessagePack. |
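As a rough illustration of the two endpoints above, the sketch below derives both URLs from one base address and reads the served model id. The base address, the helper names, and the OpenAI-style `{"data": [{"id": ...}]}` response shape are assumptions made for this example; check them against the running server.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit

MODELS_PATH = "/v1/models"
REALTIME_PATH = "/v1/realtime_video/generate"


def endpoint_urls(base_url: str) -> dict:
    """Derive the HTTP models URL and the WebSocket session URL from one base URL."""
    parts = urlsplit(base_url)
    # An https server is reached over wss, a plain http server over ws.
    ws_scheme = "wss" if parts.scheme == "https" else "ws"
    return {
        "models": urlunsplit((parts.scheme, parts.netloc, MODELS_PATH, "", "")),
        "realtime": urlunsplit((ws_scheme, parts.netloc, REALTIME_PATH, "", "")),
    }


def first_model_id(models_response: dict):
    """Pick the first model id, assuming an OpenAI-style {"data": [{"id": ...}]} body."""
    data = models_response.get("data") or []
    return data[0].get("id") if data else None


def fetch_model_id(base_url: str, timeout: float = 5.0):
    """GET /v1/models and return the first served model id, or None if absent."""
    url = endpoint_urls(base_url)["models"]
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return first_model_id(json.load(resp))
```

Calling `fetch_model_id("http://127.0.0.1:30000")` (the port is a placeholder) before opening the WebSocket mirrors what the WebUI does to fill its model field.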
init message
Send this MessagePack map immediately after the WebSocket opens.
| Parameter | Type | Required | Meaning |
|---|---|---|---|
| type | string | Yes | Must be "init". |
| model | string | No | Model id. Leave empty to use the served model. |
| prompt | string | Yes | Text prompt for the initial scene and motion style. |
| first_frame | bytes or string | Yes | Initial reference image. Send bytes from the WebUI/client, or a server-readable image path/string. |
| size | string | Yes | Generation size as WIDTHxHEIGHT, for example 832x480. |
| fps | number | Yes | Target playback FPS for the generated stream. |
| num_frames | integer | Yes | Frames per generated chunk. LingBot uses chunked causal generation, so this controls per-chunk latency and queue size. |
| seed | integer | No | Random seed for deterministic sampling. |
| num_inference_steps | integer | No | Denoising steps per chunk. LingBot defaults to 4 when omitted. |
| guidance_scale | number | No | Classifier-free guidance scale. Realtime LingBot commonly uses 1. |
| negative_prompt | string | No | Negative prompt passed to the diffusion pipeline. |
| max_chunks | integer | No | Stop after this many chunks. Omit for a continuous session. |
| realtime_causal_sink_size | integer | No | Number of sink frames/tokens retained in the causal attention window. |
| realtime_causal_kv_cache_num_frames | integer | No | Number of recent frames retained in the causal KV cache window. |
| realtime_output_format | "webp", "jpeg", "raw" | No | Preview/output transport. webp and jpeg send encoded preview frames; raw sends raw RGB; omit for lossless delta-gzip RGB. |
| output_compression | integer | No | Preview quality for webp or jpeg, from 1 to 100. |
| enable_upscaling | boolean | No | Enable server-side super resolution after frame decode. |
| upscaling_scale | integer | No | Super-resolution scale. Current default is 4 when upscaling is enabled. |
| upscaling_model_path | string | No | Optional Real-ESRGAN model path. |
| enable_frame_interpolation | boolean | No | Enable frame interpolation. Keep this disabled when measuring true generated FPS. |
| frame_interpolation_exp | integer | No | Interpolation multiplier exponent. 1 means 2x frames. |
| frame_interpolation_scale | number | No | RIFE internal scale for interpolation. |
| frame_interpolation_model_path | string | No | Optional RIFE model path. |
| condition_inputs.camera_actions | list[list[string]] | No | Initial scripted camera actions, one action list per frame. |
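The required and optional fields above can be assembled into a plain map before MessagePack encoding. The following is a minimal sketch: `build_init_message` is a hypothetical helper, the client-side checks are this example's own and may not match server validation, and the prompt, path, fps, and num_frames values are illustrative only.

```python
import re

_SIZE_RE = re.compile(r"^\d+x\d+$")


def build_init_message(prompt, first_frame, size, fps, num_frames, **optional):
    """Build the init map; optional table fields are passed as keyword arguments."""
    if not prompt:
        raise ValueError("prompt is required")
    if not _SIZE_RE.match(size):
        raise ValueError("size must look like WIDTHxHEIGHT, e.g. 832x480")
    if fps <= 0 or num_frames <= 0:
        raise ValueError("fps and num_frames must be positive")
    quality = optional.get("output_compression")
    if quality is not None and not 1 <= quality <= 100:
        raise ValueError("output_compression must be between 1 and 100")
    msg = {
        "type": "init",
        "prompt": prompt,
        "first_frame": first_frame,  # image bytes or a server-readable path
        "size": size,
        "fps": fps,
        "num_frames": num_frames,
    }
    # Drop unset optionals so the server applies its own defaults.
    msg.update({k: v for k, v in optional.items() if v is not None})
    return msg


# Illustrative values only; pick fps and num_frames for your own setup.
example_init = build_init_message(
    prompt="a forest trail at dawn",
    first_frame="/path/to/first_frame.png",
    size="832x480",
    fps=16,
    num_frames=12,
    realtime_output_format="webp",
    output_compression=80,
    seed=None,  # omitted from the message
)
```

With the third-party `msgpack` package installed, `msgpack.packb(example_init, use_bin_type=True)` would produce the bytes to send as the first WebSocket message.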
Runtime event messages
After init, send MessagePack event maps to update the live session.
| Parameter | Type | Required | Meaning |
|---|---|---|---|
| type | string | Yes | Must be "event". |
| kind | "prompt" or "camera_actions" | Yes | Runtime event kind. |
| payload | string or object/list | Yes | For prompt, a non-empty string. For camera_actions, either scripted list[list[string]] or state-mode payload. |
| event_id | integer | No | Client sequence id. The server echoes it in chunk/frame metadata after the event is sampled. |
camera_actions supports two payload modes:
| Mode | Payload shape | Meaning |
|---|---|---|
| Script | list[list[string]] | A fixed sequence of per-frame actions consumed by upcoming chunks. |
| State | { "mode": "state", "transitions": [{"actions": [...], "client_ts_ms": ...}] } | Live control state transitions from keyboard or UI controls. |
Supported actions are w, a, s, d for camera movement and i, j, k, l for look controls.
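The event shapes described above can be sketched as small builder functions. The helper names are hypothetical, rejecting actions outside the eight listed keys is this example's own choice, and allowing an empty action list to mean "no input" is an assumption.

```python
import time

MOVE_KEYS = {"w", "a", "s", "d"}   # camera movement
LOOK_KEYS = {"i", "j", "k", "l"}   # look controls
ALLOWED = MOVE_KEYS | LOOK_KEYS


def _checked(actions):
    """Return the actions as a list, rejecting anything outside the listed keys."""
    unknown = [a for a in actions if a not in ALLOWED]
    if unknown:
        raise ValueError(f"unknown camera actions: {unknown}")
    return list(actions)


def _event(kind, payload, event_id=None):
    msg = {"type": "event", "kind": kind, "payload": payload}
    if event_id is not None:
        msg["event_id"] = event_id  # echoed back by the server once sampled
    return msg


def prompt_event(text, event_id=None):
    if not text:
        raise ValueError("prompt payload must be a non-empty string")
    return _event("prompt", text, event_id)


def script_event(per_frame_actions, event_id=None):
    """Script mode: one action list per upcoming frame."""
    return _event("camera_actions", [_checked(f) for f in per_frame_actions], event_id)


def state_event(actions, client_ts_ms=None, event_id=None):
    """State mode: one transition describing the currently held controls."""
    if client_ts_ms is None:
        client_ts_ms = int(time.time() * 1000)
    transition = {"actions": _checked(actions), "client_ts_ms": client_ts_ms}
    return _event("camera_actions", {"mode": "state", "transitions": [transition]}, event_id)
```

For example, `script_event([["w"], ["w", "j"], []])` describes three frames of scripted motion, while `state_event(["w", "l"], event_id=7)` reports a live key state change. Each map is then MessagePack-encoded like the init message.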
Server messages
| Message | Payload | Meaning |
|---|---|---|
| frame_batch | MessagePack map with payload bytes | One batch of frames. The map includes chunk_index, num_frames, content_type, encoding, width, height, and frame-batch metadata. |
| chunk_stats | MessagePack map | Per-chunk timing and transport metrics, including scheduler_forward_ms, raw_payload_build_ms, chunk_total_ms, num_frames, and ws_payload_bytes. |
| error | MessagePack map | Server-side validation or generation error. |
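One way to read chunk_stats is to compare chunk_total_ms with the playback time a chunk covers, num_frames / fps. The sketch below does that arithmetic. Treating this comparison as the realtime criterion is an assumption of the example rather than a documented server guarantee, and the sample numbers are made up.

```python
def chunk_playback_ms(num_frames: int, fps: float) -> float:
    """Wall-clock time one chunk occupies when played back at the target FPS."""
    return 1000.0 * num_frames / fps


def realtime_headroom_ms(stats: dict, fps: float) -> float:
    """Positive when the chunk was produced faster than it plays back."""
    return chunk_playback_ms(stats["num_frames"], fps) - stats["chunk_total_ms"]


def generated_fps(stats: dict) -> float:
    """Frames produced per second of total chunk time."""
    return 1000.0 * stats["num_frames"] / stats["chunk_total_ms"]


# Made-up sample: 12 frames per chunk, 600 ms per chunk, 16 FPS target.
sample_stats = {"num_frames": 12, "chunk_total_ms": 600.0}
print(chunk_playback_ms(12, 16))               # 750.0
print(realtime_headroom_ms(sample_stats, 16))  # 150.0
print(generated_fps(sample_stats))             # 20.0
```

A negative headroom over many consecutive chunks would suggest the stream is falling behind playback, which is when lowering num_frames or num_inference_steps is worth trying.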
Minimal client sketch
Python
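The following is a minimal sketch of the session flow described in this section, not the project's reference client. It assumes the third-party `websockets` and `msgpack` packages, a placeholder host and port, that each server message is a binary MessagePack map carrying its message name under a "type" key and its frame bytes under "payload", and illustrative fps and num_frames values.

```python
import asyncio

# Placeholder address; replace with the actual server host and port.
WS_URL = "ws://127.0.0.1:30000/v1/realtime_video/generate"


def make_init(prompt, first_frame, size="832x480", fps=16, num_frames=12, max_chunks=2):
    """Required init fields plus max_chunks so the session stays bounded."""
    return {
        "type": "init",
        "prompt": prompt,
        "first_frame": first_frame,
        "size": size,
        "fps": fps,
        "num_frames": num_frames,
        "max_chunks": max_chunks,
    }


def summarize(msg: dict) -> str:
    """One-line description of a decoded server message ("type" key is assumed)."""
    kind = msg.get("type")
    if kind == "frame_batch":
        size = len(msg.get("payload", b""))
        return f"chunk {msg.get('chunk_index')}: {msg.get('num_frames')} frames, {size} bytes"
    if kind == "chunk_stats":
        return f"chunk took {msg.get('chunk_total_ms')} ms"
    if kind == "error":
        return f"error: {msg}"
    return f"unhandled message: {kind}"


async def run_session(url, init, event=None):
    # Third-party packages, imported here so the helpers above work without them.
    import msgpack
    import websockets

    async with websockets.connect(url, max_size=None) as ws:
        # The first client message must be the MessagePack-encoded init map.
        await ws.send(msgpack.packb(init, use_bin_type=True))
        if event is not None:
            await ws.send(msgpack.packb(event, use_bin_type=True))
        # The loop ends when the server closes the connection.
        async for raw in ws:
            msg = msgpack.unpackb(raw, raw=False)
            print(summarize(msg))
            if msg.get("type") == "error":
                break


def main():
    init = make_init("a forest trail at dawn", "/path/to/first_frame.png")
    asyncio.run(run_session(WS_URL, init))
```

Call `main()` against a running server to try it. Whether the server closes the socket after max_chunks is not stated here, so a real client may want its own chunk counter or timeout.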
5. Consistency
Consistency checks for LingBot World compare raw-frame WebSocket output against ground truth (GT) and apply per-chunk latency guards.
6. Notes
- Use the realtime endpoint for interactive sessions: /v1/realtime_video/generate.
- Prefer WebP preview transport for interactive testing; use raw-frame transport for consistency checks.
- Long-running sessions should be validated with raw-frame consistency before changing causal cache, condition sampling, or VAE decode behavior.
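A raw-frame consistency check with a latency guard could look roughly like the sketch below. It is an illustration of the idea only: the function names, the mean-absolute-difference metric, and both thresholds are hypothetical and are not taken from the project's test suite.

```python
def mean_abs_diff(frame_a: bytes, frame_b: bytes) -> float:
    """Mean absolute per-byte difference between two raw RGB buffers."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frame buffers differ in length")
    if not frame_a:
        return 0.0
    return sum(abs(x - y) for x, y in zip(frame_a, frame_b)) / len(frame_a)


def check_chunk(raw_rgb, reference_rgb, chunk_total_ms,
                max_diff=1.0, max_latency_ms=1000.0):
    """Return a list of problems for one chunk; an empty list means it passed."""
    problems = []
    diff = mean_abs_diff(raw_rgb, reference_rgb)
    if diff > max_diff:  # hypothetical tolerance for numeric drift
        problems.append(f"frame drift {diff:.3f} exceeds {max_diff}")
    if chunk_total_ms > max_latency_ms:  # hypothetical latency guard
        problems.append(f"chunk latency {chunk_total_ms} ms exceeds {max_latency_ms} ms")
    return problems
```

Here `raw_rgb` would come from a frame_batch received with realtime_output_format set to "raw", `reference_rgb` from a previously recorded run with the same seed, and `chunk_total_ms` from the matching chunk_stats message.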
