
1. Model Introduction

LingBot World is a realtime camera-controlled video world model. In SGLang-diffusion, it belongs to the realtime causal path: the server keeps a live session, samples control signals per chunk, reuses causal DiT state, and decodes video frames incrementally.
LingBot World: realtime diffusion world model in SGLang-diffusion
prompt + image + control → streaming video

Category: realtime, world model, causal DiT
Inputs: Prompt, first frame, and per-chunk camera control signals
Outputs: Streaming video frame chunks over /v1/realtime_video/generate
Core runtime: Condition queue, causal DiT KV cache, causal VAE decode cache, and realtime session state
This runtime differs from that of offline diffusion video models such as Wan or LTX, which denoise a bounded latent sequence for a single request. A realtime world model generates a continuing stream, so the runtime must also manage session state, control events, the causal attention cache, and the VAE decode cache.
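
To make the causal attention cache more concrete, the sketch below shows one common way a bounded cache can be kept for an unbounded stream: pin a few early "sink" frames and keep a sliding window of the most recent frames. This is an illustration only. The retention policy and the function name retained_frames are assumptions, not taken from the SGLang-diffusion source, and the sketch works on frame indices even though the sink size may be counted in tokens.

```python
def retained_frames(num_generated: int, sink_size: int, window_frames: int) -> list[int]:
    """Return the frame indices whose cache entries would be kept.

    Assumed policy (not confirmed by the source): keep the first
    `sink_size` frames plus the latest `window_frames` frames.
    """
    if min(num_generated, sink_size, window_frames) < 0:
        raise ValueError("all arguments must be non-negative")
    # Sink frames: the earliest frames, pinned for the whole session.
    sink = range(min(sink_size, num_generated))
    # Recent frames: a sliding window that never overlaps the sink.
    recent_start = max(num_generated - window_frames, len(sink))
    recent = range(recent_start, num_generated)
    return [*sink, *recent]
```

Under this assumed policy the cache size is bounded by sink_size + window_frames no matter how long the session runs, which is the property a continuing stream needs.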

2. Deployment

Command
sglang serve \
  --model-path robbyant/lingbot-world-fast-diffusers \
  --pipeline-class-name LingBotWorldCausalDMDPipeline \
  --num-gpus 4 \
  --ulysses-degree 4 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false

3. Realtime WebUI

The lightweight local WebUI is useful for validating latency, frame transport, and camera control behavior.
Command
python -m http.server 18080 -d python/sglang/multimodal_gen/apps/realtime_webui
Open http://127.0.0.1:18080 in a browser and point the WebUI at the server's realtime WebSocket endpoint:
Example
ws://127.0.0.1:30000/v1/realtime_video/generate

4. HTTP and WebSocket API

LingBot World uses the realtime video WebSocket endpoint. The server keeps one live session, generates one chunk at a time, and accepts runtime control events while generation is running.

Endpoints

| API | Method | Purpose | Notes |
| --- | --- | --- | --- |
| /v1/models | GET | Query the served model id before opening a session. | The WebUI uses this to fill the model field when the server exposes model metadata. |
| /v1/realtime_video/generate | WebSocket | Create one realtime LingBot session and stream generated video chunks. | The first client message must be an init message encoded with MessagePack. |
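
As a quick pre-flight check, a client can query /v1/models before opening the WebSocket. The sketch below uses only the standard library. The response shape (an OpenAI-style {"data": [{"id": ...}]} listing) and the base URL on port 30000 are assumptions, and the helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def extract_model_ids(body: dict) -> list[str]:
    """Collect model ids from an assumed {"data": [{"id": ...}]} listing."""
    return [entry["id"] for entry in body.get("data", []) if "id" in entry]


def fetch_model_ids(base_url: str = "http://127.0.0.1:30000",
                    timeout: float = 5.0) -> list[str]:
    """GET /v1/models and return the served model ids (assumed base URL)."""
    with urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

If the list comes back empty, the init message can simply omit the model field, which tells the server to use the served model.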

init message

Send this MessagePack map immediately after the WebSocket opens.

| Parameter | Type | Required | Meaning |
| --- | --- | --- | --- |
| type | string | Yes | Must be "init". |
| model | string | No | Model id. Leave empty to use the served model. |
| prompt | string | Yes | Text prompt for the initial scene and motion style. |
| first_frame | bytes or string | Yes | Initial reference image. Send bytes from the WebUI/client, or a server-readable image path/string. |
| size | string | Yes | Generation size as WIDTHxHEIGHT, for example 832x480. |
| fps | number | Yes | Target playback FPS for the generated stream. |
| num_frames | integer | Yes | Frames per generated chunk. LingBot uses chunked causal generation, so this controls per-chunk latency and queue size. |
| seed | integer | No | Random seed for deterministic sampling. |
| num_inference_steps | integer | No | Denoising steps per chunk. LingBot defaults to 4 when omitted. |
| guidance_scale | number | No | Classifier-free guidance scale. Realtime LingBot commonly uses 1. |
| negative_prompt | string | No | Negative prompt passed to the diffusion pipeline. |
| max_chunks | integer | No | Stop after this many chunks. Omit for a continuous session. |
| realtime_causal_sink_size | integer | No | Number of sink frames/tokens retained in the causal attention window. |
| realtime_causal_kv_cache_num_frames | integer | No | Number of recent frames retained in the causal KV cache window. |
| realtime_output_format | "webp", "jpeg", "raw" | No | Preview/output transport. webp and jpeg send encoded preview frames; raw sends raw RGB; omit for lossless delta-gzip RGB. |
| output_compression | integer | No | Preview quality for webp or jpeg, from 1 to 100. |
| enable_upscaling | boolean | No | Enable server-side super resolution after frame decode. |
| upscaling_scale | integer | No | Super-resolution scale. Current default is 4 when upscaling is enabled. |
| upscaling_model_path | string | No | Optional Real-ESRGAN model path. |
| enable_frame_interpolation | boolean | No | Enable frame interpolation. Keep this disabled when measuring true generated FPS. |
| frame_interpolation_exp | integer | No | Interpolation multiplier exponent. 1 means 2x frames. |
| frame_interpolation_scale | number | No | RIFE internal scale for interpolation. |
| frame_interpolation_model_path | string | No | Optional RIFE model path. |
| condition_inputs.camera_actions | list[list[string]] | No | Initial scripted camera actions, one action list per frame. |
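
Because a malformed init message only surfaces as a server-side error, it can help to validate the map on the client first. The following is a minimal sketch of such a check, covering only constraints stated in the parameter list (required fields, the WIDTHxHEIGHT size format, and the 1 to 100 quality range). The helper name build_init_message is hypothetical, and the positivity check on fps and num_frames is an assumption rather than a documented rule.

```python
import re

REQUIRED_INIT_FIELDS = ("prompt", "first_frame", "size", "fps", "num_frames")
_SIZE_RE = re.compile(r"^[1-9]\d*x[1-9]\d*$")


def build_init_message(**fields) -> dict:
    """Assemble an init map after client-side sanity checks (hypothetical helper)."""
    missing = [name for name in REQUIRED_INIT_FIELDS
               if fields.get(name) in (None, "")]
    if missing:
        raise ValueError(f"missing required init fields: {missing}")
    if not _SIZE_RE.match(fields["size"]):
        raise ValueError("size must look like WIDTHxHEIGHT, e.g. 832x480")
    # Assumption: non-positive values make no sense for a playable stream.
    if fields["fps"] <= 0 or fields["num_frames"] <= 0:
        raise ValueError("fps and num_frames must be positive")
    quality = fields.get("output_compression")
    if quality is not None and not 1 <= quality <= 100:
        raise ValueError("output_compression must be between 1 and 100")
    # "type" is set last so callers cannot override it by accident.
    return {**fields, "type": "init"}
```

The returned dict still has to be MessagePack-encoded before it is sent as the first WebSocket message.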

Runtime event messages

After init, send MessagePack event maps to update the live session.

| Parameter | Type | Required | Meaning |
| --- | --- | --- | --- |
| type | string | Yes | Must be "event". |
| kind | "prompt" or "camera_actions" | Yes | Runtime event kind. |
| payload | string or object/list | Yes | For prompt, a non-empty string. For camera_actions, either scripted list[list[string]] or state-mode payload. |
| event_id | integer | No | Client sequence id. The server echoes it in chunk/frame metadata after the event is sampled. |

camera_actions supports two payload modes:

| Mode | Payload shape | Meaning |
| --- | --- | --- |
| Script | list[list[string]] | A fixed sequence of per-frame actions consumed by upcoming chunks. |
| State | { "mode": "state", "transitions": [{"actions": [...], "client_ts_ms": ...}] } | Live control state transitions from keyboard or UI controls. |
Supported LingBot action tokens include w, a, s, d for camera movement and i, j, k, l for look controls.
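
The two camera_actions payload modes can be built with small helpers like the ones below. This is a sketch under stated assumptions: the helper names are hypothetical, the token set is limited to the eight tokens listed above (the server may accept more), and client_ts_ms defaults to 0 because the source does not say which clock it expects.

```python
import itertools

# Assumption: only the eight tokens named in the text are treated as valid.
KNOWN_ACTIONS = frozenset("wasdijkl")
_event_ids = itertools.count(1)  # client-side sequence for event_id


def _checked(actions) -> list[str]:
    """Return the actions as a list, rejecting tokens outside KNOWN_ACTIONS."""
    actions = list(actions)
    unknown = [a for a in actions if a not in KNOWN_ACTIONS]
    if unknown:
        raise ValueError(f"unknown action tokens: {unknown}")
    return actions


def camera_state_event(actions, client_ts_ms: int = 0) -> dict:
    """State mode: one transition describing the currently held controls."""
    transition = {"actions": _checked(actions), "client_ts_ms": client_ts_ms}
    return {"type": "event", "kind": "camera_actions",
            "event_id": next(_event_ids),
            "payload": {"mode": "state", "transitions": [transition]}}


def camera_script_event(per_frame_actions) -> dict:
    """Script mode: a fixed list of per-frame action lists."""
    payload = [_checked(frame) for frame in per_frame_actions]
    return {"type": "event", "kind": "camera_actions",
            "event_id": next(_event_ids), "payload": payload}
```

Keeping event_id monotonic on the client makes it easy to match the id echoed in chunk/frame metadata against the control input that caused it.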

Server messages

| Message | Payload | Meaning |
| --- | --- | --- |
| frame_batch | MessagePack map with payload bytes | One batch of frames. The map includes chunk_index, num_frames, content_type, encoding, width, height, and frame-batch metadata. |
| chunk_stats | MessagePack map | Per-chunk timing and transport metrics, including scheduler_forward_ms, raw_payload_build_ms, chunk_total_ms, num_frames, and ws_payload_bytes. |
| error | MessagePack map | Server-side validation or generation error. |
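
The chunk_stats fields are enough to estimate whether generation keeps pace with playback. The sketch below compares the time spent on a chunk with the playback time that chunk covers. It assumes chunk_total_ms is the end-to-end wall-clock time for the chunk in milliseconds; the function name chunk_realtime_report is hypothetical.

```python
def chunk_realtime_report(stats: dict, target_fps: float) -> dict:
    """Compare generation time for one chunk with its playback duration."""
    num_frames = stats["num_frames"]
    total_ms = stats["chunk_total_ms"]  # assumed: wall-clock ms for the chunk
    if num_frames <= 0 or total_ms <= 0 or target_fps <= 0:
        raise ValueError("num_frames, chunk_total_ms and target_fps must be positive")
    # Playback time covered by this chunk at the target FPS.
    budget_ms = 1000.0 * num_frames / target_fps
    # Frames produced per second of generation time.
    generated_fps = 1000.0 * num_frames / total_ms
    return {
        "budget_ms": budget_ms,
        "generated_fps": generated_fps,
        "keeps_up": total_ms <= budget_ms,
    }
```

For example, a 9-frame chunk at 25 FPS covers 360 ms of playback, so a chunk_total_ms above 360 means the stream is falling behind. Measure with frame interpolation disabled, since interpolated frames do not reflect generated FPS.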

Minimal client sketch

Python
import msgspec.msgpack
import websocket  # provided by the websocket-client package

# Read the reference image up front so the file handle is closed promptly.
with open("reference.jpg", "rb") as f:
    first_frame = f.read()

ws = websocket.create_connection("ws://127.0.0.1:30000/v1/realtime_video/generate")

# The first message on the socket must be the MessagePack-encoded init map.
ws.send_binary(msgspec.msgpack.encode({
    "type": "init",
    "prompt": "A quiet rainy London alley, stable camera motion.",
    "first_frame": first_frame,
    "size": "832x480",
    "fps": 25,
    "num_frames": 9,
    "num_inference_steps": 4,
    "guidance_scale": 1,
    "realtime_output_format": "webp",
    "output_compression": 95,
}))

# Runtime event: a state-mode camera update with the "w" control active.
ws.send_binary(msgspec.msgpack.encode({
    "type": "event",
    "kind": "camera_actions",
    "event_id": 1,
    "payload": {"mode": "state", "transitions": [{"actions": ["w"], "client_ts_ms": 0}]},
}))

5. Consistency

Consistency checks for LingBot World compare raw frames received over the WebSocket against ground-truth (GT) frames, and apply per-chunk latency guards.
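
A check of this kind can be sketched as a byte-level comparison plus a latency bound. The code below is not the project's test harness. The function name check_chunk and both thresholds are hypothetical, and it assumes the frames arrive as raw RGB bytes with the reference frames stored in the same layout.

```python
def check_chunk(raw_frames: bytes, gt_frames: bytes, chunk_total_ms: float,
                max_abs_diff: int = 0, max_chunk_ms: float = 1000.0) -> list[str]:
    """Return a list of problems for one chunk; an empty list means it passed."""
    problems = []
    if len(raw_frames) != len(gt_frames):
        problems.append(
            f"size mismatch: got {len(raw_frames)} bytes, expected {len(gt_frames)}")
    else:
        # Largest per-byte (per-channel) deviation from the reference.
        worst = max((abs(a - b) for a, b in zip(raw_frames, gt_frames)), default=0)
        if worst > max_abs_diff:
            problems.append(f"pixel deviation {worst} exceeds {max_abs_diff}")
    # Latency guard: hypothetical upper bound on per-chunk time.
    if chunk_total_ms > max_chunk_ms:
        problems.append(f"chunk took {chunk_total_ms} ms, limit {max_chunk_ms} ms")
    return problems
```

A tolerance of 0 requires bit-exact output, which is only meaningful with a fixed seed and the raw output format; lossy webp or jpeg previews would need a looser tolerance.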

6. Notes

  • Use the realtime endpoint for interactive sessions: /v1/realtime_video/generate.
  • Prefer WebP preview transport for interactive testing; use raw-frame transport for consistency checks.
  • Validate long-running sessions with raw-frame consistency checks before changing causal cache, condition sampling, or VAE decode behavior.