1. Model Introduction
MOVA (MOSS Video and Audio) is a foundation model developed by the SII-OpenMOSS Team, designed to break the “silent era” of open-source video generation. Unlike cascaded pipelines that generate sound as an afterthought, MOVA synthesizes video and audio simultaneously in a single inference pass for perfect alignment. It adopts an Asymmetric Dual-Tower Architecture, fusing pre-trained video and audio towers through a bidirectional cross-attention mechanism to maintain tight synchronization between video and audio during generation.

MOVA-360p is suitable for fast inference and resource-constrained environments. MOVA-720p provides higher resolution video generation. Both versions support generating up to 8 seconds of video-audio content.

Key Features:
- Native Bimodal Generation: Generates high-fidelity video and synchronized audio in a single inference pass, eliminating error accumulation from cascaded pipelines
- Precise Lip-Sync: Achieves state-of-the-art performance in multilingual lip-synchronization (LSE-D: 7.094, LSE-C: 7.452 with Dual CFG on Verse-Bench Set3)
- Environment-Aware Sound Effects: Generates corresponding environmental sound effects including physical interaction sounds, ambient sounds, and spatial/textural sound feedback
- Fully Open-Source: Model weights, inference code, training pipelines, and LoRA fine-tuning scripts are all open-sourced
2. SGLang-diffusion Installation
SGLang-diffusion offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. Please refer to the official SGLang-diffusion installation guide for installation instructions.
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
MOVA supports both online serving and CLI generation modes. The recommended launch configurations vary by hardware and resolution.
3.2 Configuration Tips
The currently supported optimizations are listed here.
- --num-gpus: Number of GPUs to use
- --tp: Tensor parallelism size (should not be larger than 1 if text encoder offload is enabled, as layer-wise offload plus prefetch is faster)
- --ring-degree: The degree of ring attention-style sequence parallelism (SP) in USP
- --ulysses-degree: The degree of DeepSpeed-Ulysses-style SP in USP
- --adjust-frames: Whether to adjust frames automatically (set to false for MOVA)
- --enable-torch-compile: Enable torch.compile for faster inference
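As a rough illustration of how these options combine, the following Python sketch assembles an argument list for `sglang generate` and enforces the one constraint stated above (tensor parallelism no larger than 1 when text encoder offload is on). The helper name `build_generate_command` is hypothetical, and `--model-path`, `--prompt`, and the `--adjust-frames false` value syntax are assumptions rather than confirmed CLI details; check them against `sglang generate --help` before use.

```python
import shlex


def build_generate_command(model_path, prompt, num_gpus=1, tp=1,
                           ring_degree=None, ulysses_degree=None,
                           text_encoder_cpu_offload=False,
                           enable_torch_compile=False):
    """Assemble an argv list for `sglang generate` (illustrative helper only)."""
    # Constraint from the tips above: --tp should not exceed 1 when the
    # text encoder is offloaded to CPU.
    if text_encoder_cpu_offload and tp > 1:
        raise ValueError("--tp should not be larger than 1 with text encoder offload")

    argv = [
        "sglang", "generate",
        "--model-path", model_path,   # assumed flag name
        "--prompt", prompt,           # assumed flag name
        "--num-gpus", str(num_gpus),
        "--tp", str(tp),
        "--adjust-frames", "false",   # disabled for MOVA; value syntax assumed
    ]
    # Sequence-parallel degrees are only passed when explicitly requested.
    if ring_degree is not None:
        argv += ["--ring-degree", str(ring_degree)]
    if ulysses_degree is not None:
        argv += ["--ulysses-degree", str(ulysses_degree)]
    # Treated here as plain on/off switches, which is an assumption.
    if text_encoder_cpu_offload:
        argv.append("--text-encoder-cpu-offload")
    if enable_torch_compile:
        argv.append("--enable-torch-compile")
    return argv


if __name__ == "__main__":
    cmd = build_generate_command(
        "OpenMOSS-Team/MOVA-720p",
        "A dog barking on a beach",
        num_gpus=8,
        ulysses_degree=8,
    )
    # Print a shell-quoted command line for inspection instead of running it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the prompt safely quoted if the command is later passed to `subprocess.run`.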
4. API Usage
For complete API documentation, please refer to the official API usage guide.
4.1 CLI Generation (sglang generate)
Command
4.2 Generate a Video
Command
4.3 Advanced Usage
4.3.1 Cache-DiT Acceleration
SGLang integrates Cache-DiT, a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 7.4x inference speedup with minimal quality loss. You can set SGLANG_CACHE_DIT_ENABLED=True to enable it. For more details, please refer to the SGLang Cache-DiT documentation.
Basic Usage
Command
- DBCache Parameters: DBCache controls block-level caching behavior:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Fn | SGLANG_CACHE_DIT_FN | 1 | Number of first blocks to always compute |
| Bn | SGLANG_CACHE_DIT_BN | 0 | Number of last blocks to always compute |
| W | SGLANG_CACHE_DIT_WARMUP | 4 | Warmup steps before caching starts |
| R | SGLANG_CACHE_DIT_RDT | 0.24 | Residual difference threshold |
| MC | SGLANG_CACHE_DIT_MC | 3 | Maximum continuous cached steps |
- TaylorSeer Configuration: TaylorSeer improves caching accuracy using Taylor expansion:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Enable | SGLANG_CACHE_DIT_TAYLORSEER | false | Enable TaylorSeer calibrator |
| Order | SGLANG_CACHE_DIT_TS_ORDER | 1 | Taylor expansion order (1 or 2) |
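The environment variables in the two tables above can be collected in one place before launching SGLang. The sketch below is a minimal example under stated assumptions: the helper `cache_dit_env` is hypothetical, the defaults are copied from the tables, and the exact spellings SGLang accepts for boolean values (`True`, `true`, `false`) are assumed rather than confirmed.

```python
import os

# Defaults copied from the DBCache parameter table above.
DBCACHE_DEFAULTS = {
    "SGLANG_CACHE_DIT_FN": "1",       # first blocks always computed
    "SGLANG_CACHE_DIT_BN": "0",       # last blocks always computed
    "SGLANG_CACHE_DIT_WARMUP": "4",   # warmup steps before caching starts
    "SGLANG_CACHE_DIT_RDT": "0.24",   # residual difference threshold
    "SGLANG_CACHE_DIT_MC": "3",       # maximum continuous cached steps
}


def cache_dit_env(taylorseer=False, ts_order=1, **dbcache_overrides):
    """Return Cache-DiT environment variables as a dict (illustrative helper)."""
    if ts_order not in (1, 2):
        raise ValueError("Taylor expansion order must be 1 or 2")

    env = {"SGLANG_CACHE_DIT_ENABLED": "True"}
    env.update(DBCACHE_DEFAULTS)
    # Boolean spelling is an assumption; the table only lists "false".
    env["SGLANG_CACHE_DIT_TAYLORSEER"] = "true" if taylorseer else "false"
    env["SGLANG_CACHE_DIT_TS_ORDER"] = str(ts_order)

    # Overrides use the short suffix, e.g. rdt=0.12 -> SGLANG_CACHE_DIT_RDT.
    for name, value in dbcache_overrides.items():
        key = "SGLANG_CACHE_DIT_" + name.upper()
        if key not in DBCACHE_DEFAULTS:
            raise KeyError(f"unknown DBCache parameter: {name}")
        env[key] = str(value)
    return env


if __name__ == "__main__":
    settings = cache_dit_env(taylorseer=True, ts_order=1, rdt=0.12)
    # Merge with the current environment, ready for subprocess.run(..., env=...).
    full_env = {**os.environ, **settings}
    for key in sorted(settings):
        print(f"export {key}={full_env[key]}")
```

Passing the merged dictionary as `env=` to `subprocess.run` scopes the settings to the SGLang process without changing the parent shell.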
Command
4.3.2 CPU Offload
- --dit-cpu-offload: Use CPU offload for DiT inference. Enable this if you run out of memory.
- --text-encoder-cpu-offload: Use CPU offload for text encoder inference.
- --vae-cpu-offload: Use CPU offload for the VAE.
- --pin-cpu-memory: Pin memory for CPU offload. Add this only as a temporary workaround if inference throws “CUDA error: invalid argument”.
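To show how these offload flags might be selected per component, here is a small Python sketch. The mapping and the helper `offload_args` are hypothetical, and the flags are treated as plain on/off switches, which is an assumption about the CLI rather than documented behavior.

```python
# Component name -> CLI flag, taken from the list above.
OFFLOAD_FLAGS = {
    "dit": "--dit-cpu-offload",
    "text_encoder": "--text-encoder-cpu-offload",
    "vae": "--vae-cpu-offload",
}


def offload_args(components, pin_cpu_memory=False):
    """Return the offload flags for the requested components (illustrative)."""
    unknown = set(components) - set(OFFLOAD_FLAGS)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    # Keep a stable order regardless of how the components were passed in.
    args = [flag for name, flag in OFFLOAD_FLAGS.items() if name in components]
    # Only meant as a temporary workaround for "CUDA error: invalid argument".
    if pin_cpu_memory:
        args.append("--pin-cpu-memory")
    return args


if __name__ == "__main__":
    print(offload_args({"dit", "vae"}))
```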
5. Benchmark
5.1 Speedup Benchmark
5.1.1 Generate a video
Test Environment:
- Hardware: NVIDIA H200 x 8
- git revision: 443b1a8
- Model: OpenMOSS-Team/MOVA-720p
Command
Command
Output
5.1.2 Generate videos with high concurrency
Server Command:
Command
Command
