1. Model Introduction
MiMo-V2.5-Pro and MiMo-V2.5 are next-generation Mixture-of-Experts models from the XiaomiMiMo Team.

| Variant | Total params | Active (MoE) | Modalities |
|---|---|---|---|
| MiMo-V2.5-Pro | 1.02T | 42B | Text (multimodal planned) |
| MiMo-V2.5 | 310B | 15B | Text, Image, Video, Audio |
- Hybrid Attention Architecture: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) for reduced KV cache while preserving long-context capability.
- Multi-Token Prediction (MTP): 3-layer MTP module accelerates decoding (329M params on V2.5; V2.5-Pro supports EAGLE speculative decoding on top of MTP).
- 1M-Token Context: Both variants support up to 1 million token context windows.
- Agentic Capabilities: Post-training with large-scale agentic RL achieves strong performance on coding, reasoning, and tool-use benchmarks.
- MiMo-V2.5 Multimodal (V2.5 only): Native omnimodal architecture with a 729M-param ViT Vision Encoder (28 layers: 24 SWA + 4 Full) and a 261M-param Audio Transformer (24 layers: 12 SWA + 12 Full); supports image, video, and audio understanding via standard OpenAI-compatible multimodal API.
2. SGLang Installation
Refer to the official SGLang installation guide.

Docker Images by Variant × Hardware:

| Variant | Hardware | Docker Image |
|---|---|---|
| MiMo-V2.5 (310B) | H100 / H200 (Hopper, CUDA 12.9) | lmsysorg/sglang:dev-mimo-v2.5 |
| MiMo-V2.5 (310B) | B200 / GB300 (Blackwell, CUDA 13.0) | lmsysorg/sglang:dev-cu13-mimo-v2.5 |
| MiMo-V2.5-Pro (1.02T) | H100 / H200 (Hopper, CUDA 12.9) | lmsysorg/sglang:dev-mimo-v2.5-pro |
| MiMo-V2.5-Pro (1.02T) | B200 / GB300 (Blackwell, CUDA 13.0) | lmsysorg/sglang:dev-cu13-mimo-v2.5-pro |
Pull the image matching your GPU’s CUDA driver; `lmsysorg/sglang:latest` will not load either checkpoint.
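For example, pulling and entering the Hopper image for MiMo-V2.5 might look like the following sketch (image tag from the table above; the `docker run` flags are a common setup, not prescriptive, and the cache mount path is an assumption):

```shell
# Pull the Hopper (CUDA 12.9) image for MiMo-V2.5 (310B)
docker pull lmsysorg/sglang:dev-mimo-v2.5

# Start a container with all GPUs, host networking, and the local
# Hugging Face cache mounted so weights are not re-downloaded
docker run --gpus all --network host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -it lmsysorg/sglang:dev-mimo-v2.5 bash
```

For Blackwell hardware or the Pro variant, substitute the matching tag from the table.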
3. Model Deployment
3.1 Basic Configuration
Use the selector below to generate the deployment command for your variant and hardware.

3.2 Configuration Tips
MiMo-V2.5-Pro (1.02T):
- B200: single node, TP=8 (verified). Uses `--attention-backend fa4` + `--moe-runner-backend flashinfer_trtllm` + `--mem-fraction-static 0.8`. Set `--swa-full-tokens-ratio 0.1` to keep the KV-cache footprint within 192 GB HBM.
- GB300: 2 nodes, TP=8 (verified). Same Blackwell stack as B200; the multi-node interconnect requires `NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1`. The default SWA ratio is fine.
- H100/H200: 2 nodes × 8 GPUs (TP=16, not yet verified). Uses the Hopper stack (`fa3` + DeepEP + EAGLE multi-layer); fits with `--mem-fraction-static 0.7` and `--swa-full-tokens-ratio 0.3`. DeepEP dispatch tuning: `SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256` avoids memory spikes during prefill.
- EAGLE speculative decoding (3 steps, topk=1) typically yields a 2–3× decode speedup. Requires `SGLANG_ENABLE_SPEC_V2=1`; on Hopper, also pass `--enable-multi-layer-eagle`.

MiMo-V2.5 (310B):
- The checkpoint has a TP=4-interleaved fused `qkv_proj`, so attention-TP per DP group must be 4. DP-attention is therefore always required (`--dp = TP / 4`), and the total GPU count must be a multiple of 4. A bare `--tp 8` without `--dp 2` fails to load with `MiMoV2Omni fused qkv_proj checkpoint is TP=4-interleaved; got tp_size=8`.
- Single-node deployments: H100/H200 8× GPUs (`--tp 8 --dp 2`), B200 4× GPUs (`--tp 4`, dp=1, no DP-attn flag needed), GB300 4× GPUs (`--tp 4`, single NVL4 node). FP8 quantization.
- `--enable-dp-lm-head` and `--mm-enable-dp-encoder` are required whenever `--enable-dp-attention` is on, to keep LM head and encoder sharding consistent.
- Multimodal: supports image, video, and audio understanding; see Section 4.3 for invocation examples.

DeepEP:
- DeepEP replaces the default MoE all-to-all dispatch with a fused DeepEP backend. It lowers expert-dispatch latency and memory traffic, so it pays off under high-concurrency, throughput-bound workloads on H100/H200; under concurrency=1, latency-bound workloads the gain is negligible, so leave it off.
- Enabling it adds `--moe-a2a-backend deepep` + `--moe-dense-tp-size 1` (and `--ep <tp>` for Pro), plus the `SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256` env var to cap the dispatch buffer. Requires `pip install deep_ep` (not part of the default sglang install).
- On Blackwell (B200, GB300) the verified MoE backend is `flashinfer_trtllm`; the DeepEP toggle is a no-op there.
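As a concrete illustration of the tips above, a single-node B200 launch of MiMo-V2.5-Pro might look like this sketch. The flags are taken from the notes above; the port is a placeholder, and the exact speculative-decoding flags can vary across sglang versions, so treat this as a starting point rather than a verified recipe:

```shell
# Single-node B200, TP=8, Blackwell stack (sketch; verify flags for your version)
SGLANG_ENABLE_SPEC_V2=1 \
python -m sglang.launch_server \
  --model-path XiaomiMiMo/MiMo-V2.5-Pro \
  --tp 8 \
  --attention-backend fa4 \
  --moe-runner-backend flashinfer_trtllm \
  --mem-fraction-static 0.8 \
  --swa-full-tokens-ratio 0.1 \
  --port 30000
```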
4. Model Invocation
4.1 Basic Usage
See Basic API Usage.

4.2 Reasoning Output
Both variants support hybrid thinking mode; thinking content is separated out via the reasoning parser.

Thinking Mode (default):

Example
Example
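A minimal sketch of a thinking-mode request over the OpenAI-compatible API. The model name and the `chat_template_kwargs` toggle are assumptions for illustration; with the reasoning parser enabled, the thinking text is typically returned in a separate field rather than inline `<think>` tags:

```python
# Sketch of a thinking-mode chat request (field names are assumptions;
# check your server's reasoning-parser configuration).
payload = {
    "model": "XiaomiMiMo/MiMo-V2.5-Pro",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}],
    # Hypothetical toggle for hybrid thinking mode:
    "chat_template_kwargs": {"enable_thinking": True},
}

# To send it against a running server:
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
# resp = client.chat.completions.create(**payload)
# print(resp.choices[0].message.reasoning_content)  # separated thinking
# print(resp.choices[0].message.content)            # final answer
```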
4.3 Multimodal Invocation (V2.5 only)
Image Understanding:

Example
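A sketch of building an image-understanding message in the standard OpenAI-compatible multimodal format (the helper name and data-URL scheme are illustrative, not part of the API):

```python
import base64

def image_message(image_path: str, question: str) -> dict:
    """Build a user message carrying one base64-encoded image and a text question."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": question},
        ],
    }

# Pass the result in `messages` to client.chat.completions.create(...)
```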
Video Understanding:

Example

Video decoding requires `decord` (`pip install decord`); SGLang’s MiMo-V2.5 multimodal processor uses `decord.VideoReader` for frame extraction.

Audio Understanding:
Example
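A sketch of an audio-understanding message. The `input_audio` content part follows the OpenAI audio-input convention; confirm the exact shape your server build accepts, as this is an assumption:

```python
import base64

def audio_message(wav_path: str, question: str) -> dict:
    """Build a user message carrying one base64-encoded WAV clip and a question."""
    with open(wav_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "role": "user",
        "content": [
            {"type": "input_audio",
             "input_audio": {"data": b64, "format": "wav"}},
            {"type": "text", "text": question},
        ],
    }
```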
4.4 Tool Calling
Example
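A sketch of a tool-calling request using the standard OpenAI `tools` schema. The `get_weather` function is a made-up example, not part of the model or server:

```python
# Hypothetical tool definition in the standard OpenAI `tools` format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "XiaomiMiMo/MiMo-V2.5",
    "messages": [{"role": "user", "content": "What's the weather in Beijing?"}],
    "tools": tools,
    "tool_choice": "auto",
}

# Send with client.chat.completions.create(**payload); if the model decides
# to call a tool, the call appears in resp.choices[0].message.tool_calls.
```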
5. Benchmark
Accuracy numbers come from `sglang.test.run_eval` (GSM8K standard 5-shot, MMMU validation split). Speed numbers come from `sglang.bench_serving` against the ShareGPT_Vicuna_unfiltered dataset; each request is configured with 1024 input tokens and 1024 output tokens to represent a typical medium-length conversation.
5.1 Accuracy Benchmark
5.1.1 GSM8K
Standard 5-shot, `temperature=0`, `max_tokens=4096`; the model defaults to thinking-on (responses contain `<think>...</think>`, and the eval extracts the trailing number via regex). Server launch: see Section 3.
Benchmark Command:
Command
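The invocation presumably resembles the following sketch; the port is a placeholder and `run_eval` flag names may differ across sglang versions, so check `python -m sglang.test.run_eval --help` before relying on it:

```shell
# GSM8K eval against a locally running server (sketch; verify flag names)
python -m sglang.test.run_eval \
  --base-url http://localhost:30000 \
  --eval-name gsm8k \
  --num-examples 1319
```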
`run_eval.py` automatically appends `/v1` to `--base-url`; pass the bare `host:port` URL (without a trailing `/v1`), otherwise requests resolve to `/v1/v1/chat/completions` and 404.
- Test Results:
- MiMo-V2.5-Pro (FP8)
- MiMo-V2.5 (FP8, 8× H200)
5.1.2 MMMU (V2.5 only)
MMMU/MMMU validation split (multi-discipline multimodal), concurrency=16, default sampling.
- Benchmark Command:
Command
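Given the setup above (validation split, concurrency=16, default sampling), the command likely resembles this sketch; the eval name and concurrency flag are assumptions to verify against your sglang version:

```shell
# MMMU validation eval (sketch; verify flag names with --help)
python -m sglang.test.run_eval \
  --base-url http://localhost:30000 \
  --eval-name mmmu \
  --num-threads 16
```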
- Test Results:
- MiMo-V2.5 (FP8)
5.2 Speed Benchmark — MiMo-V2.5-Pro
Test Environment:
- Hardware: NVIDIA B200 GPU (8×)
- Model: `XiaomiMiMo/MiMo-V2.5-Pro` (FP8)
- Tensor Parallelism: 8
- Recipe: Balanced (DP-attn + DeepEP + EAGLE MTP)
- sglang version: Pending update
5.2.1 Latency-Sensitive Benchmark
- Model Deployment Command: see the command panel above.
- Benchmark Command:
Command
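A latency-oriented `bench_serving` sketch consistent with the 1024-in/1024-out setup described in Section 5. This uses the `random` dataset to pin request shapes exactly; the ShareGPT run reported here may have used different dataset flags, and the prompt count is a placeholder:

```shell
# Latency-sensitive: concurrency 1, fixed 1024/1024 request shape (sketch)
python -m sglang.bench_serving \
  --backend sglang \
  --base-url http://localhost:30000 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 16 \
  --max-concurrency 1
```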
- Test Results:
Output
5.2.2 Throughput-Sensitive Benchmark
- Model Deployment Command: see the command panel above.
- Benchmark Command:
Command
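A throughput-oriented variant of the same sketch; the concurrency and prompt counts below are illustrative placeholders, not the values used for the reported numbers:

```shell
# Throughput-sensitive: high concurrency, same 1024/1024 request shape (sketch)
python -m sglang.bench_serving \
  --backend sglang \
  --base-url http://localhost:30000 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 512 \
  --max-concurrency 64
```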
- Test Results:
Output
5.3 Speed Benchmark — MiMo-V2.5
Test Environment:
- Hardware: NVIDIA H200 GPU (8×)
- Model: `XiaomiMiMo/MiMo-V2.5` (FP8)
- Tensor Parallelism: 8 (DP-attention with `--dp-size 2`)
- Recipe: Balanced (DP-attn)
- sglang version: 1.1.2.dev9066
5.3.1 Latency-Sensitive Benchmark
- Model Deployment Command: see the command panel above.
- Benchmark Command:
Command
- Test Results:
Output
5.3.2 Throughput-Sensitive Benchmark
- Model Deployment Command: see the command panel above.
- Benchmark Command:
Command
- Test Results:
Output
5.3.3 Multimodal (Image) Benchmark
- Model Deployment Command: see the command panel above.
- Benchmark Command:
Command
- Test Results:
Output
