1. Model Introduction
MiniCPM-V 4.6 is the next-generation multimodal model from OpenBMB, the team behind the MiniCPM-V series. The model combines a Qwen3.5-style hybrid LLM backbone (Gated Delta Net + full attention) with a NaViT-packed vision encoder that handles arbitrary aspect ratios and high-resolution slicing natively, plus end-to-end video support. OpenBMB ships two variants on HuggingFace:openbmb/MiniCPM-V-4.6— base instruct model. Use this for general multimodal serving; thinking mode is still available per-request viachat_template_kwargs.enable_thinking=true.openbmb/MiniCPM-V-4.6-Thinking— thinking-tuned variant with stronger chain-of-thought behavior. Pair with the same--reasoning-parser qwen3flag.
- Hybrid LLM backbone: Qwen3.5-style mix of Gated Delta Net (linear-attention) layers and full-attention layers, providing long-context efficiency without giving up modeling power.
- Native variable-resolution vision: NaViT-packed vision encoder with mid-ViT merger and per-image window attention. Images of any aspect ratio are processed without forced letterboxing.
- High-resolution slicing: Source image plus a configurable grid of slice tiles (up to 9 tiles in the open test variant) lets the model reason over fine detail in 1280×720+ images.
- Video: Frame-by-frame multi-modal data items routed through the same vision encoder; any number of frames per request.
- Reasoning Parser: switchable thinking mode (Qwen3.5 lineage), exposed via
chat_template_kwargs.enable_thinkingper request and SGLang’s--reasoning-parser qwen3on the server side. - Tool Calling: Qwen3.5-style
<tool_call><function=…><parameter=…>…</parameter></function></tool_call>XML format, surfaced as OpenAI-compatiblemessage.tool_callsvia SGLang’s--tool-call-parser qwen3_coder. Composes with thinking mode and with image / video inputs.
2. SGLang Installation
Pull the nightly Docker image (rolling tag, tracksmain):
3. Model Deployment
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to generate the appropriate deployment command. TheVariant toggle switches between openbmb/MiniCPM-V-4.6 (base) and openbmb/MiniCPM-V-4.6-Thinking. The Reasoning Parser and Tool Call Parser toggles add --reasoning-parser qwen3 and --tool-call-parser qwen3_coder respectively; see §4.4 for usage details.
3.2 Configuration Tips
- Mamba Radix Cache: Qwen3.5’s hybrid Gated Delta Networks architecture supports two mamba scheduling strategies via
--mamba-scheduler-strategy:- V1 (
no_buffer): Default. No overlap scheduler, lower memory usage. Required for AMD MI GPUs. - V2 (
extra_buffer): Enables overlap scheduling and branching point caching with--mamba-scheduler-strategy extra_buffer --page-size 64. Requires FLA kernel backend (NVIDIA GPUs only). Trades higher mamba state memory for better throughput. Strictly superior in non-KV-cache-bound scenarios; in KV-cache-bound cases, weigh the overlap scheduling benefit against reduced max concurrency.--page-sizemust satisfyFLA_CHUNK_SIZE % page_size == 0orpage_size % FLA_CHUNK_SIZE == 0(FLA_CHUNK_SIZEis currently 64).
- V1 (
- The
--mem-fraction-staticflag is recommended for optimal memory utilization, adjust it based on your hardware and workload. - Context length defaults to 262,144 tokens. If you encounter OOM errors, consider reducing it, but maintain at least 128K to preserve thinking capabilities.
- To speed up weight loading for this large model, add
--model-loader-extra-config='{"enable_multithread_load": "true","num_threads": 64}'to the launch command. - CUDA IPC Transport: Add
SGLANG_USE_CUDA_IPC_TRANSPORT=1as an environment variable to use CUDA IPC for transferring multimodal features, significantly improving TTFT (Time To First Token). Note: this consumes additional memory proportional to image size, so you may need to lower--mem-fraction-staticor--max-running-requests. - Multimodal Attention Backend: Use
--mm-attention-backend fa3on H100/H200 for better vision performance, or--mm-attention-backend fa4on B200/B300. - For processing large images or videos, you may need to lower
--mem-fraction-staticto leave room for image feature tensors. - Multi-image and high-resolution images: the image processor produces one source patch plus per-slice tile patches; each is its own
MultimodalDataItem. No special server-side flag needed. - Video: decoded frame-by-frame through the same image-style slicer. No extra flag needed; pass
video_urlin the OpenAI chat completion request. - Chunked Prefill: For high-concurrency vision benchmarking with many large/sliced images, pass
--chunked-prefill-size -1to disable prefill chunking. The default chunked-prefill path can mis-split a request across an image boundary inmm_utils.embed_mm_inputsand crash the server; disabling chunking sidesteps this at the cost of higher TTFT under concurrency. For interactive serving leave the default on.
4. Model Invocation
Deploy the model on an H200:Command
4.1 Basic Usage (Image)
Example
Output
4.2 High-Resolution / Sliced Images
The image processor automatically picks a slice grid (up to 9 tiles) for high-resolution inputs. A 1280×720 source produces grid[2, 3]
- 7 patches with
tgt_sizes=[(24, 44), 6×(28, 36)], byte-for-byte matching the HF reference implementation.
Example
Output
4.3 Video Input
Example
Output
4.4 Advanced Usage
4.4.1 Reasoning Parser
Pass--reasoning-parser qwen3 to the server (toggle “Reasoning Parser” on in §3.1, default) so SGLang splits each response on the <think> / </think> boundaries: the pre-</think> block goes to reasoning_content, the post-</think> text to content. Per-request, the chat template’s enable_thinking flag toggles whether the model actually emits reasoning.
- Thinking mode (default,
enable_thinking=true): assistant prompt ends with<think>\n; the model writes reasoning, closes with</think>, then the answer.reasoning_contentandcontentare both populated. - Instruct mode (
enable_thinking=false): the chat template injects an empty<think></think>placeholder so the model emits no thinking tokens;reasoning_contentends up empty.
Example (thinking mode)
Output
Example (instruct mode)
Output
4.4.2 Tool Calling
Pass--tool-call-parser qwen3_coder to the server (toggle “Tool Call Parser” on in §3.1) so SGLang extracts <tool_call> blocks from the model output into the OpenAI-style message.tool_calls field (with finish_reason="tool_calls"). The model speaks the Qwen3.5 XML tool-call format (<tool_call><function=name><parameter=k>v</parameter></function></tool_call>); the qwen3_coder parser is the right one. Tool calls compose with both reasoning modes and with image / video inputs.
Example
Output
tool role message and call the API again with the same tools list — the model emits finish_reason="stop" with the answer in content.
5. Benchmark
Common Test Environment (all benchmarks below):- Hardware: 1× NVIDIA H200 (141 GB), single GPU (no TP / DP)
- Docker Image:
lmsysorg/sglang:dev(transformers 5.6.0, sgl-kernel 0.4.2.post1) - Precision: BF16
Command
--chunked-prefill-size -1 is required for the vision throughput run; see §3.2.)
5.1 Accuracy Benchmark
5.1.1 MMMU Benchmark
- Benchmark Command
Command
- Test Result
5.2 Speed Benchmark
We use SGLang’s built-inbench_serving tool with random text prompts (1000 input / 1000 output tokens) to characterize text-only serving performance.
5.2.1 Latency Benchmark
Command
Output
5.2.2 Throughput Benchmark
Command
Output
5.3 Vision Speed Benchmark
We use SGLang’s built-inbench_serving tool with random images. Each request has 128 input text tokens, one 720p image, and 1024 output tokens.
5.3.1 Latency Benchmark
Command
Output
5.3.2 Throughput Benchmark
Command
Output
