1. Model Introduction
Mistral Small 4 is a powerful hybrid model from Mistral AI that unifies the capabilities of three model families into a single model: Instruct, Reasoning (formerly called Magistral), and Agentic (formerly called Devstral). With its multimodal capabilities, efficient MoE architecture, and flexible mode switching, Mistral Small 4 is a versatile general-purpose model for virtually any task. In a latency-optimized setup it achieves a 40% reduction in end-to-end completion time; in a throughput-optimized setup it delivers 3× more requests per second than Mistral Small 3.
Key Features:
- Hybrid Reasoning: Switch between instant-reply mode and deep reasoning/thinking mode; reasoning effort is configurable per request
- Vision: Accepts both text and image inputs, providing insights based on visual content
- Function Calling: Native tool calling and JSON output support with best-in-class agentic capabilities
- Multilingual: Supports dozens of languages including English, French, Spanish, German, Chinese, Japanese, Korean, Arabic, and more
- Context Window: 256K context window
- Efficient MoE: 119B total parameters, 128 experts, 4 active per token (6.5B activated parameters)
- Apache 2.0 License: Open-source, usable and modifiable for commercial and non-commercial purposes
- Supported reasoning effort values are only "none" and "high"
- Same general architecture as Mistral 3
- MoE: 128 experts, 4 active per token
- 119B total parameters, 6.5B activated per token
- Multimodal input: text + image
Available checkpoints:
- mistralai/Mistral-Small-4-119B-2603 (FP8)
- mistralai/Mistral-Small-4-119B-2603-NVFP4
- mistralai/Leanstral-2603 — same architecture, use the same launch commands as Mistral-Small-4-119B-2603
- mistralai/Mistral-Small-4-119B-2603-eagle — EAGLE speculative decoding weights for faster inference
2. SGLang Installation
SGLang offers multiple installation methods; choose the one that best fits your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions.
Mistral Small 4 support landed in sgl-project/sglang#20708 and has been merged into main. A model-specific Docker image is no longer required; use the standard SGLang installation methods from the official installation guide.
3. Model Deployment
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to generate a launch command for Mistral Small 4.
3.2 Configuration Tips
- Tensor Parallelism: Mistral Small 4 FP8 (~119 GB) requires tp=2 on Hopper (H100/H200), tp=1 on Blackwell (B200/B300). NVFP4 (~60 GB, Blackwell only) runs with tp=1.
- Reasoning effort: Reasoning depth is configurable per request via `reasoning_effort` ("none" or "high"). No restart required; toggle it per call.
- Context length vs memory: The model has a 256K context window. If you are memory-constrained, lower `--context-length` (e.g. 32768) and increase it once things are stable.
- Tool calling: Enable `--tool-call-parser mistral` to activate native function calling support.
- Reasoning parser: Enable `--reasoning-parser mistral` to separate `reasoning_content` from the main response content.
- Speculative decoding (EAGLE): Enable with `--speculative-algorithm EAGLE --speculative-draft-model-path mistralai/Mistral-Small-4-119B-2603-eagle`, using the EAGLE weights for lower latency.
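As one illustrative launch command combining the tips above (the `--tp 2` degree assumes a 2× Hopper GPU setup per the tensor-parallelism tip; port 30000 is SGLang's default):

```shell
# Serve the FP8 checkpoint on two Hopper GPUs with tool calling,
# the Mistral reasoning parser, and a reduced context length.
python3 -m sglang.launch_server \
  --model-path mistralai/Mistral-Small-4-119B-2603 \
  --tp 2 \
  --tool-call-parser mistral \
  --reasoning-parser mistral \
  --context-length 32768 \
  --port 30000
```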
4. Model Invocation
4.1 Thinking Mode
Mistral Small 4 is a hybrid reasoning model. By default it does not produce a reasoning trace; set `reasoning_effort` to "high" in the request to toggle reasoning on.
Example
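As a minimal sketch, a thinking-mode request against a local SGLang server's OpenAI-compatible endpoint might look like the following (the localhost URL, port, and prompt are assumptions; the request is built but not sent here):

```python
import json
import urllib.request

# Chat-completion request with deep reasoning enabled for this call.
payload = {
    "model": "mistralai/Mistral-Small-4-119B-2603",
    "messages": [{"role": "user", "content": "How many primes are below 30?"}],
    "reasoning_effort": "high",  # "none" would disable the reasoning trace
    "max_tokens": 1024,
}

def send(body, url="http://localhost:30000/v1/chat/completions"):
    """POST the request to a running SGLang server (not executed here)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With --reasoning-parser mistral, the returned message carries a separate
# "reasoning_content" field alongside the final "content".
```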
Output
4.2 Instruct Mode (Reasoning Off)
To skip the reasoning trace and get a fast direct response, set `reasoning_effort` to "none":
Example
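A sketch of the same request body with reasoning disabled (model name as above; the prompt is an assumption):

```python
# Same chat-completion body, but with the reasoning trace disabled.
payload = {
    "model": "mistralai/Mistral-Small-4-119B-2603",
    "messages": [{"role": "user", "content": "Give me one synonym for 'happy'."}],
    "reasoning_effort": "none",  # instant reply, no thinking tokens
    "max_tokens": 64,
}
# POST this to /v1/chat/completions; the response's message.content
# arrives directly, with no reasoning_content.
```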
Output
4.3 Streaming with Reasoning
Example
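A streaming sketch: the request adds `"stream": true`, and the server replies with Server-Sent Events. When launched with `--reasoning-parser mistral`, each chunk's delta may carry `reasoning_content` (thinking tokens) or `content` (the answer). The parser below and the sample chunk are illustrative assumptions:

```python
import json

# Streaming request body: a normal chat completion plus "stream": true.
payload = {
    "model": "mistralai/Mistral-Small-4-119B-2603",
    "messages": [{"role": "user", "content": "Explain MoE routing briefly."}],
    "reasoning_effort": "high",
    "stream": True,
}

def parse_sse_line(line: str):
    """Extract the text delta from one SSE 'data:' line; None otherwise."""
    if not line.startswith("data:"):
        return None  # keep-alives / comments
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None  # end-of-stream sentinel
    chunk = json.loads(data)
    delta = chunk["choices"][0]["delta"]
    # Thinking tokens stream in "reasoning_content", the answer in "content".
    return delta.get("reasoning_content") or delta.get("content")

# Hypothetical chunk for illustration:
sample = 'data: {"choices":[{"delta":{"reasoning_content":"Let me think"}}]}'
print(parse_sse_line(sample))  # -> Let me think
```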
Output
4.4 Tool Calling
Mistral Small 4 supports native function calling. Enable it with `--tool-call-parser mistral`:
Example
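A sketch of a tool-calling request in the OpenAI function-calling schema (the `get_weather` tool, its parameters, and the prompt are made-up assumptions):

```python
# One hypothetical tool declared in OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "mistralai/Mistral-Small-4-119B-2603",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",
}
# With --tool-call-parser mistral, the server returns structured
# message.tool_calls entries instead of raw text.
```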
Output
4.5 Vision (Image Input)
Mistral Small 4 accepts image inputs alongside text:
Example
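A sketch of a mixed text-plus-image request body (the image URL is a placeholder assumption; base64 data URLs work the same way in this content format):

```python
# Multimodal message: a text part plus an image_url part.
payload = {
    "model": "mistralai/Mistral-Small-4-119B-2603",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cat.png"}},  # placeholder
        ],
    }],
}
```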
Output
5. Benchmarks
5.1 Accuracy Benchmarks
GSM8K
Command
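One way to run this, assuming a server is already up on the default port, is SGLang's built-in few-shot GSM8K utility (question count and parallelism here are arbitrary choices; flags can vary across SGLang versions):

```shell
# Few-shot GSM8K accuracy against the running server.
python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --parallel 64
```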
Output
MMLU
Command
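One possible invocation, using the MMLU benchmark script from the SGLang repository (the data URL is the standard Hendrycks MMLU archive; script path and flags may differ by SGLang version):

```shell
# Download the MMLU data, then run the SGLang MMLU benchmark script.
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar
tar -xf data.tar
python3 benchmark/mmlu/bench_sglang.py --nsub 10
```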
Output
5.2 Speed Benchmarks
Latency (Low Concurrency)
Command
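A latency-oriented run can be sketched with `sglang.bench_serving` at concurrency 1 (input/output lengths and prompt count are arbitrary assumptions; flag names can vary across SGLang versions):

```shell
# Low-concurrency latency measurement against the running server.
python3 -m sglang.bench_serving --backend sglang \
  --dataset-name random --random-input-len 1024 --random-output-len 512 \
  --num-prompts 16 --max-concurrency 1
```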
Output
Throughput (High Concurrency)
Command
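The same tool can sketch a throughput run by raising the concurrency and prompt count (the values below are arbitrary assumptions; flag names can vary across SGLang versions):

```shell
# High-concurrency throughput measurement against the running server.
python3 -m sglang.bench_serving --backend sglang \
  --dataset-name random --random-input-len 1024 --random-output-len 512 \
  --num-prompts 1000 --max-concurrency 128
```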
Output
