1. Model Introduction
MiniMax-M2.5 is a powerful language model developed by MiniMax, built for real-world productivity with state-of-the-art performance across coding, reasoning, agentic tasks, and tool use. As the latest iteration in the MiniMax model series, MiniMax-M2.5 achieves comprehensive enhancements across multiple domains. Details are as follows:
- Superior coding performance: Achieves 79.7 on Droid and 76.1 on OpenCode, surpassing Opus 4.6 (78.9 and 75.9 respectively). Strong results on SWE-bench Verified, SWE-bench Multilingual, SWE-bench-pro, and Multi-SWE-bench.
- Advanced reasoning: Demonstrates strong performance on AIME25 and other reasoning benchmarks, with robust tool use during inference.
- More capable agents: Excels in agentic tasks including web browsing (BrowseComp, Wide Search), information retrieval (RISE), and complex tool use scenarios (Terminal Bench 2, MEWC, Finance Modeling).
- Real-world productivity: Designed for production-grade workloads with strong performance on practical coding, data analysis, and multi-step reasoning tasks.
2. SGLang Installation
SGLang offers multiple installation methods. Choose the one best suited to your hardware platform and requirements, following the official SGLang installation guide. For AMD MI300X/MI325X/MI355X GPUs:
Command
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, deployment strategy, and feature capabilities.
3.2 Configuration Tips
Key Parameters:

| Parameter | Description | Recommended Value |
|---|---|---|
| --tool-call-parser | Tool call parser for function calling support | minimax-m2 |
| --reasoning-parser | Reasoning parser for thinking mode | minimax-append-think |
| --trust-remote-code | Required for MiniMax model loading | Always enabled |
| --mem-fraction-static | Static memory fraction for the KV cache | 0.85 |
| --tp | Tensor parallelism size | 2 (2-GPU), 4 (4-GPU), or 8 (8-GPU) |
| --ep | Expert parallelism size | 8 (NVIDIA 8-GPU) or EP=TP (AMD) |
| --kv-cache-dtype | KV cache data type (AMD only) | fp8_e4m3 |
| --attention-backend | Attention backend (AMD only) | triton |
NVIDIA GPUs:
- 4-GPU deployment: Requires 4× high-memory GPUs (e.g., H200, B200, A100, H100) with TP=4
- 8-GPU deployment: Requires 8× GPUs (e.g., H200, B200, A100, H100) with TP=8 and EP=8

AMD GPUs:
- 2-GPU deployment: Requires 2× high-memory GPUs (e.g., MI300X, MI325X, MI355X) with TP=2, EP=2
- 4-GPU deployment: Requires 4× GPUs (e.g., MI300X, MI325X, MI355X) with TP=4, EP=4
- 8-GPU deployment: Requires 8× GPUs (e.g., MI300X, MI325X, MI355X) with TP=8, EP=8
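The parallelism rules above reduce to a simple mapping; a minimal sketch (the helper name and dict layout are illustrative, not part of SGLang):

```python
def parallelism_flags(vendor: str, num_gpus: int) -> dict:
    """Return the --tp/--ep values suggested above.

    Illustrative helper only; SGLang takes these as CLI flags.
    """
    if vendor == "nvidia":
        if num_gpus not in (4, 8):
            raise ValueError("NVIDIA deployments above use 4 or 8 GPUs")
        # EP is only recommended for the 8-GPU NVIDIA layout.
        return {"tp": num_gpus, "ep": 8} if num_gpus == 8 else {"tp": num_gpus}
    if vendor == "amd":
        if num_gpus not in (2, 4, 8):
            raise ValueError("AMD deployments above use 2, 4, or 8 GPUs")
        # On AMD, EP always matches TP.
        return {"tp": num_gpus, "ep": num_gpus}
    raise ValueError(f"unknown vendor: {vendor}")
```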
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:
Testing Deployment: After startup, you can test the SGLang OpenAI-compatible API with the following command:
Command
Example
Output
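The same check can be done from Python against the OpenAI-compatible endpoint; a minimal sketch using only the standard library, assuming the server listens on localhost port 30000 (SGLang's default; adjust to your --port):

```python
import json
from urllib import request

def build_chat_request(prompt: str, model: str = "MiniMax-M2.5") -> dict:
    # Minimal OpenAI-style chat completion payload.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def send(payload: dict, base_url: str = "http://localhost:30000/v1") -> dict:
    # POST to the server's /chat/completions route and decode the JSON reply.
    req = request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```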
4.2 Advanced Usage
4.2.1 Reasoning Parser
MiniMax-M2.5 supports Thinking mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
Command
With minimax-append-think, the thinking content is wrapped in <think>...</think> tags within the content field. You can parse these tags on the client side to separate the thinking and content sections:
Example
Output
The minimax-append-think reasoning parser embeds the thinking process in <think>...</think> tags within the content field. The code above parses these tags in real time to display thinking and content separately.
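The client-side split described above can be sketched as follows (the helper name is illustrative):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(content: str) -> tuple[str, str]:
    """Split a response's content field into (thinking, answer).

    Illustrative helper: with the minimax-append-think parser, the
    reasoning arrives inline as <think>...</think> in the content field.
    """
    thinking = "\n".join(m.strip() for m in THINK_RE.findall(content))
    answer = THINK_RE.sub("", content).strip()
    return thinking, answer
```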
4.2.2 Tool Calling
MiniMax-M2.5 supports tool calling capabilities. Enable the tool call parser:
Command
Example
Output
- Tool calls are returned in message.tool_calls with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
Example
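The execute-and-reply loop in the bullets above can be sketched as follows; the get_weather tool and TOOLS registry are hypothetical, and tool_calls mirrors the OpenAI-style structure (id, function.name, function.arguments):

```python
import json

# Hypothetical local tool registry; get_weather is an illustrative function.
TOOLS = {"get_weather": lambda city: f"Sunny in {city}"}

def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Execute each returned tool call and build the follow-up messages."""
    messages = []
    for call in tool_calls:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        result = fn(**args)
        # "tool"-role messages carry results back for the next turn.
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": str(result),
        })
    return messages
```

Appending these messages to the conversation and re-sending the request lets the model continue with the tool results in context.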
5. Benchmark
This section uses industry-standard configurations for comparable benchmark results.
5.1 Speed Benchmark
Test Environment:
- Hardware: NVIDIA B200 GPU (8x)
- Model: MiniMax-M2.5
- Tensor Parallelism: 8
- Expert Parallelism: 8
- sglang version: 0.5.8
5.1.1 Standard Scenario Benchmark
- Model Deployment Command:
Command
5.1.1.1 Low Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.1.1.2 Medium Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.1.1.3 High Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.1.2 Summarization Scenario Benchmark
- Model Deployment Command:
Command
5.1.2.1 Low Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.1.2.2 Medium Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.1.2.3 High Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.1.3 H100 Benchmark
Test Environment:
- Hardware: NVIDIA H100 80GB HBM3 GPU (8x)
- Model: MiniMax-M2.5
- Tensor Parallelism: 8
- Expert Parallelism: 8
- sglang version: 0.5.9
- Model Deployment Command:
Command
5.1.3.1 Low Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.1.3.2 Medium Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.1.3.3 High Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Benchmark Command:
Command
- Test Results:
Output
5.2.2 MMLU Benchmark
- Benchmark Command:
Command
- Test Results:
Output
