Low Latency
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| MiniMax-M2.5 | Atlas 800I A3 | 8 | PD Mixed | 128K+1K (90% prefix cache hit rate) | 20ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 20ms | W8A8 INT8 | Optimal Configuration |
High Throughput
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| MiniMax-M2.5 | Atlas 800I A3 | 16 | PD Disaggregation | 128K+1K (90% prefix cache hit rate) | 50ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 16 | PD Disaggregation | 64K+1K (90% prefix cache hit rate) | 50ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 4 | PD Mixed | 32K+1K | 50ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 4 | PD Mixed | 64K+1K (90% prefix cache hit rate) | 50ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 8 | PD Mixed | 128K+1K (90% prefix cache hit rate) | 50ms | W8A8 INT8 | Optimal Configuration |
| MiniMax-M2.5 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W8A8 INT8 | Optimal Configuration |
Optimal Configuration
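The configuration titles in this section appear to pack the same fields as the metadata line under each title. The sketch below decodes a title into those fields. The pattern is inferred from the titles on this page and is not a documented format; in particular, treating a token such as `1P1D` as the marker for PD Disaggregation, and `parse_title` itself, are assumptions made for illustration.

```python
import re

# Assumed title pattern, inferred from the titles and metadata lines on this page.
TITLE_RE = re.compile(
    r"^(?P<model>\S+) (?P<quant>W\d+A\d+) (?:(?P<pd>\d+P\d+D) )?(?P<cards>\d+)P "
    r"IN(?P<input>\d+K\d*) OUT(?P<output>\d+K\d*)"
    r"(?: PREFIX(?P<prefix>\d+))? (?P<tpot>\d+)ms$"
)


def _size(text):
    """Turn '128K' into '128K' and '3K5' into '3.5K'."""
    whole, frac = text.split("K")
    return f"{whole}.{frac}K" if frac else f"{whole}K"


def parse_title(title):
    """Decode a configuration title into the fields shown in its metadata line."""
    m = TITLE_RE.match(title)
    if m is None:
        raise ValueError(f"unrecognized title: {title!r}")
    return {
        "model": m["model"],
        "quantization": m["quant"],
        # Assumption: an xPyD token means PD Disaggregation, otherwise PD Mixed.
        "deploy_mode": "PD Disaggregation" if m["pd"] else "PD Mixed",
        "cards": int(m["cards"]),
        "dataset": f"{_size(m['input'])}+{_size(m['output'])}",
        "prefix_hit_pct": int(m["prefix"]) if m["prefix"] else None,
        "tpot_ms": int(m["tpot"]),
    }


if __name__ == "__main__":
    print(parse_title("MiniMax-M2.5 W8A8 1P1D 16P IN128K OUT1K PREFIX90 50ms"))
    print(parse_title("MiniMax-M2.5 W8A8 8P IN3K5 OUT1K5 20ms"))
```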
MiniMax-M2.5 W8A8 1P1D 16P IN128K OUT1K PREFIX90 50ms
Model: MiniMax-M2.5; Hardware: Atlas 800I A3; Cards: 16; Deploy Mode: PD Disaggregation; Quantization: W8A8 INT8; Dataset: 128K+1K (90% prefix cache hit rate); TPOT: 50ms
Model Deployment
Command
Command
Benchmark
We benchmarked this configuration on the generated-shared-prefix dataset with a 90% prefix cache hit rate (repeat_rate = 0.9):
--gsp-system-prompt-len 117964: the shared prefix portion, int(131072 * 0.9).
--gsp-question-len 13107: the unique per-request suffix, int(131072 * (1 - 0.9)).
--gsp-num-groups 1: keeps all requests in one prefix group for maximum cache reuse.
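The two length values follow directly from the total input length and the repeat rate. A minimal sketch of that arithmetic, assuming a hypothetical helper named `gsp_lengths` (it is not part of any benchmark tool, and only the flag names are taken from this page):

```python
def gsp_lengths(input_len, repeat_rate):
    """Split a total input length into (shared prefix, unique suffix) token counts.

    Both products are truncated with int(), matching the formulas quoted above.
    """
    system_prompt_len = int(input_len * repeat_rate)
    question_len = int(input_len * (1 - repeat_rate))
    return system_prompt_len, question_len


if __name__ == "__main__":
    prefix_len, suffix_len = gsp_lengths(131072, 0.9)
    # Print the values in the form of the flags named on this page.
    print(f"--gsp-system-prompt-len {prefix_len} --gsp-question-len {suffix_len}")
```

With `input_len = 65536` the same function gives 58982 and 6553. Because both products are truncated, the two parts add up to one token less than the nominal input length.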
Command
MiniMax-M2.5 W8A8 1P1D 16P IN64K OUT1K PREFIX90 50ms
Model: MiniMax-M2.5; Hardware: Atlas 800I A3; Cards: 16; Deploy Mode: PD Disaggregation; Quantization: W8A8 INT8; Dataset: 64K+1K (90% prefix cache hit rate); TPOT: 50ms
Model Deployment
Command
Command
Benchmark
We benchmarked this configuration on the generated-shared-prefix dataset with a 90% prefix cache hit rate (repeat_rate = 0.9):
--gsp-system-prompt-len 58982: the shared prefix portion, int(65536 * 0.9).
--gsp-question-len 6553: the unique per-request suffix, int(65536 * (1 - 0.9)).
--gsp-num-groups 1: keeps all requests in one prefix group for maximum cache reuse.
Command
MiniMax-M2.5 W8A8 4P IN32K OUT1K 50ms
Model: MiniMax-M2.5; Hardware: Atlas 800I A3; Cards: 4; Deploy Mode: PD Mixed; Quantization: W8A8 INT8; Dataset: 32K+1K; TPOT: 50ms
Model Deployment
Command
Benchmark
We benchmarked this configuration on the RANDOM dataset.
Command
MiniMax-M2.5 W8A8 4P IN64K OUT1K PREFIX90 50ms
Model: MiniMax-M2.5; Hardware: Atlas 800I A3; Cards: 4; Deploy Mode: PD Mixed; Quantization: W8A8 INT8; Dataset: 64K+1K (90% prefix cache hit rate); TPOT: 50ms
Model Deployment
Command
Benchmark
We benchmarked this configuration on the generated-shared-prefix dataset with a 90% prefix cache hit rate (repeat_rate = 0.9):
--gsp-system-prompt-len 58982: the shared prefix portion, int(65536 * 0.9).
--gsp-question-len 6553: the unique per-request suffix, int(65536 * (1 - 0.9)).
--gsp-num-groups 1: keeps all requests in one prefix group for maximum cache reuse.
Command
MiniMax-M2.5 W8A8 8P IN128K OUT1K PREFIX90 20ms
Model: MiniMax-M2.5; Hardware: Atlas 800I A3; Cards: 8; Deploy Mode: PD Mixed; Quantization: W8A8 INT8; Dataset: 128K+1K (90% prefix cache hit rate); TPOT: 20ms
Model Deployment
Command
Benchmark
We benchmarked this configuration on the generated-shared-prefix dataset with a 90% prefix cache hit rate (repeat_rate = 0.9):
--gsp-system-prompt-len 117964: the shared prefix portion, int(131072 * 0.9).
--gsp-question-len 13107: the unique per-request suffix, int(131072 * (1 - 0.9)).
--gsp-num-groups 1: keeps all requests in one prefix group for maximum cache reuse.
Command
MiniMax-M2.5 W8A8 8P IN128K OUT1K PREFIX90 50ms
Model: MiniMax-M2.5; Hardware: Atlas 800I A3; Cards: 8; Deploy Mode: PD Mixed; Quantization: W8A8 INT8; Dataset: 128K+1K (90% prefix cache hit rate); TPOT: 50ms
Model Deployment
Command
Benchmark
We benchmarked this configuration on the generated-shared-prefix dataset with a 90% prefix cache hit rate (repeat_rate = 0.9):
--gsp-system-prompt-len 117964: the shared prefix portion, int(131072 * 0.9).
--gsp-question-len 13107: the unique per-request suffix, int(131072 * (1 - 0.9)).
--gsp-num-groups 1: keeps all requests in one prefix group for maximum cache reuse.
Command
MiniMax-M2.5 W8A8 8P IN3K5 OUT1K5 20ms
Model: MiniMax-M2.5; Hardware: Atlas 800I A3; Cards: 8; Deploy Mode: PD Mixed; Quantization: W8A8 INT8; Dataset: 3.5K+1.5K; TPOT: 20ms
Model Deployment
Command
Benchmark
We benchmarked this configuration on the RANDOM dataset.
Command
MiniMax-M2.5 W8A8 8P IN3K5 OUT1K5 50ms
Model: MiniMax-M2.5; Hardware: Atlas 800I A3; Cards: 8; Deploy Mode: PD Mixed; Quantization: W8A8 INT8; Dataset: 3.5K+1.5K; TPOT: 50ms
Model Deployment
Command
Benchmark
We benchmarked this configuration on the RANDOM dataset.
Command
