1. Model Introduction
Ministral 3 14B, the largest model in the Ministral 3 family, is a powerful and efficient language model with vision capabilities, offering frontier performance comparable to its larger Mistral Small 3.2 24B counterpart.

The Ministral 3 14B Instruct model offers the following capabilities:
- Vision: Analyzes images and provides insights based on visual content, in addition to text.
- Multilingual: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, and Arabic.
- System Prompt: Maintains strong adherence and support for system prompts.
- Agentic: Offers best-in-class agentic capabilities with native function calling and JSON output.
- Edge-Optimized: Delivers best-in-class performance at a small scale and is deployable anywhere.
- Apache 2.0 License: Open-source license allowing usage and modification for both commercial and non-commercial purposes.
- Large Context Window: Supports a 256k context window.

For further details, please refer to the official documentation.

2. SGLang Installation
Please refer to the official SGLang installation guide for installation instructions.

3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.

3.1 Basic Configuration
Select the deployment command that matches your hardware platform, model variant, deployment strategy, and thinking capabilities; the configurations below cover the common combinations.

3.2 Configuration Tips
- Context length vs. memory: Ministral 3 advertises a long context window; if you are memory-constrained, start by lowering --context-length (for example, 32768) and increase it once things are stable.
- Pre-installation steps: Run the following steps after launching the Docker container.
Command
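As an illustration of such pre-installation steps, the commands inside the container might look like the following (the package names are assumptions, not confirmed by this guide; adjust to what your model actually requires):

```shell
# Run inside the SGLang container after it starts. (Assumed packages:
# mistral_common provides Mistral's official tokenizer utilities; upgrading
# pip first avoids dependency-resolver issues on older base images.)
pip install --upgrade pip
pip install --upgrade mistral_common
```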
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:

4.2 Advanced Usage
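A minimal request sketch against SGLang's OpenAI-compatible endpoint, assuming the server is running on localhost:30000 (host, port, and payload values are assumptions):

```shell
# Send a chat completion request to the running server.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Ministral-3-14B-Instruct-2512",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Summarize the Apache 2.0 license in one sentence."}
        ],
        "max_tokens": 128
      }'
```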
4.2.1 Launch the Docker container
Command
Command
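A sketch of the container launch, assuming the official lmsysorg/sglang image on an NVIDIA host (image tag, port, and cache path are assumptions; ROCm hosts such as MI300X need the corresponding ROCm image):

```shell
# Pull the SGLang image (tag is an assumption; pin a specific release in production).
docker pull lmsysorg/sglang:latest

# Start an interactive container with GPU access, a shared Hugging Face
# cache, and the API port exposed.
docker run -it --gpus all \
  --ipc=host \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  bash
```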
4.2.2 Launch the server
Command
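A minimal server-launch sketch, assuming a single-GPU deployment; flag names follow recent SGLang releases (verify with --help), and --context-length is lowered to 32768 per the tip in Section 3.2 (raise it toward 256k if memory allows):

```shell
# Launch the OpenAI-compatible SGLang server for Ministral 3 14B.
python3 -m sglang.launch_server \
  --model-path mistralai/Ministral-3-14B-Instruct-2512 \
  --tp 1 \
  --host 0.0.0.0 \
  --port 30000 \
  --context-length 32768
```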
5. Benchmark
This section uses industry-standard configurations for comparable benchmark results.

5.1 Speed Benchmark
Test Environment:
- Hardware: MI300X GPU (8x)
- Model: mistralai/Ministral-3-14B-Instruct-2512
- Tensor Parallelism: 1
- SGLang Version: 0.5.7
- Model Deployment Command:
Command
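Given the stated test environment (TP=1, SGLang 0.5.7), the deployment command was presumably along these lines (a sketch only; the exact flags used in the benchmark are not preserved in this guide):

```shell
# Serve the benchmark model on a single MI300X GPU (tensor parallelism 1).
python3 -m sglang.launch_server \
  --model-path mistralai/Ministral-3-14B-Instruct-2512 \
  --tp 1 \
  --host 0.0.0.0 \
  --port 30000
```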
Low Concurrency
- Benchmark Command:
Command
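A hedged sketch of a low-concurrency run using SGLang's built-in serving benchmark (sequence lengths and prompt counts are assumptions; flag names may differ across versions, so check python3 -m sglang.bench_serving --help):

```shell
# Low-concurrency latency measurement: one in-flight request at a time.
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 16 \
  --max-concurrency 1
```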
- Test Results:
Output
Medium Concurrency
- Benchmark Command:
Command
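A medium-concurrency sketch with the same serving benchmark (the concurrency and prompt counts are illustrative assumptions; verify flag names with --help):

```shell
# Medium-concurrency throughput measurement: 16 concurrent requests.
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 128 \
  --max-concurrency 16
```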
- Test Results:
Output
High Concurrency
- Benchmark Command:
Command
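A high-concurrency sketch, again with illustrative (assumed) concurrency and prompt counts:

```shell
# High-concurrency saturation measurement: 128 concurrent requests.
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 512 \
  --max-concurrency 128
```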
- Test Results:
Output
5.2 Accuracy Benchmark
This section documents model accuracy on standard benchmarks.

5.2.1 GSM8K Benchmark
- Benchmark Command
Command
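SGLang ships a few-shot GSM8K test script that could drive this evaluation; a sketch, assuming the server is already running on the default port (module path, flag names, and question count per recent SGLang versions; they are not confirmed by this guide):

```shell
# Evaluate 5-shot GSM8K accuracy against the running server.
python3 -m sglang.test.few_shot_gsm8k \
  --num-questions 200 \
  --num-shots 5 \
  --parallel 16
```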
Output
