Deployment

For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.
Command
pip install -U uv
uv venv --python 3.12 && source .venv/bin/activate

# MiniMax-M3 ships in SGLang PR #27944, not yet in a tagged release — install from
# the PR head. The serving runtime is in the base dependencies, so no extra is needed:
git clone https://github.com/sgl-project/sglang.git
cd sglang
git fetch origin pull/27944/head && git checkout FETCH_HEAD
uv pip install -e python
Then run the Python output of the command panel below in that environment. The Docker tab is simpler — its image bundles the CUDA-13 runtime and the #27944 code. Once PR #27944 is merged and released, uv pip install sglang will pull M3 support directly.
Pick your hardware + recipe to generate the launch command.

Playground

The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing.

1. Model Introduction

MiniMax-M3 is MiniMax’s native-multimodal Mixture-of-Experts reasoning model: ~428B total parameters with ~23B activated per token (128 experts, 4 active per token), 60 layers, and a 1M-token context over text, image, and video. Its defining feature is MiniMax Sparse Attention (MSA) — a block-sparse “lightning indexer” attention that keeps long-context cost low (MiniMax reports ~9× prefill / ~15× decode speedup over M2 at 1M context). This page serves the MXFP8 variant (MiniMaxAI/MiniMax-M3-MXFP8, ~440 GB) on NVIDIA Blackwell and AMD Instinct; on NVIDIA Hopper (H200), use the full-precision bfloat16 build MiniMaxAI/MiniMax-M3 (§2.4). Released under the MiniMax Community License. Key characteristics as served by SGLang:
  • Multimodal (vision + text): accepts interleaved text and images through the OpenAI-compatible chat API (loaded as MiniMaxM3SparseForConditionalGeneration). Image input via URL and base64 is validated; video input has not been tested here.
  • Reasoning model: emits its chain of thought wrapped in <mm:think>...</mm:think>. Always launch with --reasoning-parser auto — it auto-detects the right parser from the chat template, and SGLang then strips the tags and returns the trace separately in message.reasoning_content.
  • Native tool calling: a custom namespace-token XML format, parsed into standard OpenAI tool_calls. Always launch with --tool-call-parser auto — it auto-detects the right parser from the chat template. Single, parallel, and nested (object / array) arguments are supported.
  • Sparse attention: most layers use M3’s “lightning indexer” block-sparse attention (top-k 128-token blocks), which keeps decode cost roughly flat in context length. On Blackwell, MiniMax’s open-source MSA kernel accelerates this path further (§2.1).
  • MXFP8 quantization across vendors: the MXFP8 MoE weights run natively on NVIDIA Blackwell (B200 / B300 / GB200 / GB300) and on AMD Instinct MI350X/MI355X (gfx950 / CDNA4), both of which have hardware MX-scaled matmul. On AMD MI300X/MI325X (gfx942 / CDNA3) — no hardware MX — SGLang converts the weights to block-fp8 [128,128] at load and serves them on the tuned ROCm kernels (§2.3). The vision tower stays unquantized.
Recommended generation (from the model card): temperature 1.0, top_p 0.95, top_k 40 — SGLang applies these automatically from the model’s generation_config.json. Resources: HuggingFace · MSA kernel
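The server applies the recommended values on its own, so the following is only a sketch for clients that want to pin them explicitly per request. The helper name sampling_kwargs is hypothetical, and passing top_k through extra_body assumes the server accepts it as an extra request field (top_k is not a named argument of the OpenAI SDK).

```python
# Recommended sampling values quoted from the model card.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40}


def sampling_kwargs(overrides=None):
    """Build keyword arguments for client.chat.completions.create() (hypothetical helper)."""
    params = {**RECOMMENDED_SAMPLING, **(overrides or {})}
    # top_k is not a named parameter of the OpenAI SDK, so it travels in extra_body.
    # Assumption: the server accepts top_k as an extra request field.
    top_k = params.pop("top_k")
    return {**params, "extra_body": {"top_k": top_k}}
```

A request would then pass `**sampling_kwargs()` next to `model` and `messages`; `sampling_kwargs({"temperature": 0.2})` overrides a single value and keeps the rest.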

2. Configuration Tips

2.1 MiniMax MSA kernel (Blackwell)

MiniMax MSA (fmha_sm100, MIT-licensed) is the recommended Blackwell kernel for M3’s main sparse-attention step: it is faster and more memory-efficient than the built-in Triton fallback. It is purely additive. Install it and the recipe above engages it automatically; without it, the same recipe still serves on the built-in Triton path. The swap is numerically equivalent (cosine ≥ 0.99999 vs Triton), decode stays CUDA-graph-capturable, prefill TTFT drops ~9–12% at 8K–64K context, and the MSA path survives memory configurations where the Triton path OOMs. Requirements (from the MSA README):
  • GPU: NVIDIA SM100 family — sm_100 (B200 / GB200) and sm_103 (B300 / GB300).
  • Toolchain: CUDA Toolkit with nvcc ≥ 12.x on PATH (or CUDA_HOME set) — the kernels are JIT-compiled at first import.
  • Python: ≥ 3.10; OS: Linux — works on both x86_64 and aarch64 (Grace, e.g. GB200 / GB300); the aarch64 build needs no source edits.
Command
# --recursive pulls the CUTLASS submodule required for JIT compilation
git clone --recursive https://github.com/MiniMax-AI/MSA.git msa
cd msa && pip install .
# Verify the SGLang gate (True -> MSA engaged on this device; False -> Triton fallback):
python -c "from sglang.srt.layers.attention.minimax_sparse_ops.msa import msa_available; print(msa_available())"
The first import JIT-compiles the kernels, which can take 30 s to a few minutes on a cold nvcc cache — this is normal, not a hang. Subsequent server starts hit the JIT cache.
Warm the JIT cache before a multi-GPU launch. On a cold cache, several tensor-parallel ranks racing to JIT-compile MSA’s plan kernel can leave one rank loading a half-linked module (AttributeError: Module has no function 'plan' at CUDA-graph capture). Run the gate-check python -c "..." (or any single-process fmha_sm100_plan call) once before launching the server — that compiles the kernel single-process, and every rank then hits the warm cache.
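One way to script that warm-up step is sketched below: it runs the same gate check in a single child process and reports whether MSA is engaged. The function names and the 30-minute timeout are illustrative assumptions, not part of SGLang or MSA.

```python
import subprocess
import sys

# The gate-check one-liner from the install step above.
GATE_CHECK = (
    "from sglang.srt.layers.attention.minimax_sparse_ops.msa import msa_available; "
    "print(msa_available())"
)


def parse_gate_output(stdout):
    """Return True only if the last non-empty line printed by the gate check is 'True'."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    return bool(lines) and lines[-1] == "True"


def warm_msa_cache(python=sys.executable, timeout_s=1800):
    """Run the gate check in one process so the JIT compile is not raced by TP ranks."""
    result = subprocess.run(
        [python, "-c", GATE_CHECK],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # arbitrary upper bound for a cold nvcc cache (assumption)
    )
    return result.returncode == 0 and parse_gate_output(result.stdout)
```

Calling `warm_msa_cache()` once before the multi-GPU launch returns True when MSA is engaged and False when the import fails or the server would fall back to Triton.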
The gate requires --attention-backend fa4 --page-size 128 (already part of the Blackwell recipe above; on current main these are also the auto-selected M3 defaults on SM100 GPUs). Force the Triton path at any time with the env var SGLANG_DISABLE_MSA=1. MSA is a Blackwell (SM100) kernel and does not apply to the AMD ROCm paths.
For multimodal (image) serving, keep the same text recipe above — --attention-backend fa4 --page-size 128 (MSA) is unchanged — and add --mm-attention-backend flashinfer_cudnn for the vision tower. The text and vision-tower attention backends are independent knobs; MSA only touches the language-model sparse attention, not image handling.

2.2 Memory and workload tuning

The NVIDIA Blackwell recipe is the validated single-node 4-GPU (--tp 4) config, which is also the GB200 / GB300 single-node ceiling. It runs identically on B200 (sm_100), B300 (sm_103), and GB300 (sm_103, aarch64). GB200 (sm_100, aarch64) is expected to work because both of its axes are validated above (sm_100 on B200, aarch64 on GB300), but it has not been benchmarked directly. The AMD recipes use 8 GPUs (--tp 8).
  • Memory: --mem-fraction-static trades KV-pool capacity against prefill activation headroom. 0.75 is the safe default on NVIDIA (0.80 on AMD). A higher value is fine at low concurrency but OOMs under high concurrency or long context, so raise it only for interactive single-stream serving.
  • Long context (32K+): keep --mem-fraction-static at the platform default and raise --chunked-prefill-size to 16384. Decode TPOT stays roughly flat in context length thanks to sparse attention; 1K–128K prompts are validated.
  • 8-GPU nodes: B200 / B300 hosts with 8 GPUs can use --tp 8 for more throughput / KV headroom; tp4 is documented as the NVIDIA cross-family common denominator.
  • Expert parallelism: to trade latency for throughput add --ep (see Expert Parallelism Deployment). On AMD, set --ep equal to --tp. Shared-experts fusion is automatically disabled when EP > 1; on AMD standard EP the server also disables --enable-aiter-allreduce-fusion automatically to preserve accuracy.
  • --trust-remote-code is required to load the MiniMax config / processor classes.
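The memory and long-context guidance above can be written down as a small lookup. This is a sketch only: tuning_flags is a hypothetical helper, and reading "32K+" as 32768 tokens or more is an assumption.

```python
# Platform defaults for --mem-fraction-static, as stated in the list above.
MEM_FRACTION_DEFAULT = {"nvidia": 0.75, "amd": 0.80}
LONG_CONTEXT_TOKENS = 32 * 1024  # assumption: "32K+" means >= 32768 tokens


def tuning_flags(platform, max_prompt_tokens):
    """Return launch flags that follow the guidance above (hypothetical helper)."""
    if platform not in MEM_FRACTION_DEFAULT:
        raise ValueError(f"unknown platform: {platform!r}")
    # Keep the platform default; the text advises against raising it for long context.
    flags = ["--mem-fraction-static", str(MEM_FRACTION_DEFAULT[platform])]
    if max_prompt_tokens >= LONG_CONTEXT_TOKENS:
        flags += ["--chunked-prefill-size", "16384"]
    return flags
```

The returned list can be appended to the launch command emitted by the Deploy panel; it deliberately never raises --mem-fraction-static, since the right higher value depends on the workload.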

2.3 AMD Instinct (ROCm)

MiniMax-M3 runs on AMD Instinct GPUs through two code paths, by architecture — both selected automatically; you still pass --quantization mxfp8 either way:
  • MI350X / MI355X (gfx950, CDNA4) has hardware MX-scaled matmul, so the MXFP8 weights are served natively. SGLang auto-detects the checkpoint, selects the Triton MiniMax-M3 MoE path with the packaged tuned MXFP8 configs, and enables AITER fused all-reduce for single-node tensor parallelism. The launch command is the NVIDIA recipe minus the Blackwell-only backend flags.
  • MI300X / MI325X (gfx942, CDNA3) has no hardware MX matmul. SGLang transparently converts the MXFP8 weights to block-fp8 [128,128] at load time, then serves them with the tuned ROCm block-fp8 kernels (--attention-backend aiter, --moe-runner-backend triton; the aiter runner also works and scores marginally higher). On a cold start the first generation can JIT-compile AITER configs and exceed the default warmup/HTTP timeout, so the recipe adds --watchdog-timeout 3600 --skip-server-warmup. The block-fp8 step adds only a small relative error over MXFP8’s native 1×32 scaling — negligible on GSM8K (see the benchmark card).
Select an MI300X/MI325X or MI350X/MI355X tile in the command panel above to get the exact launch command for each path.
The AMD recipes are validated end-to-end on text workloads — chat, reasoning separation, and tool calling. The vision tower was not exercised on ROCm; for image input on AMD, omit the Blackwell --mm-attention-backend flashinfer_cudnn flag and let the encoder use the ROCm default backend, and treat vision as unvalidated on that path.
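To make the granularity difference between MXFP8’s 1×32 scaling and block-fp8 [128,128] concrete, the sketch below counts how many scale factors each scheme keeps for one weight matrix. The 4096 × 4096 shape is an arbitrary illustration, not an M3 layer size, and the function is not SGLang code.

```python
import math


def scale_counts(rows, cols):
    """Count scale factors for one weight matrix under each scheme."""
    # MX 1x32: one shared scale per group of 32 consecutive elements in a row.
    mx_1x32 = rows * math.ceil(cols / 32)
    # Block-fp8 [128,128]: one scale per 128x128 tile.
    block_128 = math.ceil(rows / 128) * math.ceil(cols / 128)
    return mx_1x32, block_128


mx, block = scale_counts(4096, 4096)
print(mx, block, mx // block)
```

Each block-fp8 scale covers 512 times as many elements as an MX scale (128 × 128 versus 32), which is why the conversion can add a small rounding error relative to the native format.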

2.4 Serving on Hopper (H200) with the bf16 build

On NVIDIA, the MXFP8 kernels are Blackwell-only, so Hopper (H200) serves the full-precision bfloat16 build MiniMaxAI/MiniMax-M3. Select H200 + BF16 in the Deploy panel above for the exact command; it runs at --tp 8 (the bf16 weights need a full 8-GPU node). SGLang picks the right backends for Hopper automatically, so the recipe stays minimal:
  • MoE runner: Triton, auto-selected for bf16 weights.
  • Attention: FlashAttention-3 with page size 1. MSA (§2.1) is a Blackwell kernel, so M3’s sparse step runs on the built-in Triton path here.
  • CUDA graph: on, with full decode-graph capture.
Validated on 8×H200 — reasoning and tool-call auto-detection plus long-context generation. For prefill/decode disaggregation on Hopper, see §3.4.

3. Advanced Usage

3.1 Reasoning

Launch with --reasoning-parser auto (or toggle Reasoning Parser in the Parsers card of the Playground above). The <mm:think> trace then lands in message.reasoning_content, separate from the final answer in message.content — no client-side tag stripping needed.
Example
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "What is 15% of 240? Explain briefly."}],
    max_tokens=2048,
)

message = response.choices[0].message
print("=============== Reasoning ===============")
print(message.reasoning_content)
print("=============== Answer ==================")
print(message.content)
Output
=============== Reasoning ===============
15% of 240. 15% = 0.15. 240 * 0.15 = 36. Quick check: 10% is 24, 5% is 12, 24 + 12 = 36.
=============== Answer ==================
15% of 240 is **36**.
(10% of 240 = 24, and 5% of 240 = 12; 24 + 12 = 36.)
When streaming, the trace arrives on delta.reasoning_content and the answer on delta.content, so the two sections can be rendered separately in real time:
Example
response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "Solve step by step: what is 15% of 240?"}],
    max_tokens=2048,
    stream=True,
)

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if getattr(delta, "reasoning_content", None):
        print(delta.reasoning_content, end="", flush=True)  # thinking stream
    if delta.content:
        print(delta.content, end="", flush=True)            # answer stream
print()
Output
[delta.reasoning_content — thinking stream]
Let me solve this step by step.

15% of 240
= 0.15 × 240
= 36

Let me verify: 10% of 240 = 24, 5% of 240 = 12, so 15% = 24 + 12 = 36. ✓

[delta.content — answer stream]
# Solving 15% of 240
## Step 1: Convert the percentage to a decimal
15% = 15/100 = 0.15
## Step 2: Multiply by 240
0.15 × 240 = 36
## Answer
**15% of 240 = 36**
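If the two streams should be stored rather than printed, the same loop can accumulate them into separate strings. The sketch below is a minimal version; fake_chunk is a stand-in object that only mimics the attributes the loop reads, so the function can be exercised without a running server.

```python
from types import SimpleNamespace


def split_stream(chunks):
    """Collect a streamed response into (reasoning, answer) strings."""
    reasoning, answer = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        piece = getattr(delta, "reasoning_content", None)
        if piece:
            reasoning.append(piece)  # thinking stream
        if getattr(delta, "content", None):
            answer.append(delta.content)  # answer stream
    return "".join(reasoning), "".join(answer)


def fake_chunk(reasoning=None, content=None):
    """Stand-in for a streamed chunk, used only to exercise split_stream offline."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [
    fake_chunk(reasoning="0.15 * 240 "),
    fake_chunk(reasoning="= 36"),
    fake_chunk(content="36"),
]
print(split_stream(demo))
```

With a live server, pass the `response` iterator from the streaming example above to `split_stream` instead of `demo`.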

3.2 Tool Calling

Launch with --tool-call-parser auto (or toggle Tool Call Parser in the Parsers card of the Playground above) — it auto-detects M3’s tool-call parser from the chat template. M3 emits tool calls in a custom namespace-token XML format:
Raw model output
]<]minimax[>[<tool_call>
]<]minimax[>[<invoke name="get_weather">]<]minimax[>[<location>Beijing]<]minimax[>[</location>]<]minimax[>[</invoke>
]<]minimax[>[</tool_call>
The parser converts that into the standard OpenAI tool_calls structure:
Example
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(f"Tool: {call.function.name}")
        print(f"Args: {call.function.arguments}")
Output
Tool: get_weather
Args: {"location": "Beijing"}
Beyond a single flat call, the parser also supports:
  • Parallel calls — multiple <invoke> blocks inside the single <tool_call> wrapper, surfaced as multiple message.tool_calls entries.
  • Nested object arguments — an object-typed parameter is emitted as nested XML tags and reconstructed into a JSON object.
  • Array arguments — an array-typed parameter uses repeated <item> children and is reconstructed into a JSON list.
For example, a tool with object and array parameters round-trips cleanly:
Output
create_event {"title": "Design sync", "attendees": ["alice", "bob"], "location": {"room": "R2", "floor": 3}}
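A tool definition that could produce a call like this is sketched below. The schema is an assumption reconstructed from the printed arguments (the original definition is not shown here); the argument string is copied from the output above. Because call.function.arguments is a JSON string, it is decoded before use.

```python
import json

# Hypothetical schema reconstructed from the printed arguments above.
create_event_tool = {
    "type": "function",
    "function": {
        "name": "create_event",
        "description": "Create a calendar event",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "attendees": {"type": "array", "items": {"type": "string"}},
                "location": {
                    "type": "object",
                    "properties": {
                        "room": {"type": "string"},
                        "floor": {"type": "integer"},
                    },
                },
            },
            "required": ["title"],
        },
    },
}

# call.function.arguments arrives as a JSON string; decode it into Python values.
raw_arguments = (
    '{"title": "Design sync", "attendees": ["alice", "bob"], '
    '"location": {"room": "R2", "floor": 3}}'
)
args = json.loads(raw_arguments)
print(args["attendees"], args["location"]["floor"])
```

Pass `tools=[create_event_tool]` in the request; the array and object parameters come back as a Python list and dict after decoding.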
To return a tool result, append the assistant’s tool_calls turn plus a matching tool message and ask the model to continue — the follow-up answer may place text in reasoning_content as well as content, so print both.
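The message bookkeeping for that follow-up turn can be sketched as a pure function that builds the extended message list in the standard OpenAI shape. The helper name with_tool_results and the example result payload are hypothetical; the assistant message is taken as a plain dict, for example from message.model_dump().

```python
import json


def with_tool_results(messages, assistant_message, results):
    """Return a new message list extended with the tool-call turn and its results.

    results maps each tool_call id to a JSON-serializable result.
    """
    tool_calls = [
        {
            "id": call["id"],
            "type": "function",
            "function": {
                "name": call["function"]["name"],
                "arguments": call["function"]["arguments"],
            },
        }
        for call in assistant_message["tool_calls"]
    ]
    followup = list(messages)  # copy so the caller's list is not mutated
    followup.append(
        {
            "role": "assistant",
            "content": assistant_message.get("content"),
            "tool_calls": tool_calls,
        }
    )
    for call in tool_calls:
        followup.append(
            {
                "role": "tool",
                "tool_call_id": call["id"],  # must match the id of the call it answers
                "content": json.dumps(results[call["id"]]),
            }
        )
    return followup
```

Send the returned list as `messages` in a second `client.chat.completions.create(...)` call with the same `tools`, then print both `reasoning_content` and `content` from the reply.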

3.3 Multimodal (Vision) Input

Images go through the standard OpenAI image_url content type. The vision tower is always loaded; for image serving add --mm-attention-backend flashinfer_cudnn (the vision-tower backend) to the Blackwell deployment recipe — the text --attention-backend is unchanged (§2.1 note). On AMD, omit --mm-attention-backend and let the encoder use the ROCm default (vision is unvalidated on ROCm — §2.3).
Example
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png"
                    },
                },
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
Output
This image captures a striking and unusual urban scene on what appears to be a busy New York City street.

**Main Subject:**
A man stands on the rear bumper of a yellow taxi cab (an SUV-style cab, likely a Ford Escape hybrid), operating a full-sized ironing board set up across the back of the vehicle. He is wearing a bright yellow long-sleeved shirt and dark pants, and is actively ironing a blue garment, holding an iron in his right hand.

**Vehicles:**
- The yellow SUV taxi on the right is stationary, its rear hatch serving as the ironing platform.
- A second yellow taxi (a sedan) drives past on the left, captured with motion blur.

**Setting:**
Tall city buildings with classic urban architecture, an American flag, and white lane markings — a bustling downtown area, possibly Midtown Manhattan.
Notes:
  • If the server cannot fetch external URLs, embed the image as a base64 data:image/png;base64,... URI — SGLang decodes it server-side.
  • Multiple images per message are supported; add more image_url entries to the content list.
  • Reasoning and tool calling work the same way for multimodal requests — a vision prompt can still produce a <mm:think> trace and/or tool calls.
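For the base64 route mentioned in the notes, a minimal sketch of building the data URI and the matching content entry follows. The helper names are hypothetical; the MIME type has to match the actual image format.

```python
import base64


def to_data_uri(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URI."""
    payload = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{payload}"


def image_part(image_bytes, mime_type="image/png"):
    """Build one image_url content entry for a chat message."""
    return {"type": "image_url", "image_url": {"url": to_data_uri(image_bytes, mime_type)}}
```

Read a local file with `open(path, "rb").read()` and place `image_part(...)` in the `content` list in place of the URL entry; for several images, add one entry per image before the text entry.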

3.4 Prefill-Decode (PD) Disaggregation

PD disaggregation runs prefill and decode on separate SGLang servers linked by an RDMA KV-transfer fabric (mooncake or NIXL), fronted by the PD router. M3 needs one thing beyond what a dense-attention model needs: alongside the main KV cache, every sparse “lightning-indexer” layer keeps a K-only index buffer, and that buffer must reach the decode server too. Otherwise sparse attention reads stale state. SGLang transfers it alongside the main KV, reusing the same page mapping, so M3 disaggregates correctly with no extra flags. Supported topology (the released MiniMax-M3, whose sparse layers are all K-only):
  • Equal tensor parallelism — the prefill and decode servers run the same --tp.
  • Single pipeline stage — PP = 1 (the default).
  • mooncake or NIXL transfer backend over RDMA / InfiniBand.
Launch the prefill server, then the decode server. The decode server uses the same recipe with --disaggregation-mode decode and without --disaggregation-bootstrap-port.
On Blackwell the MXFP8 recipe — fa4, page size 128, deep_gemm MoE, and the MSA fast path (§2.1) — is auto-selected, so each role adds only the --disaggregation-* flags. This is the validated 2 × 4×B200 setup (TP4 prefill on node A, TP4 decode on node B); point --disaggregation-ib-device at your RDMA NIC(s).
Prefill server (node A)
sglang serve \
  --model-path MiniMaxAI/MiniMax-M3-MXFP8 \
  --trust-remote-code \
  --reasoning-parser auto \
  --tool-call-parser auto \
  --tp 4 \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend nixl \
  --disaggregation-ib-device mlx5_0 \
  --host 0.0.0.0 --port 30000 \
  --disaggregation-bootstrap-port 8998
Decode server (node B)
sglang serve \
  --model-path MiniMaxAI/MiniMax-M3-MXFP8 \
  --trust-remote-code \
  --reasoning-parser auto \
  --tool-call-parser auto \
  --tp 4 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend nixl \
  --disaggregation-ib-device mlx5_0 \
  --host 0.0.0.0 --port 30001
Then start the PD router, pointing it at the prefill server (its URL followed by the --disaggregation-bootstrap-port value) and at the decode endpoint:
PD router
python3 -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://<prefill-host>:30000 8998 \
  --decode http://<decode-host>:30001 \
  --policy round_robin \
  --host 0.0.0.0 --port 8000
Clients hit the router exactly like a single server — it splits each request across the two stages transparently:
Example
from openai import OpenAI

client = OpenAI(base_url="http://<router-host>:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
Output
2 + 2 = 4
Validation. PD disaggregation preserves output quality: the K-only sparse index transfers arrive intact, and disaggregated output matches non-disaggregated serving. GSM8K is scored with the same sgl-eval harness used by the benchmark card above (full 1319-question split, chat with --thinking); see that card for per-platform single-node accuracy.
  • 2 × 4×B200 (TP4+TP4, MXFP8, NIXL over InfiniBand) — output matches single-node serving. The 2-node PD serving benchmark (512-token input, 256-token output, 16 concurrent — a different workload from the card’s single-node random isl=2048 / osl=256 / conc=64 row, so the throughput figures are not directly comparable) measured mean TTFT 1.1 s and TPOT 16.6 ms (≈ 60 tok/s per stream, ≈ 2.3k tok/s aggregate).
  • 2 × 8×H200 (TP8+TP8, bf16, mooncake) — output matches single-node serving.