Deployment
Install SGLang
Install SGLang
For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.Then run the Python output of the command panel below in that environment. The Docker tab is simpler — its image bundles the CUDA-13 runtime and the #27944 code. Once PR #27944 is merged and released,
- Python (pip / uv)
- Docker
Command
uv pip install sglang will pull M3 support directly.Playground
The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing.1. Model Introduction
MiniMax-M3 is MiniMax’s native-multimodal Mixture-of-Experts reasoning model: ~428B total parameters with ~23B activated per token (128 experts, 4 active per token), 60 layers, and a 1M-token context over text, image, and video. Its defining feature is MiniMax Sparse Attention (MSA) — a block-sparse “lightning indexer” attention that keeps long-context cost low (MiniMax reports ~9× prefill / ~15× decode speedup over M2 at 1M context). This page serves the MXFP8 variant (MiniMaxAI/MiniMax-M3-MXFP8, ~440 GB) on NVIDIA Blackwell and AMD Instinct; on NVIDIA Hopper (H200), use the full-precision bfloat16 build MiniMaxAI/MiniMax-M3 (§2.4). Released under the MiniMax Community License.
Key characteristics as served by SGLang:
- Multimodal (vision + text): accepts interleaved text and images through the OpenAI-compatible chat API (loaded as
MiniMaxM3SparseForConditionalGeneration). Image input via URL and base64 is validated; video input has not been tested here. - Reasoning model: emits its chain of thought wrapped in
<mm:think>...</mm:think>. Always launch with--reasoning-parser auto— it auto-detects the right parser from the chat template, and SGLang then strips the tags and returns the trace separately inmessage.reasoning_content. - Native tool calling: a custom namespace-token XML format, parsed into standard OpenAI
tool_calls. Always launch with--tool-call-parser auto— it auto-detects the right parser from the chat template. Single, parallel, and nested (object / array) arguments are supported. - Sparse attention: most layers use M3’s “lightning indexer” block-sparse attention (top-k 128-token blocks), which keeps decode cost roughly flat in context length. On Blackwell, MiniMax’s open-source MSA kernel accelerates this path further (§2.1).
- MXFP8 quantization across vendors: the MXFP8 MoE weights run natively on NVIDIA Blackwell (B200 / B300 / GB200 / GB300) and on AMD Instinct MI350X/MI355X (gfx950 / CDNA4), both of which have hardware MX-scaled matmul. On AMD MI300X/MI325X (gfx942 / CDNA3) — no hardware MX — SGLang converts the weights to block-fp8
[128,128]at load and serves them on the tuned ROCm kernels (§2.3). The vision tower stays unquantized.
temperature 1.0, top_p 0.95, top_k 40 — SGLang applies these automatically from the model’s generation_config.json.
Resources: HuggingFace · MSA kernel
2. Configuration Tips
2.1 MSA sparse-attention fast path (recommended for Blackwell users)
MiniMax MSA (fmha_sm100, MIT-licensed) is the recommended Blackwell kernel for M3’s main sparse-attention step — faster and more memory-efficient than the built-in Triton fallback. It is purely additive — install it and the recipe above engages it automatically; without it the same recipe still serves on the built-in Triton path. The swap is numerically equivalent (cosine ≥ 0.99999 vs Triton), decode stays CUDA-graph-capturable, prefill TTFT drops ~9–12% at 8K–64K context, and the MSA path survives memory configurations where the Triton path OOMs.
Requirements (from the MSA README):
- GPU: NVIDIA SM100 family — sm_100 (B200 / GB200) and sm_103 (B300 / GB300).
- Toolchain: CUDA Toolkit with
nvcc≥ 12.x onPATH(orCUDA_HOMEset) — the kernels are JIT-compiled at first import. - Python: ≥ 3.10; OS: Linux — works on both x86_64 and aarch64 (Grace, e.g. GB200 / GB300); the aarch64 build needs no source edits.
Install MSA & verify the gate (Python)
Install MSA & verify the gate (Python)
Command
The first import JIT-compiles the kernels, which can take 30 s to a few minutes on a cold
nvcc cache — this is normal, not a hang. Subsequent server starts hit the JIT cache.--attention-backend fa4 --page-size 128 (already part of the Blackwell recipe above; on current main these are also the auto-selected M3 defaults on SM100 GPUs). Force the Triton path at any time with the env var SGLANG_DISABLE_MSA=1. MSA is a Blackwell (SM100) kernel and does not apply to the AMD ROCm paths.
For multimodal (image) serving, keep the same text recipe above —
--attention-backend fa4 --page-size 128 (MSA) is unchanged — and add --mm-attention-backend flashinfer_cudnn for the vision tower. The text and vision-tower attention backends are independent knobs; MSA only touches the language-model sparse attention, not image handling.2.2 Memory and workload tuning
The NVIDIA Blackwell recipe is the validated single-node 4-GPU (--tp 4) config, which is also the GB200 / GB300 single-node ceiling. It runs identically on B200 (sm_100), B300 (sm_103), and GB300 (sm_103, aarch64); GB200 (sm_100, aarch64) is inferred-supported — both of its axes are validated above — but not directly benchmarked. The AMD recipes use 8-GPU (--tp 8).
- Memory:
--mem-fraction-statictrades KV-pool capacity against prefill activation headroom —0.75is the safe default on NVIDIA (0.80on AMD). A higher value is fine at low concurrency but OOMs under high concurrency or long context, so raise it only for interactive single-stream serving. - Long context (32K+): keep
--mem-fraction-staticat the platform default and raise--chunked-prefill-sizeto16384. Decode TPOT stays roughly flat in context length thanks to sparse attention; 1K–128K prompts are validated. - 8-GPU nodes: B200 / B300 hosts with 8 GPUs can use
--tp 8for more throughput / KV headroom; tp4 is documented as the NVIDIA cross-family common denominator. - Expert parallelism: to trade latency for throughput add
--ep(see Expert Parallelism Deployment). On AMD, set--epequal to--tp. Shared-experts fusion is automatically disabled when EP > 1; on AMD standard EP the server also disables--enable-aiter-allreduce-fusionautomatically to preserve accuracy. --trust-remote-codeis required to load the MiniMax config / processor classes.
2.3 AMD Instinct (ROCm)
MiniMax-M3 runs on AMD Instinct GPUs through two code paths, by architecture — both selected automatically; you still pass--quantization mxfp8 either way:
- MI350X / MI355X (gfx950, CDNA4) has hardware MX-scaled matmul, so the MXFP8 weights are served natively. SGLang auto-detects the checkpoint, selects the Triton MiniMax-M3 MoE path with the packaged tuned MXFP8 configs, and enables AITER fused all-reduce for single-node tensor parallelism. The launch command is the NVIDIA recipe minus the Blackwell-only backend flags.
- MI300X / MI325X (gfx942, CDNA3) has no hardware MX matmul. SGLang transparently converts the MXFP8 weights to block-fp8
[128,128]at load time, then serves them with the tuned ROCm block-fp8 kernels (--attention-backend aiter,--moe-runner-backend triton; theaiterrunner also works and scores marginally higher). On a cold start the first generation can JIT-compile AITER configs and exceed the default warmup/HTTP timeout, so the recipe adds--watchdog-timeout 3600 --skip-server-warmup. The block-fp8 step adds only a small relative error over MXFP8’s native1×32scaling — negligible on GSM8K (see the benchmark card).
The AMD recipes are validated end-to-end on text workloads — chat, reasoning separation, and tool calling. The vision tower was not exercised on ROCm; for image input on AMD, omit the Blackwell
--mm-attention-backend flashinfer_cudnn flag and let the encoder use the ROCm default backend, and treat vision as unvalidated on that path.2.4 Serving on Hopper (H200) with the bf16 build
The MXFP8 kernels are Blackwell-only, so Hopper (H200) serves the full-precision bfloat16 buildMiniMaxAI/MiniMax-M3. Select H200 + BF16 in the Deploy panel above for the exact command — it runs at --tp 8 (the bf16 weights need a full 8-GPU node). SGLang picks the right backends for Hopper automatically, so the recipe stays minimal:
- MoE runner: Triton, auto-selected for bf16 weights.
- Attention: FlashAttention-3 with page size 1. MSA (§2.1) is a Blackwell kernel, so M3’s sparse step runs on the built-in Triton path here.
- CUDA graph: on, with full decode-graph capture.
3. Advanced Usage
3.1 Reasoning
Launch with--reasoning-parser auto (or toggle Reasoning Parser in the Parsers card of the Playground above). The <mm:think> trace then lands in message.reasoning_content, separate from the final answer in message.content — no client-side tag stripping needed.
Reasoning Example (Python)
Reasoning Example (Python)
Example
Example Output
Example Output
Output
delta.reasoning_content and the answer on delta.content, so the two sections can be rendered separately in real time:
Streaming Reasoning (Python)
Streaming Reasoning (Python)
Example
Output
3.2 Tool Calling
Launch with--tool-call-parser auto (or toggle Tool Call Parser in the Parsers card of the Playground above) — it auto-detects M3’s tool-call parser from the chat template. M3 emits tool calls in a custom namespace-token XML format:
Raw model output
tool_calls structure:
Tool Calling Example (Python)
Tool Calling Example (Python)
Example
Example Output
Example Output
Output
- Parallel calls — multiple
<invoke>blocks inside the single<tool_call>wrapper, surfaced as multiplemessage.tool_callsentries. - Nested object arguments — an
object-typed parameter is emitted as nested XML tags and reconstructed into a JSON object. - Array arguments — an
array-typed parameter uses repeated<item>children and is reconstructed into a JSON list.
Output
tool_calls turn plus a matching tool message and ask the model to continue — the follow-up answer may place text in reasoning_content as well as content, so print both.
3.3 Multimodal (Vision) Input
Images go through the standard OpenAIimage_url content type. The vision tower is always loaded; for image serving add --mm-attention-backend flashinfer_cudnn (the vision-tower backend) to the Blackwell deployment recipe — the text --attention-backend is unchanged (§2.1 note). On AMD, omit --mm-attention-backend and let the encoder use the ROCm default (vision is unvalidated on ROCm — §2.3).
Vision Example (Python)
Vision Example (Python)
Example
Output
- If the server cannot fetch external URLs, embed the image as a base64
data:image/png;base64,...URI — SGLang decodes it server-side. - Multiple images per message are supported; add more
image_urlentries to thecontentlist. - Reasoning and tool calling work the same way for multimodal requests — a vision prompt can still produce a
<mm:think>trace and/or tool calls.
3.4 Prefill-Decode (PD) Disaggregation
PD disaggregation runs prefill and decode on separate SGLang servers linked by an RDMA KV-transfer fabric (mooncake or NIXL), fronted by the PD router. M3 needs one thing beyond a dense model: alongside the main KV cache, every sparse “lightning-indexer” layer keeps a K-only index buffer, and that buffer must reach the decode server too — otherwise sparse attention reads stale state. SGLang transfers it alongside the main KV — reusing the same page mapping — so M3 disaggregates correctly with no extra flags. Supported topology (the released MiniMax-M3, whose sparse layers are all K-only):- Equal tensor parallelism — the prefill and decode servers run the same
--tp. - Single pipeline stage — PP = 1 (the default).
- mooncake or NIXL transfer backend over RDMA / InfiniBand.
--disaggregation-mode decode and no bootstrap port. Pick your hardware:
- Blackwell · MXFP8
- Hopper · bf16
On Blackwell the MXFP8 recipe — fa4, page size 128, deep_gemm MoE, and the MSA fast path (§2.1) — is auto-selected, so each role adds only the
--disaggregation-* flags. This is the validated 2 × 4×B200 setup (TP4 prefill on node A, TP4 decode on node B); point --disaggregation-ib-device at your RDMA NIC(s).Prefill server (node A)
Decode server (node B)
--disaggregation-bootstrap-port) and the decode endpoint:
PD router
PD Client Example (Python)
PD Client Example (Python)
Example
Output
--thinking); see that card for per-platform single-node accuracy.
- 2 × 4×B200 (TP4+TP4, MXFP8, NIXL over InfiniBand) — output matches single-node serving. The 2-node PD serving benchmark (512-token input, 256-token output, 16 concurrent — a different workload from the card’s single-node
randomisl=2048 / osl=256 / conc=64 row, so the throughput figures are not directly comparable) measured mean TTFT 1.1 s and TPOT 16.6 ms (≈ 60 tok/s per stream, ≈ 2.3k tok/s aggregate). - 2 × 8×H200 (TP8+TP8, bf16, mooncake) — output matches single-node serving.
