MiniMax-M3 - SGLang Documentation

Deployment

Install SGLang

For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.

Python (pip / uv)
Docker

Command

pip install -U uv
uv venv --python 3.12 && source .venv/bin/activate

# MiniMax-M3 ships in SGLang PR #27944, not yet in a tagged release — install from
# the PR head. The serving runtime is in the base dependencies, so no extra is needed:
git clone https://github.com/sgl-project/sglang.git
cd sglang
git fetch origin pull/27944/head && git checkout FETCH_HEAD
uv pip install -e python

Then run the Python output of the command panel below in that environment. The Docker tab is simpler — its image bundles the CUDA-13 runtime and the #27944 code. Once PR #27944 is merged and released, uv pip install sglang will pull M3 support directly.

Command

# Pull the M3 image the command panel selects for your platform, e.g.:
docker pull lmsysorg/sglang:dev-cu13-minimax-m3

The command panel below fills in the right tag per platform: dev-cu13-minimax-m3 (CUDA 13 — B300, GB200, GB300), dev-cu12-minimax-m3 (CUDA 12 — Hopper H200), or dev-minimax-m3 (default). On AMD Instinct it uses the matching ROCm image (MI300X/MI325X → aigmkt/minimax-m3-sglang-rocm700-mi30x, MI350X/MI355X → aigmkt/minimax-m3-sglang-rocm720-mi35x). For how to launch the image, see Install → Method 3: Using Docker, substituting the inner sglang serve ... with what the command generator produces.

These M3 dev images now bundle MiniMax’s MSA sparse-attention kernel (fmha_sm100), so Blackwell users get the recommended fast path automatically — no manual install needed (see §2.1). On a custom image without it, the same recipe still serves on the built-in Triton sparse path.

Pick your hardware + recipe to generate the launch command.

Playground

The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing.

1. Model Introduction

MiniMax-M3 is MiniMax’s native-multimodal Mixture-of-Experts reasoning model: ~428B total parameters with ~23B activated per token (128 experts, 4 active per token), 60 layers, and a 1M-token context over text, image, and video. Its defining feature is MiniMax Sparse Attention (MSA) — a block-sparse “lightning indexer” attention that keeps long-context cost low (MiniMax reports ~9× prefill / ~15× decode speedup over M2 at 1M context). This page serves the MXFP8 variant (MiniMaxAI/MiniMax-M3-MXFP8, ~440 GB) on NVIDIA Blackwell and AMD Instinct; on NVIDIA Hopper (H200), use the full-precision bfloat16 build MiniMaxAI/MiniMax-M3 (§2.4). Released under the MiniMax Community License. Key characteristics as served by SGLang:

Multimodal (vision + text): accepts interleaved text and images through the OpenAI-compatible chat API (loaded as MiniMaxM3SparseForConditionalGeneration). Image input via URL and base64 is validated; video input has not been tested here.
Reasoning model: emits its chain of thought wrapped in <mm:think>...</mm:think>. Always launch with --reasoning-parser auto — it auto-detects the right parser from the chat template, and SGLang then strips the tags and returns the trace separately in message.reasoning_content.
Native tool calling: a custom namespace-token XML format, parsed into standard OpenAI tool_calls. Always launch with --tool-call-parser auto — it auto-detects the right parser from the chat template. Single, parallel, and nested (object / array) arguments are supported.
Sparse attention: most layers use M3’s “lightning indexer” block-sparse attention (top-k 128-token blocks), which keeps decode cost roughly flat in context length. On Blackwell, MiniMax’s open-source MSA kernel accelerates this path further (§2.1).
MXFP8 quantization across vendors: the MXFP8 MoE weights run natively on NVIDIA Blackwell (B200 / B300 / GB200 / GB300) and on AMD Instinct MI350X/MI355X (gfx950 / CDNA4), both of which have hardware MX-scaled matmul. On AMD MI300X/MI325X (gfx942 / CDNA3) — no hardware MX — SGLang converts the weights to block-fp8 [128,128] at load and serves them on the tuned ROCm kernels (§2.3). The vision tower stays unquantized.

Recommended generation: the model’s generation_config.json sets temperature 1.0 / top_p 0.95, which SGLang applies automatically (the default --sampling-defaults model). The model card additionally suggests top_k 40, but that value is not in generation_config.json, so SGLang does not apply it by default. top_k is a per-request sampling parameter, not a launch flag; set it per call if you want it, e.g. extra_body={"top_k": 40} with the OpenAI client.

Resources: HuggingFace · MSA kernel
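
As a minimal sketch of the per-request override, the helper below builds keyword arguments for client.chat.completions.create and passes top_k through the OpenAI client's extra_body. The helper name build_chat_kwargs is illustrative and not part of SGLang or the OpenAI SDK. temperature and top_p are deliberately left unset so the server-side defaults from generation_config.json still apply.

```python
def build_chat_kwargs(prompt, top_k=40, model="MiniMaxAI/MiniMax-M3-MXFP8"):
    """Build kwargs for client.chat.completions.create (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # top_k is not a standard OpenAI field, so it travels in extra_body.
        "extra_body": {"top_k": top_k},
    }


kwargs = build_chat_kwargs("What is 15% of 240?")
print(kwargs["extra_body"])

# Usage against a running server (not executed here):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
#   response = client.chat.completions.create(**kwargs)
```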

2. Configuration Tips

2.1 MSA sparse-attention fast path (recommended for Blackwell users)

MiniMax MSA (fmha_sm100, MIT-licensed) is the recommended Blackwell kernel for M3’s main sparse-attention step: it is faster and more memory-efficient than the built-in Triton fallback. It ships pre-installed in the M3 dev image (lmsysorg/sglang:dev-minimax-m3, also published under the dev-cu13-minimax-m3 tag), so the Blackwell recipe above engages it automatically with no extra setup: import fmha_sm100 works out of the box and the kernels JIT-compile on first use. MSA is purely additive. On a custom image, install it (below) and the same recipe engages it automatically; without it, the recipe still serves on the built-in Triton path.

The swap is numerically equivalent (cosine ≥ 0.99999 vs Triton), decode stays CUDA-graph-capturable, prefill TTFT drops ~9–12% at 8K–64K context, and the MSA path survives memory configurations where the Triton path OOMs.

Requirements (from the MSA README):

GPU: NVIDIA SM100 family — sm_100 (B200 / GB200) and sm_103 (B300 / GB300).
Toolchain: CUDA Toolkit with nvcc ≥ 12.x on PATH (or CUDA_HOME set) — the kernels are JIT-compiled at first import.
Python: ≥ 3.10; OS: Linux — works on both x86_64 and aarch64 (Grace, e.g. GB200 / GB300); the aarch64 build needs no source edits.

Install MSA (only on a custom image) & verify the gate (Python)

The M3 Blackwell dev images above already bundle MSA, so you can skip straight to the gate check. The git clone / pip install steps are only needed on a custom image that doesn’t have fmha_sm100.

Command

# Only on a custom image: --recursive pulls the CUTLASS submodule required for JIT compilation
git clone --recursive https://github.com/MiniMax-AI/MSA.git msa
cd msa && pip install .
# Verify the SGLang gate (True -> MSA engaged on this device; False -> Triton fallback):
python -c "from sglang.srt.layers.attention.minimax_sparse_ops.msa import msa_available; print(msa_available())"

The first import JIT-compiles the kernels, which can take 30 s to a few minutes on a cold nvcc cache — this is normal, not a hang. Subsequent server starts hit the JIT cache.

Warm the JIT cache before a multi-GPU launch. On a cold cache, several tensor-parallel ranks racing to JIT-compile MSA’s plan kernel can leave one rank loading a half-linked module (AttributeError: Module has no function 'plan' at CUDA-graph capture). Run the gate-check python -c "..." (or any single-process fmha_sm100_plan call) once before launching the server — that compiles the kernel single-process, and every rank then hits the warm cache.
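
The warm-up step can be scripted so that it always runs before the server starts. The sketch below wraps the gate-check command from above in a single subprocess and reports whether it printed True. The function name warm_msa_cache and the 900-second timeout are illustrative choices, not SGLang settings, and the argv parameter exists only so the helper can be exercised on a machine without SGLang.

```python
import subprocess
import sys

# Gate-check snippet quoted from the command above.
GATE_CHECK = (
    "from sglang.srt.layers.attention.minimax_sparse_ops.msa import msa_available; "
    "print(msa_available())"
)


def warm_msa_cache(argv=None, timeout_s=900):
    """Run the gate check once in a single process; return True if it printed True.

    timeout_s is an arbitrary illustrative bound, not a documented value.
    """
    argv = argv or [sys.executable, "-c", GATE_CHECK]
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    if result.returncode != 0:
        return False
    lines = result.stdout.strip().splitlines()
    # Read only the last line, in case compilation logs earlier output.
    return bool(lines) and lines[-1].strip() == "True"
```

Calling warm_msa_cache() once before sglang serve compiles the kernels in one process. A False result means the server would run on the Triton fallback on that machine.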

The gate requires --attention-backend fa4 and a page size of 128, because MSA’s sparse blocks are 128 tokens. SGLang forces page_size to 128 automatically for the fa4 backend, including the combined --attention-backend fa4 form the M3 recipe uses (#28976), so --page-size 128 is omitted from the Blackwell cells below. To force the Triton path at any time, set the environment variable SGLANG_DISABLE_MSA=1. MSA is a Blackwell (SM100) kernel and does not apply to the AMD ROCm paths.

For multimodal (image) serving, keep the same text recipe above — --attention-backend fa4 (MSA) is unchanged — and add --mm-attention-backend flashinfer_cudnn for the vision tower. The text and vision-tower attention backends are independent knobs; MSA only touches the language-model sparse attention, not image handling.

2.2 Memory and workload tuning

The NVIDIA Blackwell recipes are validated single-node: B200 at --tp 8 and B300 / GB300 at --tp 4 (4-GPU is also the GB200 / GB300 single-node ceiling). GB200 (sm_100, aarch64) is supported by inference: its GPU and CPU architectures are each validated above (B200 is sm_100; GB300 is sm_103 on aarch64), but the combination has not been benchmarked directly. The AMD recipes use 8-GPU (--tp 8).

Memory: --mem-fraction-static reserves GPU memory for weights + KV pool; the rest is prefill activation headroom. The value scales with free memory per GPU (card capacity minus per-GPU weight), so it tracks the card more than the TP degree: 0.65 on B200 (180 GB — less headroom once weights are resident) and 0.75 on the larger-memory B300 / GB300 (0.80 on AMD). Lower TP packs more weight per GPU, so a tighter config needs a lower value — B200 needs 0.65 even at --tp 4. Raising it past the validated value is fine only for low-concurrency single-stream serving; it OOMs under high concurrency or long context.
Long context (32K+): keep --mem-fraction-static at the platform default and raise --chunked-prefill-size to 16384. Decode TPOT stays roughly flat in context length thanks to sparse attention; 1K–128K prompts are validated.
Scaling TP: B200 is documented at --tp 8; B300 / GB200 / GB300 at --tp 4 (the single-node cross-family common denominator). On an 8-GPU B300 host you can also raise to --tp 8 for more throughput / KV headroom.
Expert parallelism: to trade latency for throughput, add --ep (see Expert Parallelism Deployment). On AMD, set --ep equal to --tp. Shared-experts fusion is automatically disabled when EP > 1; with standard EP on AMD the server also disables --enable-aiter-allreduce-fusion automatically to preserve accuracy.
--trust-remote-code is required to load the MiniMax config / processor classes.
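
The tuning rules above can be summarised as a small lookup. The sketch below encodes only the validated values quoted in this section and returns the matching flags as an argv fragment. The table and the helper name tuning_flags are illustrative; the command panel remains the source of the full launch command.

```python
# Validated single-node settings quoted from the text above: (tp, mem_fraction_static).
VALIDATED = {
    "B200": (8, 0.65),
    "B300": (4, 0.75),
    "GB300": (4, 0.75),
    "AMD": (8, 0.80),
}


def tuning_flags(platform, long_context=False):
    """Return the tuning flags for a platform as an argv fragment (illustrative helper)."""
    tp, mem_fraction = VALIDATED[platform]
    flags = ["--tp", str(tp), "--mem-fraction-static", str(mem_fraction)]
    if long_context:
        # 32K+ prompts: keep the memory fraction, raise the prefill chunk size.
        flags += ["--chunked-prefill-size", "16384"]
    return flags


print(" ".join(tuning_flags("B200", long_context=True)))
```

GB200 is left out of the table because this section gives no benchmarked memory fraction for it.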

2.3 AMD Instinct (ROCm)

MiniMax-M3 runs on AMD Instinct GPUs through two code paths, selected automatically by GPU architecture. Pass --quantization mxfp8 on both:

MI350X / MI355X (gfx950, CDNA4) has hardware MX-scaled matmul, so the MXFP8 weights are served natively. SGLang auto-detects the checkpoint, selects the Triton MiniMax-M3 MoE path with the packaged tuned MXFP8 configs, and enables AITER fused all-reduce for single-node tensor parallelism. The launch command is the NVIDIA recipe minus the Blackwell-only backend flags.
MI300X / MI325X (gfx942, CDNA3) has no hardware MX matmul. SGLang transparently converts the MXFP8 weights to block-fp8 [128,128] at load time, then serves them with the tuned ROCm block-fp8 kernels (--attention-backend aiter, --moe-runner-backend triton; the aiter runner also works and scores marginally higher). On a cold start the first generation can JIT-compile AITER configs and exceed the default warmup/HTTP timeout, so the recipe adds --watchdog-timeout 3600 --skip-server-warmup. The block-fp8 step adds only a small relative error over MXFP8’s native 1×32 scaling — negligible on GSM8K (see the benchmark card).

Select an MI300X/MI325X or MI350X/MI355X tile in the command panel above to get the exact launch command for each path.
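
For scripted deployments, the difference between the two paths can be captured as a small helper. The sketch below returns only the flags this section names for each architecture. The function name amd_extra_flags is hypothetical, and the command panel remains the reference for the complete command.

```python
def amd_extra_flags(gfx_arch):
    """Flags the text above attributes to each AMD path (illustrative helper)."""
    flags = ["--quantization", "mxfp8"]  # passed on both paths
    if gfx_arch == "gfx942":  # MI300X / MI325X: block-fp8 conversion at load
        flags += [
            "--attention-backend", "aiter",
            "--moe-runner-backend", "triton",
            # Cold-start AITER JIT can exceed the default warmup/HTTP timeout.
            "--watchdog-timeout", "3600",
            "--skip-server-warmup",
        ]
    elif gfx_arch != "gfx950":  # MI350X / MI355X: no extra flags are named above
        raise ValueError(f"unsupported architecture: {gfx_arch}")
    return flags


print(amd_extra_flags("gfx942"))
```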

The AMD recipes are validated end-to-end on text workloads — chat, reasoning separation, and tool calling. The vision tower was not exercised on ROCm; for image input on AMD, omit the Blackwell --mm-attention-backend flashinfer_cudnn flag and let the encoder use the ROCm default backend, and treat vision as unvalidated on that path.

2.4 Serving on Hopper (H200) with the bf16 build

On NVIDIA, the MXFP8 kernels require Blackwell, so Hopper (H200) serves the full-precision bfloat16 build MiniMaxAI/MiniMax-M3. Select H200 + BF16 in the Deploy panel above for the exact command; it runs at --tp 8 (the bf16 weights need a full 8-GPU node). SGLang picks the right backends for Hopper automatically, so the recipe stays minimal:

MoE runner: Triton, auto-selected for bf16 weights.
Attention: FlashAttention-3 with page size 1. MSA (§2.1) is a Blackwell kernel, so M3’s sparse step runs on the built-in Triton path here.
CUDA graph: on, with full decode-graph capture.

High-concurrency throughput (optional). On Hopper the sparse prefill runs on the Triton path as a separate eager forward, which briefly stalls the in-flight decode batch under heavy concurrent load. Adding --enable-mixed-chunk --chunked-prefill-size 2048 merges the running decodes into the prefill step instead of preempting them. At high concurrency on 8×H200 this yields roughly 10% higher output throughput and ~10% lower median TPOT, with no change in accuracy. Leave it off for latency-sensitive low-concurrency serving.

Validated on 8×H200: reasoning and tool-call auto-detection plus long-context generation. For prefill/decode disaggregation on Hopper, see §3.4.

3. Advanced Usage

3.1 Reasoning

Launch with --reasoning-parser auto (or toggle Reasoning Parser in the Parsers card of the Playground above). The <mm:think> trace then lands in message.reasoning_content, separate from the final answer in message.content — no client-side tag stripping needed.

Reasoning Example (Python)

Example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "What is 15% of 240? Explain briefly."}],
    max_tokens=2048,
)

message = response.choices[0].message
print("=============== Reasoning ===============")
print(message.reasoning_content)
print("=============== Answer ==================")
print(message.content)

Example Output

Output

=============== Reasoning ===============
15% of 240. 15% = 0.15. 240 * 0.15 = 36. Quick check: 10% is 24, 5% is 12, 24 + 12 = 36.
=============== Answer ==================
15% of 240 is **36**.
(10% of 240 = 24, and 5% of 240 = 12; 24 + 12 = 36.)

When streaming, the trace arrives on delta.reasoning_content and the answer on delta.content, so the two sections can be rendered separately in real time:

Streaming Reasoning (Python)

Example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "Solve step by step: what is 15% of 240?"}],
    max_tokens=2048,
    stream=True,
)

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if getattr(delta, "reasoning_content", None):
        print(delta.reasoning_content, end="", flush=True)  # thinking stream
    if delta.content:
        print(delta.content, end="", flush=True)            # answer stream
print()

Example Output

Output

[delta.reasoning_content — thinking stream]
Let me solve this step by step.

15% of 240
= 0.15 × 240
= 36

Let me verify: 10% of 240 = 24, 5% of 240 = 12, so 15% = 24 + 12 = 36. ✓

[delta.content — answer stream]
# Solving 15% of 240
## Step 1: Convert the percentage to a decimal
15% = 15/100 = 0.15
## Step 2: Multiply by 240
0.15 × 240 = 36
## Answer
**15% of 240 = 36**
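
When the two streams need to be stored rather than printed, the same loop can accumulate them. The sketch below is one way to do that; collect_stream is an illustrative name, and the fake chunks are stand-ins built with SimpleNamespace so the logic can be checked without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate streamed deltas into (reasoning, answer) strings."""
    reasoning, answer = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # reasoning_content is not a standard OpenAI field, so read it defensively.
        text = getattr(delta, "reasoning_content", None)
        if text:
            reasoning.append(text)
        if getattr(delta, "content", None):
            answer.append(delta.content)
    return "".join(reasoning), "".join(answer)


def _chunk(**fields):
    """Build a stand-in for a streaming chunk (test double, not an SDK type)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(**fields))])


fake_stream = [
    _chunk(reasoning_content="10% is 24, ", content=None),
    _chunk(reasoning_content="5% is 12.", content=None),
    SimpleNamespace(choices=[]),  # a chunk with no choices is skipped
    _chunk(content="36"),
]
print(collect_stream(fake_stream))
```

With a live server, pass the streaming response object from the example above in place of fake_stream.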

3.2 Tool Calling

Launch with --tool-call-parser auto (or toggle Tool Call Parser in the Parsers card of the Playground above) — it auto-detects M3’s tool-call parser from the chat template. M3 emits tool calls in a custom namespace-token XML format:

Raw model output

]<]minimax[>[<tool_call>
]<]minimax[>[<invoke name="get_weather">]<]minimax[>[<location>Beijing]<]minimax[>[</location>]<]minimax[>[</invoke>
]<]minimax[>[</tool_call>

The parser converts that into the standard OpenAI tool_calls structure:

Tool Calling Example (Python)

Example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(f"Tool: {call.function.name}")
        print(f"Args: {call.function.arguments}")

Example Output

Output

Tool: get_weather
Args: {"location": "Beijing"}

Beyond a single flat call, the parser also supports:

Parallel calls — multiple <invoke> blocks inside the single <tool_call> wrapper, surfaced as multiple message.tool_calls entries.
Nested object arguments — an object-typed parameter is emitted as nested XML tags and reconstructed into a JSON object.
Array arguments — an array-typed parameter uses repeated <item> children and is reconstructed into a JSON list.

For example, a tool with object and array parameters round-trips cleanly:

Output

create_event {"title": "Design sync", "attendees": ["alice", "bob"], "location": {"room": "R2", "floor": 3}}
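
To make the mapping concrete, here is a toy converter for the raw format shown earlier. It is not SGLang's parser: it assumes the text becomes well-formed XML once the namespace token is stripped, and it keeps every leaf value as a string, whereas the real parser evidently restores JSON types such as the integer 3 above. The names NS_TOKEN, _to_value, and parse_tool_calls are illustrative.

```python
import json
import xml.etree.ElementTree as ET

NS_TOKEN = "]<]minimax[>["  # namespace token seen in the raw output above


def _to_value(elem):
    """Leaf -> text, repeated <item> children -> list, other children -> dict."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    if all(child.tag == "item" for child in children):
        return [_to_value(child) for child in children]
    return {child.tag: _to_value(child) for child in children}


def parse_tool_calls(raw):
    """Toy converter from M3-style raw text to OpenAI-like tool_calls dicts."""
    root = ET.fromstring(raw.replace(NS_TOKEN, "").strip())
    calls = []
    for invoke in root.iter("invoke"):
        args = {param.tag: _to_value(param) for param in invoke}
        calls.append({
            "type": "function",
            "function": {"name": invoke.get("name"), "arguments": json.dumps(args)},
        })
    return calls


raw = (
    ']<]minimax[>[<tool_call>\n'
    ']<]minimax[>[<invoke name="get_weather">]<]minimax[>[<location>Beijing'
    ']<]minimax[>[</location>]<]minimax[>[</invoke>\n'
    ']<]minimax[>[</tool_call>'
)
print(parse_tool_calls(raw))
```

Several <invoke> blocks inside one <tool_call> wrapper yield several entries, which mirrors the parallel-call behaviour described above.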

To return a tool result, append the assistant’s tool_calls turn plus a matching tool message and ask the model to continue — the follow-up answer may place text in reasoning_content as well as content, so print both.
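
A minimal sketch of that round trip follows. It builds the follow-up message list in the standard OpenAI shape: the assistant turn carrying tool_calls, then one tool message per call. The helper name with_tool_results is hypothetical, the message object is a stand-in for response.choices[0].message, and the weather payload is placeholder data.

```python
import json
from types import SimpleNamespace


def with_tool_results(messages, assistant_message, results):
    """Return messages extended with the assistant tool_calls turn and tool replies.

    results maps tool_call id -> result object (serialised to JSON here).
    """
    extended = list(messages)
    extended.append({
        "role": "assistant",
        "content": assistant_message.content or "",
        "tool_calls": [
            {
                "id": call.id,
                "type": "function",
                "function": {
                    "name": call.function.name,
                    "arguments": call.function.arguments,
                },
            }
            for call in assistant_message.tool_calls
        ],
    })
    for call in assistant_message.tool_calls:
        extended.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(results[call.id]),
        })
    return extended


# Stand-in for response.choices[0].message from the example above.
fake = SimpleNamespace(
    content=None,
    tool_calls=[SimpleNamespace(
        id="call_0",
        function=SimpleNamespace(name="get_weather", arguments='{"location": "Beijing"}'),
    )],
)
history = [{"role": "user", "content": "What's the weather in Beijing?"}]
follow_up = with_tool_results(history, fake, {"call_0": {"temp_c": 21}})  # placeholder result
print([m["role"] for m in follow_up])
```

Pass follow_up as messages (with the same tools list) to client.chat.completions.create, then print both message.reasoning_content and message.content from the reply.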

3.3 Multimodal (Vision) Input

Images go through the standard OpenAI image_url content type. The vision tower is always loaded; for image serving add --mm-attention-backend flashinfer_cudnn (the vision-tower backend) to the Blackwell deployment recipe — the text --attention-backend is unchanged (§2.1 note). On AMD, omit --mm-attention-backend and let the encoder use the ROCm default (vision is unvalidated on ROCm — §2.3).

Vision Example (Python)

Example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png"
                    },
                },
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)

Example Output

Output

This image captures a striking and unusual urban scene on what appears to be a busy New York City street.

**Main Subject:**
A man stands on the rear bumper of a yellow taxi cab (an SUV-style cab, likely a Ford Escape hybrid), operating a full-sized ironing board set up across the back of the vehicle. He is wearing a bright yellow long-sleeved shirt and dark pants, and is actively ironing a blue garment, holding an iron in his right hand.

**Vehicles:**
- The yellow SUV taxi on the right is stationary, its rear hatch serving as the ironing platform.
- A second yellow taxi (a sedan) drives past on the left, captured with motion blur.

**Setting:**
Tall city buildings with classic urban architecture, an American flag, and white lane markings — a bustling downtown area, possibly Midtown Manhattan.

Notes:

If the server cannot fetch external URLs, embed the image as a base64 data:image/png;base64,... URI — SGLang decodes it server-side.
Multiple images per message are supported; add more image_url entries to the content list.
Reasoning and tool calling work the same way for multimodal requests — a vision prompt can still produce a <mm:think> trace and/or tool calls.
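
The base64 route from the first note can be sketched as follows. The helper wraps raw image bytes in a data URI and returns a content part that drops into the content list of the vision example above. The name image_content_part is illustrative, and the bytes used here are only the PNG file signature, not a real image.

```python
import base64


def image_content_part(image_bytes, mime="image/png"):
    """Wrap raw image bytes as an OpenAI image_url content part with a data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


# In practice: image_bytes = open("photo.png", "rb").read()  (hypothetical path)
part = image_content_part(b"\x89PNG\r\n\x1a\n")
print(part["image_url"]["url"])
```

For multiple images, call the helper once per image and append each part to the same content list.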

3.4 Prefill-Decode (PD) Disaggregation

PD disaggregation runs prefill and decode on separate SGLang servers linked by an RDMA KV-transfer fabric (mooncake or NIXL), fronted by the PD router. M3 needs one thing beyond a dense model: alongside the main KV cache, every sparse “lightning-indexer” layer keeps a K-only index buffer, and that buffer must reach the decode server too; otherwise sparse attention reads stale state. SGLang transfers it alongside the main KV, reusing the same page mapping, so M3 disaggregates correctly with no extra flags.

Supported topology (the released MiniMax-M3, whose sparse layers are all K-only):

Equal tensor parallelism — the prefill and decode servers run the same --tp.
Single pipeline stage — PP = 1 (the default).
mooncake or NIXL transfer backend over RDMA / InfiniBand.
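
The constraints above are easy to check before launching a pair of servers. The sketch below is a hypothetical pre-flight check, not an SGLang API: each role is described by a plain dict with tp, pp, and backend keys, a shape chosen only for this example.

```python
SUPPORTED_BACKENDS = {"mooncake", "nixl"}


def check_pd_topology(prefill, decode):
    """Return a list of problems with a prefill/decode pair (empty if supported)."""
    problems = []
    if prefill["tp"] != decode["tp"]:
        problems.append("prefill and decode must use the same --tp")
    if prefill.get("pp", 1) != 1 or decode.get("pp", 1) != 1:
        problems.append("only a single pipeline stage (PP = 1) is supported")
    for role, cfg in (("prefill", prefill), ("decode", decode)):
        if cfg["backend"] not in SUPPORTED_BACKENDS:
            problems.append(f"{role}: transfer backend must be mooncake or nixl")
    return problems


print(check_pd_topology(
    {"tp": 4, "backend": "nixl"},
    {"tp": 4, "backend": "nixl"},
))
```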

Launch the prefill server, then the decode server — the same recipe with --disaggregation-mode decode and no bootstrap port. Pick your hardware:

Blackwell · MXFP8
Hopper · bf16

On Blackwell the MXFP8 recipe — fa4, page size 128, deep_gemm MoE, and the MSA fast path (§2.1) — is auto-selected, so each role adds only the --disaggregation-* flags. This is the validated 2 × 4×B200 setup (TP4 prefill on node A, TP4 decode on node B); point --disaggregation-ib-device at your RDMA NIC(s).

Prefill server (node A)

sglang serve \
  --model-path MiniMaxAI/MiniMax-M3-MXFP8 \
  --trust-remote-code \
  --reasoning-parser auto \
  --tool-call-parser auto \
  --tp 4 \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend nixl \
  --disaggregation-ib-device mlx5_0 \
  --host 0.0.0.0 --port 30000 \
  --disaggregation-bootstrap-port 8998

Decode server (node B)

sglang serve \
  --model-path MiniMaxAI/MiniMax-M3-MXFP8 \
  --trust-remote-code \
  --reasoning-parser auto \
  --tool-call-parser auto \
  --tp 4 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend nixl \
  --disaggregation-ib-device mlx5_0 \
  --host 0.0.0.0 --port 30001

On Hopper (H200) M3 runs the bf16 build (§2.4) with Triton MoE and the built-in Triton sparse path, pinned to --page-size 128 so both roles share the page layout the sparse-index transfer relies on. This is the validated 2 × 8×H200 setup (TP8 each).

Prefill server (node A)

sglang serve \
  --model-path MiniMaxAI/MiniMax-M3 \
  --trust-remote-code \
  --reasoning-parser auto \
  --tool-call-parser auto \
  --tp 8 \
  --attention-backend triton \
  --moe-runner-backend triton \
  --page-size 128 \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 \
  --host 0.0.0.0 --port 30000 \
  --disaggregation-bootstrap-port 8998

Decode server (node B)

sglang serve \
  --model-path MiniMaxAI/MiniMax-M3 \
  --trust-remote-code \
  --reasoning-parser auto \
  --tool-call-parser auto \
  --tp 8 \
  --attention-backend triton \
  --moe-runner-backend triton \
  --page-size 128 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 \
  --host 0.0.0.0 --port 30001

Then start the PD router, pointing it at the prefill bootstrap (URL plus its --disaggregation-bootstrap-port) and the decode endpoint:

PD router

python3 -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://<prefill-host>:30000 8998 \
  --decode http://<decode-host>:30001 \
  --policy round_robin \
  --host 0.0.0.0 --port 8000

Clients hit the router exactly like a single server — it splits each request across the two stages transparently:

PD Client Example (Python)

Example

from openai import OpenAI

client = OpenAI(base_url="http://<router-host>:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)

Example Output

Output

2 + 2 = 4

Validation. PD disaggregation preserves output quality: the K-only sparse index transfers arrive intact and disaggregated output matches non-disaggregated serving. GSM8K is scored with the same sgl-eval harness used by the benchmark card above (full 1319-question split, chat with --thinking); see that card for per-platform single-node accuracy.

2 × 4×B200 (TP4+TP4, MXFP8, NIXL over InfiniBand): output matches single-node serving. The 2-node PD serving benchmark (512-token input, 256-token output, 16 concurrent requests) measured a mean TTFT of 1.1 s and a TPOT of 16.6 ms (≈ 60 tok/s per stream, ≈ 2.3k tok/s aggregate). This workload differs from the card’s single-node random isl=2048 / osl=256 / conc=64 row, so the throughput figures are not directly comparable.
2 × 8×H200 (TP8+TP8, bf16, mooncake) — output matches single-node serving.
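
The per-stream figure follows directly from TPOT. How the aggregate figure was computed is not stated, so the sketch below is only a plausibility check under two assumptions: each request takes roughly TTFT plus osl × TPOT, and all 16 streams stay busy. Under those assumptions, counting input plus output tokens lands near the quoted 2.3k tok/s, while counting output tokens alone gives a much lower number.

```python
ttft_s, tpot_s = 1.1, 0.0166          # measured means quoted above
isl, osl, concurrency = 512, 256, 16  # benchmark workload quoted above

per_stream = 1 / tpot_s                              # decode rate of one stream
request_s = ttft_s + osl * tpot_s                    # rough end-to-end time per request
output_rate = concurrency * osl / request_s          # output tokens only
total_rate = concurrency * (isl + osl) / request_s   # input + output tokens

print(f"{per_stream:.0f} tok/s per stream")
print(f"{output_rate:.0f} output tok/s, {total_rate:.0f} total tok/s")
```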
