DeepSeek-V4 - SGLang Documentation

Deployment

Install SGLang

For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.

Python (pip / uv)

Command

pip install --upgrade pip
pip install uv
uv pip install sglang

Then run the Python output of the command panel below in that environment.

Docker

A single image, lmsysorg/sglang:latest, covers the datacenter GPUs in this cookbook (B200 / B300 / GB200 / GB300 / H100 / H200). For RTX PRO 6000 (SM120), use the nightly lmsysorg/sglang:dev instead; SM120 support isn’t in :latest yet (see the RTX PRO 6000 note below).

Command

docker pull lmsysorg/sglang:latest

For how to launch the image, see Install → Method 3: Using Docker. A minimal example (substitute the inner sglang serve ... with whatever the command generator below produces):

Command

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<your-hf-token>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    sglang serve <use args below>

Pick your hardware + recipe to generate the launch command. The three serving strategies cover the common operating points:

Low-Latency — fastest reply for a single user. Pick for chat.
Balanced — good speed with several users at once. Use for typical multi-user serving.
High-Throughput — most tokens per second across many users. Best for batch jobs.

Panel controls (top of the command box):

Python / Docker — bare sglang serve … for an existing SGLang env, or a docker run … sglang serve … wrap against the per-hardware image from the Install SGLang panel above.
⧉ Copy — copies the current command (with whichever framing is active) to your clipboard.
$ cURL — a sample request against localhost:30000 to confirm the server is up.
⚙ Env — edits the placeholders (HOST_IP, PORT, HF_TOKEN, NODE_RANK, NODE0_IP) the command and cURL share. Persists in localStorage across cookbooks.
Verified / Not Verified badge — green when the (hw, variant, quant, strategy, nodes) combo has been run end-to-end on real hardware; yellow when auto-derived from a neighbor and not yet re-checked.
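
The $ cURL control sends a sample request to confirm the server is up. A rough Python equivalent is sketched below using only the standard library. The payload fields, the prompt, and the helper names are illustrative assumptions, not what the panel emits; the /v1/chat/completions path follows the OpenAI-compatible base URL used in the examples later on this page.

```python
import json
import urllib.request


def build_chat_request(host="localhost", port=30000,
                       model="deepseek-ai/DeepSeek-V4-Flash",
                       prompt="Say hello."):
    """Build (but do not send) a minimal chat-completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,  # keep the smoke test cheap
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def server_is_up(request, timeout=10.0):
    """Return True if the server answers the request with HTTP 200."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError, and timeouts all derive from OSError
        return False


if __name__ == "__main__":
    print("server up:", server_is_up(build_chat_request()))
```

Swap host and port for the HOST_IP and PORT values you set under ⚙ Env.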

Playground

The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing. The base is read live from your Deploy selection — only your overrides change. The knobs come in two flavors:

Built-in SGLang features — parallelism overrides (TP / CP / DP-Attention — DP-Attention’s value is the DP degree, with off to disable), MoE backend + EP, reasoning / tool-call parsers, speculative-decoding presets, prefill/decode disaggregation, HiCache tiers, and HiSparse hierarchical sparse attention (decode-role only — the card appears once PD-Disagg mode is set to decode).
DeepSeek-V4 specific features — MegaMoE W4A8 / W4A4 fused kernel (Blackwell only).

Lines highlighted green are added by your overrides; lines with red strikethrough were in the verified base but stripped by an override. When no override differs from the base cell, the playground inherits the base’s Verified badge; any actual change flips it to Not Verified until the new configuration is run end-to-end and submitted back.

Panel controls reuse Python / Docker · ⧉ Copy · $ cURL · ⚙ Env from the Deploy panel, plus one extra:

Submit ↗ — opens a pre-filled GitHub issue so you can land your override combo as a new verified cookbook cell. Shown only while the badge says Not Verified; click it once you’ve actually run the command on your hardware and confirmed it works.

1. Model Introduction

DeepSeek-V4 is the next-generation Mixture-of-Experts model from DeepSeek, released 2026-04-24 under an MIT License. It ships as two Instruct repos (one per variant) plus matching Base repos:

Variant	Total params	Active (MoE)	Use
DeepSeek-V4-Flash	284B	13B	single-node serving on B200 / B300 / GB200 / GB300 / H200 (TP=4); H100 (TP=8)
DeepSeek-V4-Pro	1.6T	49B	high-capacity: B200 / B300 (TP=8) · GB300 (TP=4) · H200 FP4 (TP=8) · GB200 (2-node, TP=8) · H200 FP8 (2-node, TP=16) · H100 (2-node, TP=16)

Both Instruct repos ship as FP4 MoE experts + FP8 attention / dense, so one mixed-precision checkpoint covers every FP4-capable GPU. Matching *-Base repos ship pure FP8 mixed and are for further pre-training only, not for chat or tool calling.

Highlights: hybrid CSA + HCA attention (~27% inference FLOPs / ~10% KV cache vs DSv3.2 at 1M context), manifold-constrained hyper-connections (mHC), Muon optimizer, 1M-token context (32T+ pre-training tokens), three reasoning modes (Non-think / Think High / Think Max; use ≥ 384K context for Think Max), and a dedicated encoding_dsv4.encode_messages Python encoder + DSML tool-call grammar.

Recommended generation: temperature=1.0, top_p=1.0.

Resources: HuggingFace · Flash · Pro · ModelScope · Flash · Pro.

2. Configuration Tips

Concurrency & DeepEP dispatch buffer

The following must hold: max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK. Violating it overflows DeepEP’s dispatch buffer at steady-state load (deep_ep.cpp:1105). When tuning, change --cuda-graph-max-bs, --max-running-requests, and the env var together. The generator currently picks conservative values (mirroring an internal stress-test matrix). They run safely out of the box but likely leave throughput on the table, so tune them up toward your actual workload’s peak concurrency and report findings back so the defaults can be revised.

MTP (Multi-Token Prediction, EAGLE)

low-latency: steps=3, draft-tokens=4 → largest win at bs=1.
balanced: steps=1, draft-tokens=2 → gentler MTP, reduces throughput hit at higher batch.
high-throughput: MTP disabled — at saturation the verify step costs more than it saves.
MTP runs on the v2 speculative path (SGLANG_ENABLE_SPEC_V2, enabled by default).
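
The dispatch-buffer rule and the MTP presets above combine into a simple sizing check. The sketch below is illustrative only: the function names are hypothetical, the budget of 1024 tokens per rank is a made-up example value, and counting an MTP-disabled request as one token per step is an assumption rather than something stated in this cookbook.

```python
# Draft-token counts per strategy, taken from the MTP presets above.
PRESET_DRAFT_TOKENS = {"low-latency": 4, "balanced": 2, "high-throughput": None}


def tokens_per_request(mtp_draft_tokens):
    """Tokens one running request contributes per step.

    Assumption: with MTP disabled (None or 0) a request counts as 1 token.
    """
    return mtp_draft_tokens if mtp_draft_tokens else 1


def satisfies_dispatch_budget(max_running_requests, mtp_draft_tokens,
                              dispatch_tokens_per_rank):
    """Check max-running-requests x MTP_draft_tokens <= dispatch budget."""
    used = max_running_requests * tokens_per_request(mtp_draft_tokens)
    return used <= dispatch_tokens_per_rank


def max_safe_running_requests(dispatch_tokens_per_rank, mtp_draft_tokens):
    """Largest --max-running-requests that still satisfies the rule."""
    return dispatch_tokens_per_rank // tokens_per_request(mtp_draft_tokens)


if __name__ == "__main__":
    budget = 1024  # hypothetical SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK
    for strategy, draft in PRESET_DRAFT_TOKENS.items():
        limit = max_safe_running_requests(budget, draft)
        print(f"{strategy}: max-running-requests <= {limit}")
```

If you raise --max-running-requests past the computed limit, raise the env var with it instead of changing one in isolation.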

EPLB + DeepEP Waterfill (Experimental)

For recorded/static EPLB reproduction, first record an expert-distribution file by following Capture expert selection distribution in MoE models. For reproduction runs, use the generated expert_distribution_recorder_*.pt as the initial expert location. This feature requires the latest main branch. For non-PD reproduction, use:

Command

--moe-a2a-backend deepep \
--deepep-mode auto \
--init-expert-location /path/to/expert_distribution_recorder_*.pt \
--enable-deepep-waterfill

For PD-Disagg reproduction, use normal mode on the prefill server and low_latency mode on the decode server. Add the same --init-expert-location flag to both commands:

Command

# prefill
--moe-a2a-backend deepep \
--deepep-mode normal \
--init-expert-location /path/to/expert_distribution_recorder_*.pt \
--enable-deepep-waterfill

# decode
--moe-a2a-backend deepep \
--deepep-mode low_latency \
--init-expert-location /path/to/expert_distribution_recorder_*.pt \
--enable-deepep-waterfill
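
The three flag sets above differ only in --deepep-mode. If you script your launches, the mapping can be kept in one place. This is a minimal sketch: the helper name and the role labels ("non-pd", "prefill", "decode") are hypothetical, and the flags are exactly the ones shown in the commands above.

```python
# DeepEP mode per serving role, as used in the commands above.
DEEPEP_MODE_BY_ROLE = {
    "non-pd": "auto",
    "prefill": "normal",
    "decode": "low_latency",
}


def waterfill_args(role, expert_location_path):
    """Return the EPLB + DeepEP Waterfill flags for one server role."""
    if role not in DEEPEP_MODE_BY_ROLE:
        raise ValueError(f"unknown role: {role!r}")
    return [
        "--moe-a2a-backend", "deepep",  # Waterfill requires the DeepEP backend
        "--deepep-mode", DEEPEP_MODE_BY_ROLE[role],
        "--init-expert-location", expert_location_path,
        "--enable-deepep-waterfill",
    ]


if __name__ == "__main__":
    # Pass the concrete recorder file you generated, not a glob pattern.
    recorded = "/path/to/expert_distribution_recorder_*.pt"
    for role in ("prefill", "decode"):
        print(role, " ".join(waterfill_args(role, recorded)))
```

Append the returned list to the rest of your sglang serve arguments for that role.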

You can also add --ep-num-redundant-experts and --eplb-algorithm to customize EPLB placement. MegaMoE is not supported with this DeepEP Waterfill recipe yet. Waterfill routes the shared expert through DeepEP for load balancing, so --enable-deepep-waterfill requires --moe-a2a-backend deepep.

FP4 Indexer (Experimental)

DeepSeek-V4 uses the default indexer path unless --enable-deepseek-v4-fp4-indexer is set. Enable this flag to use the experimental FP4 C4 indexer on SM100 GPUs with DeepGEMM FP4 indexer support. This path is intended for decode-heavy long-context workloads where reducing indexer cache bandwidth is beneficial.

Command

# Please use the latest main branch for this feature.
sglang serve \
  --model-path deepseek-ai/DeepSeek-V4-Flash \
  --tp 4 \
  --moe-runner-backend flashinfer_mxfp4 \
  --enable-deepseek-v4-fp4-indexer

Hopper (H100 / H200) note

Two options are available for running DeepSeek-V4 on Hopper:

Original FP4 checkpoints: use the W4A16 MoE kernels (Marlin), which the command generator picks for Hopper cells. This path works on both H100 and H200 and is the only option for H100 (no FP8 path). It is TP-only; on H200 the Pro variant fits on a single 8-GPU node, while H100 Pro needs 2 nodes (TP=16).
Converted FP8 checkpoints (H200 only): pre-repackaged FP8 weights at sgl-project/DeepSeek-V4-Flash-FP8 and sgl-project/DeepSeek-V4-Pro-FP8 unlock DP-attention + DeepEP and richer parallelism (e.g. Pro TP=16 across 2 nodes).

PD-Disagg recipes on H200 may require docker run --privileged --ulimit memlock=-1 (or --device /dev/infiniband:/dev/infiniband --cap-add IPC_LOCK) so that mooncake can discover the IB HCAs. Without IB exposure, mooncake silently falls back to TCP, which can lead to garbled KV transfer on large checkpoints.

RTX PRO 6000 (SM120 / Blackwell Desktop) note

RTX PRO 6000 (96 GB) runs Flash only; V4-Pro doesn’t fit on 8× 96 GB. It uses the low-latency / TP-only recipe (TP=4, single node) with the Marlin W4A16 MoE runner and --mem-fraction-static 0.70; the Deploy panel greys out the other recipes for this card. HiCache and MegaMoE are not supported on RTX PRO 6000. For Docker, use the nightly lmsysorg/sglang:dev image, because SM120 support isn’t in lmsysorg/sglang:latest yet (the Deploy panel’s Docker mode already points this card at :dev).

MegaMoE

MegaMoE fuses expert dispatch + GEMM into a single kernel for higher throughput on MoE layers. To enable it, use the MegaMoE chip in the Playground above; the Playground swaps --moe-a2a-backend deepep for --moe-a2a-backend megamoe and adds the relevant env vars automatically. Two variants are exposed:

W4A8 — default MegaMoE kernel (FP4 weights, FP8 activations).
W4A4 — adds SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS=1 and SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND=1 to run the custom W4A4 kernel (FP4 activations). Higher throughput with negligible accuracy drop (~89.5 GPQA on Pro).

Notes:

MegaMoE is only supported on Blackwell GPUs (B200 / B300 / GB200 / GB300). The chip is hidden when the Deploy panel’s base cell sits on Hopper (H100 / H200).
MegaMoE is only wired into the high-throughput recipe on Blackwell (per sgl-project/sglang#26451). The chip is hidden on low-latency and balanced — switch to high-throughput to expose it.
When running MegaMoE, don’t set --moe-runner-backend manually.
Adjust SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK based on your workload and memory usage. A higher token limit requires more HBM (recommended: 8320 for high-throughput).
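
For scripted launches, the MegaMoE settings described above can be assembled by hand. The sketch below is an approximation of what the Playground does, not its implementation: the helper names are hypothetical, and only the env vars and flags named in this section are included (the Playground may set others).

```python
def megamoe_env(variant="W4A8", max_tokens_per_rank=8320):
    """Env vars for a MegaMoE variant, limited to those named in this section."""
    if variant not in ("W4A8", "W4A4"):
        raise ValueError(f"unknown MegaMoE variant: {variant!r}")
    env = {
        # 8320 is the value recommended above for high-throughput.
        "SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK": str(max_tokens_per_rank),
    }
    if variant == "W4A4":
        env["SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS"] = "1"
        env["SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND"] = "1"
    return env


def megamoe_args(base_args):
    """Swap the a2a backend to megamoe; reject a manual MoE runner backend."""
    if "--moe-runner-backend" in base_args:
        raise ValueError("do not set --moe-runner-backend when running MegaMoE")
    args = list(base_args)
    for i, arg in enumerate(args[:-1]):
        if arg == "--moe-a2a-backend":
            args[i + 1] = "megamoe"
    return args


if __name__ == "__main__":
    print(megamoe_env("W4A4"))
    print(megamoe_args(["--tp", "8", "--moe-a2a-backend", "deepep"]))
```

Merge the returned dict into the process environment (for example via subprocess.run(..., env={**os.environ, **megamoe_env()})) before starting sglang serve.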

GB300 PD-Disagg cross-pod MNNVL

On some GB300 clusters with cross-pod KV transfer over NVLink, mooncake may fail with nvlink_transport.cpp:497 Requested address ... not found!. If this happens, prepend MC_FORCE_MNNVL=1 NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1 to both the prefill and decode sglang serve commands.

3. Advanced Usage

3.1 Reasoning

Enable the deepseek-v4 reasoning parser (toggle Reasoning Parser in the Parsers card of the Playground above) to separate thinking from the final answer into reasoning_content vs content.

Streaming with Thinking Process (Python)

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
has_answer = False

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
        has_thinking = True
        print(delta.reasoning_content, end="", flush=True)

    if delta.content:
        if has_thinking and not has_answer:
            print("\n=============== Content =================", flush=True)
            has_answer = True
        print(delta.content, end="", flush=True)

print()

Example Output

Output

We are asked: "What is 15% of 240?" This is a simple percentage problem. I need to provide a step-by-step solution. The user wants the solution explained step by step. I'll calculate 15% of 240: 0.15 * 240 = 36. I'll break it down into steps: understand what percent means, convert percentage to decimal or fraction, then multiply. I'll present the answer clearly.</think>To find 15% of 240, follow these steps:

**Step 1: Understand the meaning of percent**
"Percent" means "per hundred," so 15% means 15 out of every 100, or \( \frac{15}{100} \).

**Step 2: Convert the percentage to a decimal or fraction**
\( 15\% = \frac{15}{100} = 0.15 \)

**Step 3: Multiply by the given number**
Multiply the decimal form by 240:
\( 0.15 \times 240 \)

**Step 4: Perform the multiplication**
\( 0.15 \times 240 = 36 \)

**Answer:** 15% of 240 is **36**.
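
The sample output above shows the thinking text and the final answer joined by a literal </think> marker. If you receive raw text in that shape, for instance (an assumption here) because the reasoning parser was not enabled on the server, a small fallback splitter can separate the two parts client-side. The function name is hypothetical, and this is a sketch, not a replacement for the server-side parser.

```python
def split_thinking(raw_text, marker="</think>"):
    """Split raw model text into (thinking, answer) around the first marker.

    If the marker is absent, the whole text is treated as the answer.
    """
    thinking, found, answer = raw_text.partition(marker)
    if not found:
        return "", raw_text.strip()
    return thinking.strip(), answer.strip()


if __name__ == "__main__":
    raw = "0.15 * 240 = 36.</think>15% of 240 is **36**."
    thinking, answer = split_thinking(raw)
    print("thinking:", thinking)
    print("answer:", answer)
```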

3.2 Tool Calling

Enable the deepseekv4 tool-call parser (toggle Tool Call Parser in the Parsers card of the Playground above) to surface structured tool calls via message.tool_calls.

Python Example with Thinking Process

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
tool_calls_accumulator = {}

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
        has_thinking = True
        print(delta.reasoning_content, end="", flush=True)

    if getattr(delta, "tool_calls", None):
        if has_thinking and thinking_started:
            print("\n=============== Content =================\n", flush=True)
            thinking_started = False
        for tool_call in delta.tool_calls:
            index = tool_call.index
            if index not in tool_calls_accumulator:
                tool_calls_accumulator[index] = {"name": None, "arguments": ""}
            if tool_call.function:
                if tool_call.function.name:
                    tool_calls_accumulator[index]["name"] = tool_call.function.name
                if tool_call.function.arguments:
                    tool_calls_accumulator[index]["arguments"] += tool_call.function.arguments

    if delta.content:
        print(delta.content, end="", flush=True)

for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"Tool Call: {tool_call['name']}")
    print(f"   Arguments: {tool_call['arguments']}")

print()

Example Output

Output

The user wants to know the weather in Beijing. I'll use the get_weather function with Beijing as the location. I don't need to specify a unit, so I'll just use the default.</think>

<｜DSML｜tool_calls>
<｜DSML｜invoke name="get_weather">
<｜DSML｜parameter name="location" string="true">Beijing</｜DSML｜parameter>
</｜DSML｜invoke>
</｜DSML｜tool_calls>
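
Once the stream ends, the accumulated calls still have to be executed. The sketch below takes an accumulator in the same shape as tool_calls_accumulator from the example (index mapped to name and arguments) and dispatches each call to a local function. It assumes the arguments string is complete JSON after accumulation, as in the OpenAI-style API; get_weather here is a hypothetical stub, and TOOL_REGISTRY and run_tool_calls are illustrative names.

```python
import json


def get_weather(location, unit="celsius"):
    """Hypothetical stub; replace with a real weather lookup."""
    return {"location": location, "unit": unit, "forecast": "unknown"}


# Maps tool names declared in `tools` to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(accumulator):
    """Execute accumulated tool calls in index order and collect results."""
    results = []
    for index, call in sorted(accumulator.items()):
        func = TOOL_REGISTRY.get(call["name"])
        if func is None:
            results.append({"index": index, "error": f"unknown tool: {call['name']}"})
            continue
        try:
            kwargs = json.loads(call["arguments"] or "{}")
        except json.JSONDecodeError as exc:
            results.append({"index": index, "error": f"bad arguments: {exc}"})
            continue
        results.append({"index": index, "result": func(**kwargs)})
    return results


if __name__ == "__main__":
    accumulated = {0: {"name": "get_weather", "arguments": '{"location": "Beijing"}'}}
    print(run_tool_calls(accumulated))
```

Each result would normally be sent back to the model as a tool-role message in a follow-up request.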

3.3 HiCache (Hierarchical KV Caching)

HiCache enables multi-tier KV cache offloading (GPU → CPU → Storage), significantly expanding effective context capacity for long-context and multi-turn scenarios. Combined with UnifiedRadixTree, it provides intelligent prefix caching across all tiers. To enable HiCache, open the HiCache card in the Playground above and flip Enable:

L2 (GPU + CPU) — leave Storage on auto (default). Cold KV pages spill to CPU pinned memory only.
L3 (GPU + CPU + Storage) — pick a Storage backend (file / mooncake / hf3fs / nixl); the Playground emits the canonical page_first_direct mem-layout + direct IO backend + wait_complete prefetch policy, matching the HiCache best-practices recipe.

The Write policy knob defaults to write_through (the upstream default); switch to write_back / write_through_selective to trade durability for write speed when the storage tier is slow. For more details, see the HiCache documentation.
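
The tier choices above reduce to a small decision table. The sketch below restates them as a validated settings dict, which can help when reviewing a configuration before applying it in the Playground. The function name and dictionary keys are hypothetical labels, not SGLang CLI flags; only the values come from this section.

```python
STORAGE_BACKENDS = {"file", "mooncake", "hf3fs", "nixl"}
WRITE_POLICIES = {"write_through", "write_back", "write_through_selective"}


def hicache_plan(storage="auto", write_policy="write_through"):
    """Describe the HiCache setup implied by the Storage and Write policy knobs."""
    if write_policy not in WRITE_POLICIES:
        raise ValueError(f"unknown write policy: {write_policy!r}")
    if storage == "auto":
        # L2: cold KV pages spill to CPU pinned memory only.
        return {"tiers": ("GPU", "CPU"), "write_policy": write_policy}
    if storage not in STORAGE_BACKENDS:
        raise ValueError(f"unknown storage backend: {storage!r}")
    # L3: the Playground emits these canonical settings alongside the backend.
    return {
        "tiers": ("GPU", "CPU", "Storage"),
        "storage_backend": storage,
        "mem_layout": "page_first_direct",
        "io_backend": "direct",
        "prefetch_policy": "wait_complete",
        "write_policy": write_policy,
    }


if __name__ == "__main__":
    print(hicache_plan())
    print(hicache_plan("mooncake", "write_back"))
```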
