Deployment
Install SGLang
Install SGLang
For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.Then run the Python output of the command panel below in that environment.
- Python (pip / uv)
- Docker
Command
- Low-Latency — fastest reply for a single user. Pick for chat.
- Balanced — good speed with several users at once. Use for typical multi-user serving.
- High-Throughput — most tokens per second across many users. Best for batch jobs.
Panel controls (top of the command box):
- Python / Docker — bare
sglang serve …for an existing SGLang env, or adocker run … sglang serve …wrap against the per-hardware image from the Install SGLang panel above. - ⧉ Copy — copies the current command (with whichever framing is active) to your clipboard.
- $ cURL — a sample request against
localhost:30000to confirm the server is up. - ⚙ Env — edits the placeholders (
HOST_IP,PORT,HF_TOKEN,NODE_RANK,NODE0_IP) the command and cURL share. Persists in localStorage across cookbooks. - Verified / Not Verified badge — green when the
(hw, variant, quant, strategy, nodes)combo has been run end-to-end on real hardware; yellow when auto-derived from a neighbor and not yet re-checked.
Playground
The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing. The base is read live from your Deploy selection — only your overrides change. The knobs come in two flavors:- Built-in SGLang features — parallelism overrides (TP / CP / DP-Attention — DP-Attention’s value is the DP degree, with
offto disable), MoE backend + EP, reasoning / tool-call parsers, speculative-decoding presets, prefill/decode disaggregation, HiCache tiers, and HiSparse hierarchical sparse attention (decode-role only — the card appears once PD-Disagg mode is set to decode). - DeepSeek-V4 specific features — MegaMoE W4A8 / W4A4 fused kernel (Blackwell only).
Panel controls reuse Python / Docker · ⧉ Copy · $ cURL · ⚙ Env from the Deploy panel, plus one extra:
- Submit ↗ — opens a pre-filled GitHub issue so you can land your override combo as a new verified cookbook cell. Shown only while the badge says Not Verified; click it once you’ve actually run the command on your hardware and confirmed it works.
1. Model Introduction
DeepSeek-V4 is the next-generation Mixture-of-Experts model from DeepSeek, released 2026-04-24 under an MIT License. It ships as two Instruct repos (one per variant) plus matching Base repos:| Variant | Total params | Active (MoE) | Use |
|---|---|---|---|
| DeepSeek-V4-Flash | 284B | 13B | single-node serving on B200 / B300 / GB200 / GB300 / H200 (TP=4); H100 (TP=8) |
| DeepSeek-V4-Pro | 1.6T | 49B | high-capacity: B200 / B300 (TP=8) · GB300 (TP=4) · H200 FP4 (TP=8) · GB200 (2-node, TP=8) · H200 FP8 (2-node, TP=16) · H100 (2-node, TP=16) |
*-Base repos ship pure FP8 mixed and are for further pre-training only — not for chat or tool calling.
Highlights: hybrid CSA + HCA attention (~27% inference FLOPs / ~10% KV cache vs DSv3.2 at 1M context), manifold-constrained hyper-connections (mHC), Muon optimizer, 1M-token context (32T+ pre-training tokens), three reasoning modes (Non-think / Think High / Think Max — use ≥ 384K context for Think Max), and a dedicated encoding_dsv4.encode_messages Python encoder + DSML tool-call grammar.
Recommended generation: temperature=1.0, top_p=1.0.
Resources: HuggingFace · Flash · Pro · ModelScope · Flash · Pro.
2. Configuration Tips
Concurrency & DeepEP dispatch buffer Must hold:max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK. Violating it blows DeepEP’s dispatch buffer at steady-state load (deep_ep.cpp:1105). When tuning, move --cuda-graph-max-bs, --max-running-requests, and the env together.
The generator currently picks values on the conservative side (mirroring an internal stress-test matrix). They run safely out of the box but likely leave throughput on the table — please tune them up toward your actual workload’s peak concurrency and report findings back so the defaults can be revised.
MTP (Multi-Token Prediction, EAGLE)
low-latency: steps=3, draft-tokens=4 → largest win at bs=1.balanced: steps=1, draft-tokens=2 → gentler MTP, reduces throughput hit at higher batch.high-throughput: MTP disabled — at saturation the verify step costs more than it saves.- MTP runs on the v2 speculative path (
SGLANG_ENABLE_SPEC_V2, enabled by default).
expert_distribution_recorder_*.pt as
the initial expert location. Please checkout to latest main branch for this feature.
For non-PD reproduction, use:
Command
normal mode on the prefill server and
low_latency mode on the decode server. Add the same --init-expert-location
flag to both commands:
Command
--ep-num-redundant-experts and --eplb-algorithm to customize
EPLB placement.
MegaMoE is not supported with this DeepEP Waterfill recipe yet. Waterfill routes
the shared expert through DeepEP for load balancing, so --enable-deepep-waterfill
requires --moe-a2a-backend deepep.
FP4 Indexer (Experimental)
DeepSeek-V4 uses the default indexer path unless --enable-deepseek-v4-fp4-indexer is set. Enable this flag to use the experimental FP4 C4 indexer on SM100 GPUs with DeepGEMM FP4 indexer support. This path is intended for decode-heavy long-context workloads where reducing indexer cache bandwidth is beneficial.
Command
- Original FP4 checkpoints — apply the W4A16 MoE kernels (Marlin) as the command generator picks for Hopper cells. This path works on both H100 and H200 and is the only option for H100 (no FP8 path). It is TP-only; on H200 the Pro variant fits on a single 8-GPU node, while H100 Pro needs 2 nodes (TP=16).
- Converted FP8 checkpoints (H100 and H200 only) — pre-repackaged FP8 weights at
sgl-project/DeepSeek-V4-Flash-FP8andsgl-project/DeepSeek-V4-Pro-FP8unlock DP-attention + DeepEP and richer parallelism (e.g. Pro TP=16 across 2 nodes).
docker run --privileged --ulimit memlock=-1
(or --device /dev/infiniband:/dev/infiniband --cap-add IPC_LOCK) so mooncake
can discover the IB HCAs; without IB exposure mooncake silently falls back to
TCP, which can lead to garbled KV transfer on large checkpoints.
RTX PRO 6000 (SM120 / Blackwell Desktop) note
RTX PRO 6000 (96 GB) runs Flash only — V4-Pro doesn’t fit on 8× 96 GB. It uses the
low-latency / TP-only recipe (TP=4, single node) with the Marlin W4A16 MoE runner and
--mem-fraction-static 0.70; the Deploy panel greys out the other recipes for this card.
HiCache and MegaMoE are not supported on RTX PRO 6000. For Docker, use the nightly lmsysorg/sglang:dev image — SM120 support isn’t in lmsysorg/sglang:latest yet (the Deploy panel’s Docker mode already points this card at :dev).
MegaMoE
MegaMoE fuses expert dispatch + GEMM into a single kernel for higher throughput
on MoE layers. To enable it, use the MegaMoE chip in the Playground
below — the playground will swap --moe-a2a-backend deepep for
--moe-a2a-backend megamoe and add the relevant env vars automatically.
Two variants are exposed:
- W4A8 — default MegaMoE kernel (FP4 weights, FP8 activations).
- W4A4 — adds
SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS=1andSGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND=1to run the custom W4A4 kernel (FP4 activations). Higher throughput with negligible accuracy drop (~89.5 GPQA on Pro).
- MegaMoE is only supported on Blackwell GPUs (B200 / B300 / GB200 / GB300). The chip is hidden when the Deploy panel’s base cell sits on Hopper (H100 / H200).
- MegaMoE is only wired into the
high-throughputrecipe on Blackwell (per sgl-project/sglang#26451). The chip is hidden onlow-latencyandbalanced— switch tohigh-throughputto expose it. - When running MegaMoE, don’t set
--moe-runner-backendmanually. - Adjust
SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANKbased on your workload and memory usage. Setting higher number of tokens for MegaMoE requires more HBM space (recommended: 8320 for high-throughput).
nvlink_transport.cpp:497 Requested address ... not found!. If
this happens, prepend MC_FORCE_MNNVL=1 NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1
to both prefill and decode sglang serve commands.
3. Advanced Usage
3.1 Reasoning
Enable thedeepseek-v4 reasoning parser (toggle Reasoning Parser in the Parsers card of the Playground above) to separate thinking from the final answer into reasoning_content vs content.
Streaming with Thinking Process (Python)
Streaming with Thinking Process (Python)
Example
Example Output
Example Output
Output
3.2 Tool Calling
Enable thedeepseekv4 tool-call parser (toggle Tool Call Parser in the Parsers card of the Playground above) to surface structured tool calls via message.tool_calls.
Python Example with Thinking Process
Python Example with Thinking Process
Example
Example Output
Example Output
Output
3.3 HiCache (Hierarchical KV Caching)
HiCache enables multi-tier KV cache offloading (GPU → CPU → Storage), significantly expanding effective context capacity for long-context and multi-turn scenarios. Combined with UnifiedRadixTree, it provides intelligent prefix caching across all tiers. To enable HiCache, open the HiCache card in the Playground above and flip Enable:- L2 (GPU + CPU) — leave Storage on
auto(default). Cold KV pages spill to CPU pinned memory only. - L3 (GPU + CPU + Storage) — pick a Storage backend (
file/mooncake/hf3fs/nixl); the Playground emits the canonicalpage_first_directmem-layout +directIO backend +wait_completeprefetch policy, matching the HiCache best-practices recipe.
write_through (the upstream default); switch to write_back / write_through_selective to trade durability for write speed when the storage tier is slow.
For more details, see the HiCache documentation.