1. Model Introduction
DeepSeek-V4 is the next-generation Mixture-of-Experts model from DeepSeek, released 2026-04-24 under the MIT License. It ships as two Instruct repos (one per variant) plus matching Base repos:

| Variant | Total params | Active (MoE) | Use |
|---|---|---|---|
| DeepSeek-V4-Flash | 284B | 13B | single-node serving: B200 / GB300 / H200 on 4 GPUs |
| DeepSeek-V4-Pro | 1.6T | 49B | high-capacity: B200 8 GPUs / GB300 4 GPUs / H200 16 GPUs (2 nodes) |

The Base repos (DeepSeek-V4-Flash-Base, DeepSeek-V4-Pro-Base) ship pure FP8 mixed-precision weights and are not intended for chat or tool calling.
Key Features (per the official model card):
- Hybrid Attention Architecture — combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) for long-context efficiency. At 1M-token context, DeepSeek-V4-Pro uses only ~27% of per-token inference FLOPs and ~10% of KV cache compared with DeepSeek-V3.2.
- Manifold-Constrained Hyper-Connections (mHC) — strengthens residual connections, improving signal-propagation stability across layers while preserving expressivity.
- Muon optimizer — faster convergence and greater training stability.
- Context length: 1M tokens; pre-trained on 32T+ diverse, high-quality tokens.
- Three reasoning modes: Non-think (fast, intuitive responses), Think High (conscious logical analysis; slower but more accurate), and Think Max (pushes reasoning to its fullest extent). A ≥ 384K context window is recommended when running Think Max.
- Ships with a dedicated Python encoder (encoding_dsv4.encode_messages) plus the DSML tool-call grammar (<|DSML|tool_calls> / <|DSML|invoke> / <|DSML|parameter>).
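As an illustration only, here is what parsing DSML-style tool-call markup by hand could look like. The exact attribute and closing-tag syntax beyond the three token names is an assumption; the official encoding_dsv4 package defines the real grammar.

```python
import re

# Hand-written sample markup. Only the three token names come from the model
# card; the attribute/closing-tag syntax here is an assumption.
sample = (
    "<|DSML|tool_calls>"
    '<|DSML|invoke name="get_weather">'
    '<|DSML|parameter name="city">Paris</|DSML|parameter>'
    "</|DSML|invoke>"
    "</|DSML|tool_calls>"
)

def extract_invocations(text: str) -> list[dict]:
    """Pull (tool name, parameters) pairs out of a DSML-style tool_calls block."""
    calls = []
    invoke_re = r'<\|DSML\|invoke name="([^"]+)">(.*?)</\|DSML\|invoke>'
    param_re = r'<\|DSML\|parameter name="([^"]+)">(.*?)</\|DSML\|parameter>'
    for m in re.finditer(invoke_re, text, re.S):
        params = dict(re.findall(param_re, m.group(2), re.S))
        calls.append({"name": m.group(1), "parameters": params})
    return calls

print(extract_invocations(sample))
# [{'name': 'get_weather', 'parameters': {'city': 'Paris'}}]
```

In production, prefer the structured tool_calls field surfaced by the server-side parser (Section 4.2.2) over string parsing.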
- Recommended sampling: temperature=1.0, top_p=1.0 (per the official model card).
License: MIT.
Resources:
- HuggingFace: DeepSeek-V4-Flash, DeepSeek-V4-Pro
- ModelScope: DeepSeek-V4-Flash, DeepSeek-V4-Pro
2. SGLang Installation
SGLang offers multiple installation methods; choose based on your hardware platform. Refer to the official SGLang installation guide for installation instructions.

Docker Images by Hardware Platform:

| Hardware Platform | Docker Image |
|---|---|
| NVIDIA B200 | lmsysorg/sglang:deepseek-v4-blackwell |
| NVIDIA GB300 | lmsysorg/sglang:deepseek-v4-grace-blackwell |
| NVIDIA H200 | lmsysorg/sglang:deepseek-v4-hopper |
Then launch the server (sglang serve ...) with whatever the command generator below produces:
Command
3. Model Deployment
SGLang supports three main serving recipes for DeepSeek-V4 with different latency/throughput trade-offs (low-latency, balanced, max-throughput), plus specialized recipes for long-context (cp, prefill context-parallel) and prefill/decode disaggregation (pd-disagg). The interactive generator below emits the exact launch command for any (hardware, variant, recipe) combination.
For H200 GPU deployments, use the SGLang checkpoint under sgl-project, not the default DeepSeek checkpoint.

3.1 Basic Configuration
Interactive Command Generator: use the selector below to generate the deployment command for your hardware + recipe combination.

3.2 Configuration Tips
Concurrency & DeepEP dispatch buffer. The following invariant must hold: max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK. Violating it overflows DeepEP's dispatch buffer at steady-state load (deep_ep.cpp:1105). When tuning, adjust --cuda-graph-max-bs, --max-running-requests, and the env var together.
The generator currently picks values on the conservative side (mirroring an internal stress-test matrix). They run safely out of the box but likely leave throughput on the table — please tune them up toward your actual workload’s peak concurrency and report findings back so the defaults can be revised.
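The dispatch-buffer invariant can be checked mechanically before launch. This helper is a minimal sketch; the numbers in the examples are placeholders, not recommended settings.

```python
def deepep_buffer_ok(max_running_requests: int,
                     mtp_draft_tokens: int,
                     max_dispatch_tokens_per_rank: int) -> bool:
    """True when DeepEP's dispatch buffer can absorb peak speculative traffic:
    max-running-requests x MTP draft tokens must not exceed
    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK."""
    return max_running_requests * mtp_draft_tokens <= max_dispatch_tokens_per_rank

# 128 in-flight requests at 4 draft tokens each fits a 512-token buffer exactly;
# doubling concurrency without raising the env var violates the invariant.
print(deepep_buffer_ok(128, 4, 512))  # True
print(deepep_buffer_ok(256, 4, 512))  # False
```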
MTP (Multi-Token Prediction, EAGLE)
- low-latency: steps=3, draft-tokens=4 → largest win at bs=1.
- balanced: steps=1, draft-tokens=2 → gentler MTP; reduces the throughput hit at higher batch sizes.
- max-throughput: MTP disabled — at saturation the verify step costs more than it saves.
- MTP currently requires SGLANG_ENABLE_SPEC_V2=1.
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples: once the server is running (for example via the command generator above), send a request:
Command
PD-Disagg note: if you deployed with the pd-disagg recipe from the generator above, the prefill server is on port 30000, the decode server on port 30001, and the router on port 8000. Client traffic should target http://localhost:8000, not :30000.
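The request in the panel above can be sketched with the standard library alone. The payload shape follows SGLang's OpenAI-compatible /v1/chat/completions endpoint; the model name, prompt, and port are placeholders (use port 8000 for the pd-disagg recipe, per the note above).

```python
import json
import urllib.request

# Minimal chat-completions payload; temperature/top_p follow the
# model card's recommended sampling settings.
payload = {
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "temperature": 1.0,
    "top_p": 1.0,
}

def send(url: str = "http://localhost:30000/v1/chat/completions") -> dict:
    """POST the payload to a running SGLang server and return the parsed body."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# body = send()  # requires a running server
# print(body["choices"][0]["message"]["content"])
```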
4.2 Advanced Usage
4.2.1 Reasoning Parser
Enable the deepseek-v4 reasoning parser (check the box in the command panel above) to separate thinking from the final answer into reasoning_content vs content.
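A minimal sketch of consuming the two fields on the client side. The message dict below is a hand-written stand-in for a real API response, not actual model output.

```python
# With the reasoning parser enabled, each chat message carries the model's
# thinking in `reasoning_content` and the final answer in `content`.
message = {
    "reasoning_content": "The user asks for 2 + 2. That is 4.",
    "content": "2 + 2 = 4.",
}

def split_reasoning(msg: dict) -> tuple[str, str]:
    """Return (thinking, answer); thinking is empty when the model
    responds in Non-think mode."""
    return msg.get("reasoning_content") or "", msg["content"]

thinking, answer = split_reasoning(message)
print(answer)  # 2 + 2 = 4.
```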
Streaming with Thinking Process:
Example
Output
4.2.2 Tool Calling
Enable the deepseekv4 tool-call parser (check the box in the command panel above) to surface structured tool calls via message.tool_calls.
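A sketch of dispatching the structured tool calls on the client side. The response dict is a hand-written stand-in following the OpenAI-compatible tool_calls schema; the get_weather tool is a placeholder.

```python
import json

# Stand-in for a chat-completion message in which the model requested a tool.
response_message = {
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": json.dumps({"city": "Paris"}),
            },
        }
    ],
}

def dispatch(msg: dict, registry: dict) -> list:
    """Run each requested tool from `registry` and collect the results."""
    results = []
    for call in msg.get("tool_calls") or []:
        fn = registry[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        results.append(fn(**args))
    return results

registry = {"get_weather": lambda city: f"Sunny in {city}"}
print(dispatch(response_message, registry))  # ['Sunny in Paris']
```

Tool results would then be appended to the conversation as role="tool" messages and the request re-sent.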
Python Example (with Thinking Process):
Example
Output
5. Benchmark
5.1 Speed Benchmark on Blackwell
Test Environment:
- Hardware: NVIDIA B200 GPU (4x)
- Model: DeepSeek-V4-Flash (FP4)
- Tensor Parallelism: 4
- sglang version: Pending update
5.1.1 Latency-Sensitive Benchmark
- Model Deployment Command: see the command panel above.
- Benchmark Command:
Command
- Test Results:
Output
5.1.2 Throughput-Sensitive Benchmark
- Model Deployment Command: see the command panel above.
- Benchmark Command:
Command
- Test Results:
Output
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Benchmark Command:
Command
- Test Results:
- DeepSeek-V4-Flash (FP4, Blackwell)
- DeepSeek-V4-Flash (FP8, Hopper)
5.2.2 MMLU Benchmark
- Benchmark Command:
Command
- Test Results:
- DeepSeek-V4-Flash (FP4, Blackwell)
- DeepSeek-V4-Flash (FP8, Hopper)
5.3 Speed Benchmark on Hopper
Test Environment:
- Hardware: NVIDIA H200 GPU (4x)
- Model: DeepSeek-V4-Flash (FP8)
- Tensor Parallelism: 4
- sglang version: Pending update
5.3.1 Latency-Sensitive Benchmark
- Model Deployment Command: see the command panel above.
- Benchmark Command:
Command
- Test Results:
Output
5.3.2 Throughput-Sensitive Benchmark
- Model Deployment Command: see the command panel above.
- Benchmark Command:
Command
- Test Results:
Output
