DeepSeek-V4 is the next-generation Mixture-of-Experts model from DeepSeek, released 2026-04-24 under an MIT License. It ships as two Instruct repos (one per variant) plus matching Base repos:
The Instruct repos ship FP4 MoE experts plus FP8 attention and dense layers, so one mixed-precision checkpoint covers all GPUs that support FP4. The Base (pre-trained only) variants, DeepSeek-V4-Flash-Base and DeepSeek-V4-Pro-Base, ship in pure FP8 mixed precision and are not intended for chat or tool calling.

Key Features (per the official model card):
Hybrid Attention Architecture — combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) for long-context efficiency. At 1M-token context, DeepSeek-V4-Pro uses only ~27% of per-token inference FLOPs and ~10% of KV cache compared with DeepSeek-V3.2.
Manifold-Constrained Hyper-Connections (mHC) — strengthens residual connections, improving signal-propagation stability across layers while preserving expressivity.
Muon optimizer — faster convergence and greater training stability.
Context length: 1M tokens; pre-trained on 32T+ diverse, high-quality tokens.
Three reasoning modes: Non-think (fast, intuitive responses), Think High (conscious logical analysis, slower but more accurate), Think Max (push reasoning to its fullest extent). A context window of at least 384K tokens is recommended when running Think Max.
Ships with a dedicated encoding_dsv4.encode_messages Python encoder + DSML tool-call grammar (<|DSML|tool_calls> / <|DSML|invoke> / <|DSML|parameter>).
Recommended Generation Parameters: temperature=1.0, top_p=1.0 (per the official model card).
License: MIT.
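To make the recommended sampling settings (temperature=1.0, top_p=1.0) concrete, here is a minimal sketch of a request builder. `build_chat_request` and `RECOMMENDED_SAMPLING` are hypothetical names introduced for illustration; the model ID and the `chat_template_kwargs` thinking switch are the ones used in the examples on this page.

```python
# Sampling values quoted from the model card recommendation.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 1.0}


def build_chat_request(prompt, model="deepseek-ai/DeepSeek-V4-Flash", thinking=False):
    """Return keyword arguments for client.chat.completions.create().

    Hypothetical helper: it is not part of SGLang or the OpenAI SDK.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **RECOMMENDED_SAMPLING,
        # The "thinking" switch mirrors the streaming examples on this page.
        "extra_body": {"chat_template_kwargs": {"thinking": thinking}},
    }


request = build_chat_request("What is 15% of 240?", thinking=True)
print(request["temperature"], request["top_p"])  # prints: 1.0 1.0
```

With an OpenAI-compatible client already created, the result can be passed as `client.chat.completions.create(**request)`.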
SGLang offers multiple installation methods; choose one based on your hardware platform. Refer to the official SGLang installation guide for instructions.

Docker Image: Use lmsysorg/sglang:latest for all supported hardware platforms (B300 / B200 / GB200 / GB300 / H200 / H100).
SGLang supports three main serving recipes for DeepSeek-V4 with different latency/throughput trade-offs (low-latency, balanced, max-throughput), plus specialized recipes for long-context (cp, prefill context-parallel) and prefill/decode disaggregation (pd-disagg). The interactive generator below emits the exact launch command for any (hardware, variant, recipe) combination.
Concurrency & DeepEP dispatch buffer

Must hold: max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK. Violating this overflows DeepEP’s dispatch buffer at steady-state load (deep_ep.cpp:1105). When tuning, change --cuda-graph-max-bs, --max-running-requests, and the environment variable together.

The generator currently picks conservative values (mirroring an internal stress-test matrix). They run safely out of the box but likely leave throughput on the table; tune them up toward your actual workload’s peak concurrency and report findings back so the defaults can be revised.

MTP (Multi-Token Prediction, EAGLE)
low-latency: steps=3, draft-tokens=4 → largest win at bs=1.
balanced: steps=1, draft-tokens=2 → gentler MTP, reduces throughput hit at higher batch.
max-throughput: MTP disabled — at saturation the verify step costs more than it saves.
MTP currently requires SGLANG_ENABLE_SPEC_V2=1.
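The dispatch-buffer inequality (max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK) can be checked against the per-recipe draft-token counts listed above. The sketch below is illustrative only: the helper names are hypothetical, the buffer size of 1024 is a made-up value, and treating the MTP-disabled recipe as one token per request is an assumption.

```python
# Draft-token counts per recipe, taken from the list above. With MTP disabled
# we ASSUME each running request contributes one token per decode step.
DRAFT_TOKENS = {"low-latency": 4, "balanced": 2, "max-throughput": 1}


def max_safe_running_requests(recipe, dispatch_tokens_per_rank):
    """Largest --max-running-requests value that satisfies the inequality."""
    return dispatch_tokens_per_rank // DRAFT_TOKENS[recipe]


def check_config(recipe, max_running_requests, dispatch_tokens_per_rank):
    """Return the tokens needed, or raise ValueError if the buffer is too small."""
    needed = max_running_requests * DRAFT_TOKENS[recipe]
    if needed > dispatch_tokens_per_rank:
        raise ValueError(
            f"{recipe}: {needed} dispatch tokens needed, "
            f"but only {dispatch_tokens_per_rank} available per rank"
        )
    return needed


# 1024 is a hypothetical SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK value.
for recipe in DRAFT_TOKENS:
    print(recipe, max_safe_running_requests(recipe, 1024))
# low-latency 256
# balanced 512
# max-throughput 1024
```

Running `check_config` before launch is one way to catch a mismatch when only one of the three coupled settings has been changed.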
Hopper (H200) note

We provide two options for running DeepSeek-V4 models on Hopper devices (H200):
Original FP4 checkpoints: Two w4a16 MoE kernels are available for the original FP4 checkpoints: Marlin (--moe-runner-backend marlin) and FlashInfer (--moe-runner-backend flashinfer_mxfp4). This option supports only Tensor Parallelism. The complete Pro model can run on a single H200 node with this option.
Converted FP8 checkpoints: We also provide pre-converted FP8 checkpoints (sgl-project/DeepSeek-V4-Flash-FP8, sgl-project/DeepSeek-V4-Pro-FP8), which support more parallelism and features.
PD-Disagg recipes on H200 may require docker run --privileged --ulimit memlock=-1 (or --device /dev/infiniband:/dev/infiniband --cap-add IPC_LOCK) so mooncake can discover the IB HCAs; without IB exposure mooncake silently falls back to TCP, which can lead to garbled KV transfer on large checkpoints.

MegaMoE

MegaMoE fuses expert dispatch + GEMM into a single kernel for higher throughput on MoE layers. To enable it, use the MegaMoE toggle in the command generator above; the generator will swap --moe-a2a-backend deepep for --moe-a2a-backend megamoe and add the relevant env vars automatically.

Two variants are exposed:
W4A4 — adds SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS=1 and
SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND=1 to run the custom W4A4
kernel (FP4 activations). Higher throughput with negligible accuracy drop
(~89.5 GPQA on Pro).
Notes:
MegaMoE is not supported on Hopper (H100 / H200) nor on the low-latency / balanced / cp settings — it is only wired into the max-throughput recipe on Blackwell. When running MegaMoE, don’t set --moe-runner-backend manually.
Adjust SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK based on your workload and memory usage (recommended: 8320 for max-throughput). A higher token limit for MegaMoE requires more HBM space.
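The command generator sets these variables for you. For scripted launches, the W4A4 environment described in this section could be assembled as in the sketch below. `megamoe_w4a4_env` is a hypothetical helper; the variable names and the 8320 default come from this section.

```python
import os


def megamoe_w4a4_env(max_tokens_per_rank=8320):
    """Environment variables named in this section for the W4A4 variant."""
    return {
        "SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS": "1",
        "SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND": "1",
        # Higher values need more HBM; 8320 is the recommended value for
        # the max-throughput recipe.
        "SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK": str(max_tokens_per_rank),
    }


# Merge into a copy of the current environment rather than mutating os.environ.
env = {**os.environ, **megamoe_w4a4_env()}
print(" ".join(f"{key}={value}" for key, value in megamoe_w4a4_env().items()))
```

The resulting `env` mapping can be handed to `subprocess.run(..., env=env)` together with the launch command emitted by the generator.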
GB300 PD-Disagg cross-pod MNNVL

On some GB300 clusters with cross-pod KV transfer over NVLink, mooncake may fail with nvlink_transport.cpp:497 Requested address ... not found!. If this happens, prepend MC_FORCE_MNNVL=1 NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1 to both the prefill and decode sglang serve commands.
PD-Disagg note: if you deployed with the pd-disagg recipe from the generator above, the prefill server is on port 30000, the decode server on 30001, and the router on port 8000 — client traffic should target http://localhost:8000, not :30000.
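A small sketch of how a client might pick the right endpoint for each deployment style. `client_base_url` is a hypothetical helper, and the `/v1` path on the router is an assumption carried over from the single-server examples on this page.

```python
# Ports as described in the note above. The "/v1" path ASSUMES the router
# exposes the same OpenAI-compatible prefix as a single-server deployment.
PD_DISAGG_PORTS = {"prefill": 30000, "decode": 30001, "router": 8000}
SINGLE_SERVER_PORT = 30000


def client_base_url(pd_disagg, host="localhost"):
    """Return the base URL that client traffic should target."""
    port = PD_DISAGG_PORTS["router"] if pd_disagg else SINGLE_SERVER_PORT
    return f"http://{host}:{port}/v1"


print(client_base_url(pd_disagg=True))   # http://localhost:8000/v1
print(client_base_url(pd_disagg=False))  # http://localhost:30000/v1
```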
Enable the deepseek-v4 reasoning parser (check the box in the command panel above) to separate thinking from the final answer into reasoning_content vs content.
Streaming with Thinking Process (Python)
Example

    from openai import OpenAI

    # Point the OpenAI-compatible client at the local SGLang server.
    client = OpenAI(
        base_url="http://localhost:30000/v1",
        api_key="EMPTY"
    )

    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V4-Flash",
        messages=[
            {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
        ],
        max_tokens=2048,
        extra_body={"chat_template_kwargs": {"thinking": True}},
        stream=True,
    )

    thinking_started = False
    has_thinking = False
    has_answer = False

    for chunk in response:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta

        # Reasoning tokens arrive in reasoning_content when the parser is enabled.
        if getattr(delta, "reasoning_content", None):
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # The final answer arrives in content.
        if delta.content:
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

    print()
Example Output

    We are asked: "What is 15% of 240?" This is a simple percentage problem. I need to provide a step-by-step solution. The user wants the solution explained step by step. I'll calculate 15% of 240: 0.15 * 240 = 36. I'll break it down into steps: understand what percent means, convert percentage to decimal or fraction, then multiply. I'll present the answer clearly.</think>
    To find 15% of 240, follow these steps:

    **Step 1: Understand the meaning of percent**
    "Percent" means "per hundred," so 15% means 15 out of every 100, or \( \frac{15}{100} \).

    **Step 2: Convert the percentage to a decimal or fraction**
    \( 15\% = \frac{15}{100} = 0.15 \)

    **Step 3: Multiply by the given number**
    Multiply the decimal form by 240:
    \( 0.15 \times 240 \)

    **Step 4: Perform the multiplication**
    \( 0.15 \times 240 = 36 \)

    **Answer:** 15% of 240 is **36**.
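When the thinking and the answer are needed as separate strings rather than printed live, the same streaming loop can accumulate them. The sketch below is illustrative: `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only imitate the attributes the loop reads (`choices`, `delta.reasoning_content`, `delta.content`) so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate streamed deltas into (reasoning, answer) strings."""
    reasoning, answer = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        piece = getattr(delta, "reasoning_content", None)
        if piece:
            reasoning.append(piece)
        if getattr(delta, "content", None):
            answer.append(delta.content)
    return "".join(reasoning), "".join(answer)


def fake_chunk(reasoning=None, content=None):
    """Stand-in for a streamed chunk so the sketch runs offline."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


stream = [
    fake_chunk(reasoning="0.15 * 240 "),
    fake_chunk(reasoning="= 36."),
    SimpleNamespace(choices=[]),  # chunks without choices are skipped
    fake_chunk(content="15% of 240 is "),
    fake_chunk(content="36."),
]
reasoning, answer = collect_stream(stream)
print(reasoning)  # 0.15 * 240 = 36.
print(answer)     # 15% of 240 is 36.
```

Against a live server, the `response` iterator from the streaming example can be passed in place of `stream`.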
Enable the deepseekv4 tool-call parser (check the box in the command panel above) to surface structured tool calls via message.tool_calls.
Python Example with Thinking Process
Example

    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:30000/v1",
        api_key="EMPTY"
    )

    # One function tool, described with a JSON Schema for its parameters.
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string", "description": "The city name"},
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                    },
                    "required": ["location"],
                },
            },
        }
    ]

    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V4-Flash",
        messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
        tools=tools,
        extra_body={"chat_template_kwargs": {"thinking": True}},
        stream=True,
    )

    thinking_started = False
    has_thinking = False
    tool_calls_accumulator = {}

    for chunk in response:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta

        if getattr(delta, "reasoning_content", None):
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        if getattr(delta, "tool_calls", None):
            if has_thinking and thinking_started:
                print("\n=============== Content =================\n", flush=True)
                thinking_started = False
            # Tool-call names and arguments arrive in fragments keyed by index.
            for tool_call in delta.tool_calls:
                index = tool_call.index
                if index not in tool_calls_accumulator:
                    tool_calls_accumulator[index] = {"name": None, "arguments": ""}
                if tool_call.function:
                    if tool_call.function.name:
                        tool_calls_accumulator[index]["name"] = tool_call.function.name
                    if tool_call.function.arguments:
                        tool_calls_accumulator[index]["arguments"] += tool_call.function.arguments

        if delta.content:
            print(delta.content, end="", flush=True)

    # Print each fully assembled tool call.
    for index, tool_call in sorted(tool_calls_accumulator.items()):
        print(f"Tool Call: {tool_call['name']}")
        print(f" Arguments: {tool_call['arguments']}")

    print()
Example Output

    The user wants to know the weather in Beijing. I'll use the get_weather function with Beijing as the location. I don't need to specify a unit, so I'll just use the default.</think>
    <|DSML|tool_calls><|DSML|invoke name="get_weather"><|DSML|parameter name="location" string="true">Beijing</|DSML|parameter></|DSML|invoke></|DSML|tool_calls>
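To illustrate how the DSML markup in the raw output above maps to structured calls, here is a minimal regex-based sketch. It is not the server's tool-call parser: `parse_dsml_tool_calls` is a hypothetical function, it covers only the tag shapes visible in the sample, and the reading of `string="true"` as "literal string, otherwise JSON" is an assumption.

```python
import json
import re

INVOKE_RE = re.compile(
    r'<\|DSML\|invoke name="([^"]+)">(.*?)</\|DSML\|invoke>', re.DOTALL
)
PARAM_RE = re.compile(
    r'<\|DSML\|parameter name="([^"]+)" string="(true|false)">(.*?)</\|DSML\|parameter>',
    re.DOTALL,
)


def parse_dsml_tool_calls(text):
    """Extract [{"name": ..., "arguments": {...}}, ...] from raw model text."""
    calls = []
    for name, body in INVOKE_RE.findall(text):
        arguments = {}
        for key, is_string, raw in PARAM_RE.findall(body):
            # ASSUMPTION: string="true" marks a literal string value;
            # any other value is decoded as JSON.
            arguments[key] = raw if is_string == "true" else json.loads(raw)
        calls.append({"name": name, "arguments": arguments})
    return calls


raw = (
    "I'll call get_weather.</think>"
    '<|DSML|tool_calls><|DSML|invoke name="get_weather">'
    '<|DSML|parameter name="location" string="true">Beijing</|DSML|parameter>'
    "</|DSML|invoke></|DSML|tool_calls>"
)
print(parse_dsml_tool_calls(raw))
# [{'name': 'get_weather', 'arguments': {'location': 'Beijing'}}]
```

In normal use the deepseekv4 tool-call parser does this work server-side and the result appears in message.tool_calls; a client-side parser like this is only relevant when inspecting raw output.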
HiCache enables multi-tier KV cache offloading (GPU → CPU → Storage), significantly expanding effective context capacity for long-context and multi-turn scenarios. Combined with UnifiedRadixTree, it provides intelligent prefix caching across all tiers.

To enable HiCache, use the HiCache toggle in the command generator above:
L2 (GPU + CPU): Offloads cold KV pages to CPU memory. Enables SGLANG_ENABLE_UNIFIED_RADIX_TREE=1 for intelligent hierarchical prefix caching.
We use SGLang’s built-in benchmarking tool with its random dataset — real prompts sampled from ShareGPT_Vicuna_unfiltered and then truncated/padded to a controlled length. This dataset contains real conversation data and can better reflect performance in actual use scenarios. To simulate real-world usage patterns, we configure each request with 1024 input tokens and 1024 output tokens, representing typical medium-length conversations with detailed responses.
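The setup described above could be scripted roughly as follows. This is a sketch under assumptions: the flag names reflect `sglang.bench_serving` as commonly documented and should be confirmed with `python3 -m sglang.bench_serving --help`; `bench_serving_argv` is a hypothetical helper; and the `num_prompts` and `max_concurrency` defaults are placeholder values, not the settings behind any published numbers.

```python
import shlex


def bench_serving_argv(host="127.0.0.1", port=30000, num_prompts=512, max_concurrency=64):
    """Build an argv list for a 1024-in / 1024-out random-dataset run."""
    return [
        "python3", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--host", host,
        "--port", str(port),
        "--dataset-name", "random",
        "--random-input-len", "1024",   # input tokens per request
        "--random-output-len", "1024",  # output tokens per request
        "--num-prompts", str(num_prompts),          # placeholder value
        "--max-concurrency", str(max_concurrency),  # placeholder value
    ]


argv = bench_serving_argv()
print(shlex.join(argv))
```

With a server already running, `subprocess.run(argv, check=True)` would start the benchmark. Sweeping `max_concurrency` over several values is a common way to map the latency/throughput curve.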
Model Deployment Command: B200 · DeepSeek-V4-Flash · FP4 · Max-Throughput (MegaMoE W4A4). See the command panel above — flip the MegaMoE toggle to W4A4 to reproduce these numbers; the default Max-Throughput recipe uses --moe-a2a-backend deepep and runs slower.