
1. Model Introduction

DeepSeek-V4 is the next-generation Mixture-of-Experts model from DeepSeek, released 2026-04-24 under the MIT License. It ships as two Instruct repos (one per variant) plus matching Base repos:
| Variant | Total params | Active (MoE) | Use |
|---|---|---|---|
| DeepSeek-V4-Flash | 284B | 13B | single-node serving: B200 / GB300 / H200 on 4 GPUs |
| DeepSeek-V4-Pro | 1.6T | 49B | high-capacity: B200 8 GPU / GB300 4 GPU / H200 16 GPU (2 nodes) |
The Instruct repos ship FP4 MoE experts + FP8 attention / dense layers (one mixed-precision checkpoint covers all GPUs that support FP4). The Base (pre-trained only) variants, DeepSeek-V4-Flash-Base and DeepSeek-V4-Pro-Base, ship in mixed FP8 precision and are not intended for chat or tool calling. Key Features (per the official model card):
  • Hybrid Attention Architecture — combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) for long-context efficiency. At 1M-token context, DeepSeek-V4-Pro uses only ~27% of per-token inference FLOPs and ~10% of KV cache compared with DeepSeek-V3.2.
  • Manifold-Constrained Hyper-Connections (mHC) — strengthens residual connections, improving signal-propagation stability across layers while preserving expressivity.
  • Muon optimizer — faster convergence and greater training stability.
  • Context length: 1M tokens; pre-trained on 32T+ diverse, high-quality tokens.
  • Three reasoning modes: Non-think (fast, intuitive responses), Think High (conscious logical analysis, slower but more accurate), and Think Max (pushes reasoning to its fullest extent). A context window of at least 384K is recommended when running Think Max.
  • Ships with a dedicated encoding_dsv4.encode_messages Python encoder + DSML tool-call grammar (<|DSML|tool_calls> / <|DSML|invoke> / <|DSML|parameter>).
Recommended Generation Parameters: temperature=1.0, top_p=1.0 (per the official model card).
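The recommended sampling parameters can be applied per request through any OpenAI-compatible client. A minimal sketch of such a request body (the endpoint and launch commands are covered in the deployment sections below; nothing here assumes a running server):

```python
import json

# Request body applying the model card's recommended sampling parameters.
payload = {
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 1.0,
    "top_p": 1.0,
}
print(json.dumps(payload, indent=2))
```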

2. SGLang Installation

SGLang offers multiple installation methods; choose one based on your hardware platform. Refer to the official SGLang installation guide for details. Docker Images by Hardware Platform:
| Hardware Platform | Docker Image |
|---|---|
| NVIDIA B200 | lmsysorg/sglang:deepseek-v4-blackwell |
| NVIDIA GB300 | lmsysorg/sglang:deepseek-v4-grace-blackwell |
| NVIDIA H200 | lmsysorg/sglang:deepseek-v4-hopper |
For how to actually launch one of these images, see Install → Method 3: Using Docker. A minimal example (substitute the image tag for your platform and the inner sglang serve ... with whatever the command generator below produces):
Command
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<your-hf-token>" \
    --ipc=host \
    lmsysorg/sglang:deepseek-v4-blackwell \
    sglang serve <use args below>

3. Model Deployment

SGLang supports three main serving recipes for DeepSeek-V4 with different latency/throughput trade-offs (low-latency, balanced, max-throughput), plus specialized recipes for long-context (cp, prefill context-parallel) and prefill/decode disaggregation (pd-disagg). The interactive generator below emits the exact launch command for any (hardware, variant, recipe) combination.
For H200 GPU deployments, use the SGLang checkpoint under sgl-project, not the default DeepSeek checkpoint.

3.1 Basic Configuration

Interactive Command Generator: Use the selector below to generate the deployment command for your hardware + recipe combination.

3.2 Configuration Tips

Concurrency & DeepEP dispatch buffer
The following must hold: max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK. Violating it overflows DeepEP’s dispatch buffer at steady-state load (deep_ep.cpp:1105). When tuning, move --cuda-graph-max-bs, --max-running-requests, and the env together. The generator currently picks values on the conservative side (mirroring an internal stress-test matrix). They run safely out of the box but likely leave throughput on the table; please tune them up toward your actual workload’s peak concurrency and report findings back so the defaults can be revised.
MTP (Multi-Token Prediction, EAGLE)
  • low-latency: steps=3, draft-tokens=4 → largest win at bs=1.
  • balanced: steps=1, draft-tokens=2 → gentler MTP, reduces throughput hit at higher batch.
  • max-throughput: MTP disabled — at saturation the verify step costs more than it saves.
  • MTP currently requires SGLANG_ENABLE_SPEC_V2=1.
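The buffer constraint above is easy to verify before launch. A minimal pre-flight check (the function name and the 128-request example are illustrative, not part of SGLang):

```python
def fits_dispatch_buffer(max_running_requests: int,
                         mtp_draft_tokens: int,
                         max_dispatch_tokens_per_rank: int) -> bool:
    """True if the DeepEP dispatch buffer can absorb peak concurrency:
    max-running-requests x MTP draft tokens must not exceed
    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK."""
    return max_running_requests * mtp_draft_tokens <= max_dispatch_tokens_per_rank

# e.g. the low-latency recipe (draft-tokens=4) with 128 running requests
# needs the env var set to at least 512:
print(fits_dispatch_buffer(128, 4, 512))   # -> True
print(fits_dispatch_buffer(128, 4, 256))   # -> False
```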
Hopper (H200) note
The H200 image and checkpoint are currently being uploaded; a public path is coming shortly.

4. Model Invocation

4.1 Basic Usage

Once the server is running (for example via the command generator above), send a request:
Command
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "What is 15% of 240?"}]
  }'
PD-Disagg note: if you deployed with the pd-disagg recipe from the generator above, the prefill server is on port 30000, the decode server on 30001, and the router on port 8000 — client traffic should target http://localhost:8000, not :30000.

4.2 Advanced Usage

4.2.1 Reasoning Parser

Enable the deepseek-v4 reasoning parser (check the box in the command panel above) to separate thinking from the final answer into reasoning_content vs content. Streaming with Thinking Process:
Example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
has_answer = False

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
        has_thinking = True
        print(delta.reasoning_content, end="", flush=True)

    if delta.content:
        if has_thinking and not has_answer:
            print("\n=============== Content =================", flush=True)
            has_answer = True
        print(delta.content, end="", flush=True)

print()
Output Example:
Output
Pending update — replace with real server output after deployment.

4.2.2 Tool Calling

Enable the deepseekv4 tool-call parser (check the box in the command panel above) to surface structured tool calls via message.tool_calls. Python Example (with Thinking Process):
Example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
tool_calls_accumulator = {}

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
        has_thinking = True
        print(delta.reasoning_content, end="", flush=True)

    if getattr(delta, "tool_calls", None):
        if has_thinking and thinking_started:
            print("\n=============== Content =================\n", flush=True)
            thinking_started = False
        for tool_call in delta.tool_calls:
            index = tool_call.index
            if index not in tool_calls_accumulator:
                tool_calls_accumulator[index] = {"name": None, "arguments": ""}
            if tool_call.function:
                if tool_call.function.name:
                    tool_calls_accumulator[index]["name"] = tool_call.function.name
                if tool_call.function.arguments:
                    tool_calls_accumulator[index]["arguments"] += tool_call.function.arguments

    if delta.content:
        print(delta.content, end="", flush=True)

for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"Tool Call: {tool_call['name']}")
    print(f"   Arguments: {tool_call['arguments']}")

print()
Output Example:
Output
Pending update — replace with real server output after deployment.
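Once the stream completes, each accumulated arguments string is a JSON document that can be parsed and routed to a local handler. A minimal dispatch sketch (the get_weather implementation and the HANDLERS table are hypothetical, for illustration only):

```python
import json

# Hypothetical local implementation of the get_weather tool from the schema above.
def get_weather(location, unit="celsius"):
    # A real application would query a weather service here.
    return {"location": location, "temperature": 22, "unit": unit}

HANDLERS = {"get_weather": get_weather}

def dispatch(accumulated):
    """Run each accumulated tool call; `accumulated` maps stream index ->
    {"name": ..., "arguments": <JSON string>} as built in the loop above."""
    results = []
    for index in sorted(accumulated):
        call = accumulated[index]
        args = json.loads(call["arguments"] or "{}")
        results.append(HANDLERS[call["name"]](**args))
    return results

# Example with the shape produced by the streaming loop:
acc = {0: {"name": "get_weather", "arguments": '{"location": "Beijing"}'}}
print(dispatch(acc))  # -> [{'location': 'Beijing', 'temperature': 22, 'unit': 'celsius'}]
```

The results would then be appended to the conversation as role "tool" messages for a follow-up completion.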

5. Benchmark

5.1 Speed Benchmark on Blackwell

Test Environment:
  • Hardware: NVIDIA B200 GPU (4x)
  • Model: DeepSeek-V4-Flash (FP4)
  • Tensor Parallelism: 4
  • sglang version: Pending update
We use SGLang’s built-in benchmarking tool to evaluate performance on the ShareGPT_Vicuna_unfiltered dataset. This dataset contains real conversation data, so it better reflects performance in real-world scenarios. To simulate typical usage patterns, each request is configured with 1024 input tokens and 1024 output tokens, representing medium-length conversations with detailed responses.
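Given this fixed shape (1024 input + 1024 output tokens per request), aggregate token throughput follows directly from the prompt count and the wall-clock duration that bench_serving reports. A small illustrative helper (the 200 s duration below is made up, not a measured result):

```python
def total_token_throughput(num_prompts, input_len, output_len, duration_s):
    """Tokens processed per second across all requests (illustrative only)."""
    return num_prompts * (input_len + output_len) / duration_s

# e.g. 1000 prompts of 1024+1024 tokens finishing in 200 s:
print(total_token_throughput(1000, 1024, 1024, 200.0))  # -> 10240.0
```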

5.1.1 Latency-Sensitive Benchmark

Command
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 10 \
  --max-concurrency 1
  • Test Results:
Output
Pending update — replace with real bench_serving output after the latency run.

5.1.2 Throughput-Sensitive Benchmark

Command
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 1000 \
  --max-concurrency 100
  • Test Results:
Output
Pending update — replace with real bench_serving output after the throughput run.

5.2 Accuracy Benchmark

5.2.1 GSM8K Benchmark

  • Benchmark Command:
Command
python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 30000
  • Test Results:
    • DeepSeek-V4-Flash (FP4, Blackwell)
      Pending update
      
    • DeepSeek-V4-Flash (FP8, Hopper)
      Pending update
      

5.2.2 MMLU Benchmark

  • Benchmark Command:
Command
cd sglang
bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 10 --port 30000
  • Test Results:
    • DeepSeek-V4-Flash (FP4, Blackwell)
      Pending update
      
    • DeepSeek-V4-Flash (FP8, Hopper)
      Pending update
      

5.3 Speed Benchmark on Hopper

Test Environment:
  • Hardware: NVIDIA H200 GPU (4x)
  • Model: DeepSeek-V4-Flash (FP8)
  • Tensor Parallelism: 4
  • sglang version: Pending update

5.3.1 Latency-Sensitive Benchmark

Command
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 10 \
  --max-concurrency 1
  • Test Results:
Output
Pending update — replace with real bench_serving output after the latency run.

5.3.2 Throughput-Sensitive Benchmark

Command
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 1000 \
  --max-concurrency 100
  • Test Results:
Output
Pending update — replace with real bench_serving output after the throughput run.