1. Model Introduction

DeepSeek V3 is a large-scale Mixture-of-Experts (MoE) language model developed by DeepSeek, designed to deliver strong general-purpose reasoning, coding, and tool-augmented capabilities with high training and inference efficiency. As the latest generation in the DeepSeek model family, DeepSeek V3 introduces systematic architectural and training innovations that significantly improve performance across reasoning, mathematics, coding, and long-context understanding, while maintaining a competitive compute cost. Key highlights include:
  • Efficient MoE architecture: DeepSeek V3 adopts a fine-grained Mixture-of-Experts design with a large number of experts and sparse activation, enabling high model capacity while keeping inference and training costs manageable.
  • Advanced reasoning and coding: The model demonstrates strong performance on mathematical reasoning, logical inference, and real-world coding benchmarks, benefiting from improved data curation and training strategies.
  • Long-context capability: DeepSeek V3 supports extended context lengths, allowing it to handle long documents, complex multi-step reasoning, and agent-style workflows more effectively.
  • Tool use and function calling: The model is trained to support structured outputs and tool invocation, enabling seamless integration with external tools and agent frameworks during inference.

2. SGLang Installation

SGLang offers multiple installation methods; choose the one that best fits your hardware platform and requirements. See the official SGLang installation guide for instructions.

3. Model Deployment

This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels.

3.1 Basic Configuration

Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities.

3.2 Configuration Tips

Recommended GPU configurations by weight type:
Weight Type | Supported Hardware
FP8 (recommended) | 8× H200, 8× B200, 8× MI300X, 2×8× H100/H800/H20
BF16 (upcast from FP8) | 2×8× H200, 2×8× MI300X, 4×8× H100/H800, 4×8× A100/A800
INT8 | 16× A100/A800, 32× L40S, Xeon 6980P CPU, 4× Atlas 800I A3
W4A8 / AWQ / MXFP4 / NVFP4 | 8× H20/H100, 4× H200; 8× H100/A100; 8/4× MI355X/MI350X; 8/4× B200
The official DeepSeek-V3 checkpoint is already in FP8 format — do not add --quantization fp8 when serving it.
DeepGEMM precompilation (NVIDIA Hopper / Blackwell): Precompile GEMM kernels before the first server run to avoid JIT overhead (~10 min):
python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
DeepGEMM is enabled by default on Hopper/Blackwell and can be disabled with SGLANG_ENABLE_JIT_DEEPGEMM=0.
Data Parallelism Attention (--enable-dp-attention): Recommended for high-throughput scenarios with large batch sizes. Reduces KV-cache duplication across TP ranks. Use --enable-dp-attention --tp 8 --dp 8 on a single 8-GPU node. Not recommended for low-latency, small-batch workloads.
NCCL timeout: If model loading is slow and you hit an NCCL timeout, increase it: --dist-timeout 3600.
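The tips above can be folded into a small helper that assembles the launch command. This is a minimal sketch rather than an official tool: the function name `build_launch_command` and its parameters are hypothetical, and it only emits flags that appear in this guide.

```python
import shlex


def build_launch_command(
    model_path="deepseek-ai/DeepSeek-V3",
    tp=8,
    dp=8,
    high_throughput=False,
    dist_timeout=None,
):
    """Assemble an sglang.launch_server argument list from the tips above."""
    cmd = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp),
        "--trust-remote-code",
    ]
    # The official checkpoint is already FP8, so --quantization fp8 is never added.
    if high_throughput:
        # DP attention: suggested for large batches, not for low-latency serving.
        cmd += ["--enable-dp-attention", "--dp", str(dp)]
    if dist_timeout is not None:
        # Raise the NCCL timeout when model loading is slow.
        cmd += ["--dist-timeout", str(dist_timeout)]
    return cmd


# Print a shell-ready command for a high-throughput, slow-loading setup.
print(shlex.join(build_launch_command(high_throughput=True, dist_timeout=3600)))
```

Returning a list instead of a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.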

4. Model Invocation

4.1 Basic Usage

For basic API usage and request examples, please refer to the basic API usage guide in the SGLang documentation.
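As a rough illustration of a basic request, the sketch below builds a single-turn chat body and posts it to the OpenAI-compatible endpoint using only the standard library. The helper names `build_chat_request` and `send_chat_request` are hypothetical, and the address assumes a server listening on localhost port 30000 as in the examples later in this guide.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000/v1"  # assumed local server address


def build_chat_request(prompt, model="deepseek-ai/DeepSeek-V3", max_tokens=256):
    """Return the JSON body for a single-turn chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": max_tokens,
    }


def send_chat_request(body, base_url=BASE_URL):
    """POST the body to /chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


# Inspect the request body; no server is contacted here.
print(json.dumps(build_chat_request("What is the capital of France?"), indent=2))
```

With a server running, `send_chat_request(build_chat_request("..."))` returns the model's answer as a string.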

4.2 Advanced Usage

4.2.1 Reasoning Parser

DeepSeek-V3 supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
Command
python -m sglang.launch_server \
  --model deepseek-ai/DeepSeek-V3 \
  --reasoning-parser deepseek-v3 \
  --tp 8
Streaming with Thinking Process:
Example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.7,
    max_tokens=2048,
    extra_body = {"chat_template_kwargs": {"thinking": True}},
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            # Close thinking section and add content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()
Output Example:
Output
=============== Thinking =================
To determine 15% of a number, follow these steps:

**Step 1: Understand the Problem**
You need to find 15% of a given number. Let's assume the number is 240 for this example.

**Step 2: Convert the Percentage to a Decimal**
To work with percentages in calculations, convert the percentage to its decimal form. To do this, divide the percentage by 100.

\[ 15\% = \frac{15}{100} = 0.15 \]

**Step 3: Multiply the Decimal by the Number**
Now, multiply the decimal form of the percentage by the number you want to find the percentage of.

\[ 0.15 \times 240 \]

**Step 4: Perform the Multiplication**
Calculate the product:

\[ 0.15 \times 240 = 36 \]

**Step 5: Conclusion**
Therefore, 15% of 240 is:

\boxed{36}

The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36.
Note: The reasoning parser captures the model’s step-by-step thinking process, allowing you to see how the model arrives at its conclusions.
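When the reasoning and the answer need to be stored rather than printed, the same streaming loop can be wrapped in a function. The sketch below is one possible shape: `collect_stream` and `fake_chunk` are hypothetical names, and the fake chunks only imitate the `choices[0].delta` layout used in the example above so that the snippet runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Split a chat-completion stream into (reasoning, answer) strings."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # reasoning_content may be missing on some chunks, hence getattr
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(**fields):
    """Hypothetical stand-in for a streamed chunk, used only for this demo."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(**fields))])


demo = [
    fake_chunk(reasoning_content="0.15 * 240 ", content=None),
    fake_chunk(reasoning_content="= 36", content=None),
    fake_chunk(content="The answer is 36."),
]
reasoning, answer = collect_stream(demo)
print(reasoning)
print(answer)
```

In real use, pass the `response` object returned by `client.chat.completions.create(..., stream=True)` in place of `demo`.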

4.2.2 Tool Calling

DeepSeek-V3 supports tool calling. Enable the tool call parser at deployment time:
Command
python -m sglang.launch_server \
  --model deepseek-ai/DeepSeek-V3 \
  --tool-call-parser deepseekv3 \
  --reasoning-parser deepseek-v3 \
  --chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja \
  --tp 8 \
  --host 0.0.0.0 \
  --port 30000
Quick Test (curl):
Command
curl "http://127.0.0.1:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "temperature": 0,
    "max_tokens": 100,
    "model": "deepseek-ai/DeepSeek-V3",
    "tools": [{"type": "function", "function": {"name": "query_weather", "description": "Get weather of a city", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}}],
    "messages": [{"role": "user", "content": "How'\''s the weather in Beijing today?"}]
  }'
Use a low temperature (e.g. 0) for more consistent tool call results. The --chat-template flag above provides an improved unified prompt for tool use.
Python Example (with Thinking Process):
Example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make request with streaming to see thinking process
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    extra_body = {"chat_template_kwargs": {"thinking": True}},
    temperature=0.7,
    stream=True
)

# Process streaming response
thinking_started = False
has_thinking = False
tool_calls_accumulator = {}

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Accumulate tool calls
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            # Close thinking section if needed
            if has_thinking and thinking_started:
                print("\n=============== Content =================\n", flush=True)
                thinking_started = False

            for tool_call in delta.tool_calls:
                index = tool_call.index
                if index not in tool_calls_accumulator:
                    tool_calls_accumulator[index] = {
                        'name': None,
                        'arguments': ''
                    }

                if tool_call.function:
                    if tool_call.function.name:
                        tool_calls_accumulator[index]['name'] = tool_call.function.name
                    if tool_call.function.arguments:
                        tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments

        # Print content
        if delta.content:
            print(delta.content, end="", flush=True)

# Print accumulated tool calls
for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"🔧 Tool Call: {tool_call['name']}")
    print(f"   Arguments: {tool_call['arguments']}")

print()
Output Example:
Output
🔧 Tool Call: get_weather
   Arguments: {"location": "Beijing", "unit": "celsius"}
Note:
  • The reasoning parser shows how the model decides to use a tool
  • Tool calls are clearly marked with the function name and arguments
  • You can then execute the function and send the result back to continue the conversation
Handling Tool Call Results: Append the code block below to the previous Python script.
Example
# After getting the tool call, execute the function
def get_weather(location, unit="celsius"):
    # Your actual weather API call here
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# Send tool result back to the model
messages = [
    {"role": "user", "content": "What's the weather in Beijing?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Beijing", "unit": "celsius"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": get_weather("Beijing", "celsius")
    }
]

final_response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=messages,
    temperature=0.7
)

print(final_response.choices[0].message.content)
# Output: "The weather in Beijing is currently 22°C and sunny."
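The example above hardcodes the tool call. A more general variant, sketched below, builds the follow-up messages directly from a dictionary shaped like `tool_calls_accumulator`. The names `TOOL_REGISTRY` and `build_tool_messages` and the `call_<index>` id scheme are illustrative assumptions; if the stream provides tool call ids, those should be used instead.

```python
import json


def get_weather(location, unit="celsius"):
    # Placeholder implementation, mirroring the example above
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."


TOOL_REGISTRY = {"get_weather": get_weather}  # hypothetical name -> callable map


def build_tool_messages(user_prompt, tool_calls_accumulator, registry=TOOL_REGISTRY):
    """Run each accumulated tool call and build the follow-up message list."""
    assistant_calls, tool_results = [], []
    for index, call in sorted(tool_calls_accumulator.items()):
        call_id = f"call_{index}"  # assumed id scheme for this sketch
        arguments = json.loads(call["arguments"])  # streamed arguments are a JSON string
        result = registry[call["name"]](**arguments)
        assistant_calls.append({
            "id": call_id,
            "type": "function",
            "function": {"name": call["name"], "arguments": call["arguments"]},
        })
        tool_results.append(
            {"role": "tool", "tool_call_id": call_id, "content": result}
        )
    return [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": None, "tool_calls": assistant_calls},
        *tool_results,
    ]


accumulated = {
    0: {"name": "get_weather",
        "arguments": '{"location": "Beijing", "unit": "celsius"}'}
}
messages = build_tool_messages("What's the weather in Beijing?", accumulated)
print(messages[-1]["content"])
```

The resulting `messages` list can be passed to `client.chat.completions.create` in the same way as the handwritten list above.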

4.2.3 Multi-Token Prediction (EAGLE Speculative Decoding)

SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) based on EAGLE speculative decoding. With this optimization, decoding speed improves by up to 1.8× at batch size 1 and 1.5× at batch size 32 on H200 TP8. Enable with:
Command
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3-0324 \
  --speculative-algorithm EAGLE \
  --trust-remote-code \
  --tp 8
The default configuration is --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4. Find the best values for your workload with bench_speculative.py. The minimum viable config is --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2.
For large batch sizes (>48), increase --max-running-requests beyond the default of 48 for MTP. Also set --cuda-graph-bs to include your target batch sizes (default captured sizes for speculative decoding: 48).
The spec-v2 overlap scheduler is enabled by default (SGLANG_ENABLE_SPEC_V2=True). It improves performance by overlapping draft and verification stages. Set SGLANG_ENABLE_SPEC_V2=0 to disable.
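The speculative-decoding flags above can be generated programmatically when sweeping configurations. The sketch below is hypothetical (`speculative_flags` and `batch_flags` are not part of SGLang). It assumes that with `--speculative-eagle-topk 1` the number of draft tokens equals the number of steps plus one, which holds for both configurations quoted above but is not stated as a general rule in this guide.

```python
def speculative_flags(num_steps=3, eagle_topk=1, num_draft_tokens=None):
    """Return EAGLE speculative-decoding flags for sglang.launch_server."""
    if num_steps < 1 or eagle_topk < 1:
        raise ValueError("num_steps and eagle_topk must be at least 1")
    if num_draft_tokens is None:
        # Assumption: topk=1 drafts a single chain, so draft tokens = steps + 1.
        if eagle_topk != 1:
            raise ValueError("pass num_draft_tokens explicitly when eagle_topk > 1")
        num_draft_tokens = num_steps + 1
    return [
        "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", str(num_steps),
        "--speculative-eagle-topk", str(eagle_topk),
        "--speculative-num-draft-tokens", str(num_draft_tokens),
    ]


def batch_flags(batch_sizes, default_limit=48):
    """Extra flags when any target batch size exceeds the MTP default of 48."""
    if max(batch_sizes) <= default_limit:
        return []
    return [
        "--max-running-requests", str(max(batch_sizes)),
        "--cuda-graph-bs", *[str(size) for size in sorted(batch_sizes)],
    ]


# Minimum viable config, plus flags for target batch sizes up to 64.
print(" ".join(speculative_flags(num_steps=1) + batch_flags([16, 32, 64])))
```

Which batch sizes to list for `--cuda-graph-bs` depends on the workload; the values here are placeholders.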

4.2.4 MLA Optimizations

DeepSeek V3 uses Multi-head Latent Attention (MLA), an attention mechanism that improves inference efficiency. SGLang implements several optimizations:
  • Weight Absorption: Reorders matrix multiplications to improve decoding phase efficiency.
  • MLA Attention Backends: FA3, Flashinfer, FlashMLA, CutlassMLA, TRTLLM MLA (Blackwell), and Triton. FA3 is the default.
  • FP8 Quantization: W8A8 FP8 and KV Cache FP8, with BMM operators for weight-absorbed MLA in FP8.
  • CUDA Graph & Torch.compile: Both MLA and MoE support CUDA Graph and Torch.compile for reduced decoding latency.
  • Chunked Prefix Cache: Increases throughput for long-sequence chunked prefill (FlashAttention3 backend only).
Overall, these optimizations improve output throughput relative to the baseline. Reference: see the SGLang v0.3 blog and slides for details.
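The weight absorption item in the list above rests on the associativity of matrix products. As a simplified single-head sketch that ignores the rotary-position component, suppose the query and key are up-projected from latent vectors, with $c_q$ the query latent and $c_j$ the cached latent of token $j$. The symbols $W^{UQ}$ and $W^{UK}$ are borrowed from the usual MLA notation and are assumptions here, not definitions from this guide.

```latex
\[
q^{\top} k_j
  = \left(W^{UQ} c_q\right)^{\top} \left(W^{UK} c_j\right)
  = \left(\left(W^{UK}\right)^{\top} W^{UQ} c_q\right)^{\top} c_j
\]
```

Read right to left, the key up-projection is applied once on the query side, so the attention score can be computed against the cached latent $c_j$ directly instead of materializing $k_j$ for every cached token during decoding.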

4.2.5 Multi-Node Deployment

For multi-node serving, hardware-specific examples, and write-ups on large-scale deployment, refer to the SGLang documentation and blog.

5. Benchmark

5.1 Speed Benchmark

Test Environment:
  • Hardware: AMD MI300X GPU (8x)
  • Model: DeepSeek-V3
  • Tensor Parallelism: 8
  • sglang version: 0.5.7
We use SGLang’s built-in benchmarking tool to conduct performance evaluation on the ShareGPT_Vicuna_unfiltered dataset. This dataset contains real conversation data and can better reflect performance in actual use scenarios. To simulate real-world usage patterns, we configure each request with 1024 input tokens and 1024 output tokens, representing typical medium-length conversations with detailed responses.

5.1.1 Latency-Sensitive Benchmark

  • Model Deployment Command:
Command
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --dp 8 \
  --enable-dp-attention \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --host 0.0.0.0 \
  --port 8000
  • Benchmark Command:
Command
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 8000 \
  --model deepseek-ai/DeepSeek-V3 \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 10 \
  --max-concurrency 1
  • Test Results:
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  81.27
Total input tokens:                      1972
Total input text tokens:                 1972
Total input vision tokens:               0
Total generated tokens:                  2784
Total generated tokens (retokenized):    2774
Request throughput (req/s):              0.12
Input token throughput (tok/s):          24.27
Output token throughput (tok/s):         34.26
Peak output token throughput (tok/s):    65.00
Peak concurrent requests:                2
Total token throughput (tok/s):          58.52
Concurrency:                             1.00
Accept length:                           2.61
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   8123.17
Median E2E Latency (ms):                 7982.65
---------------Time to First Token----------------
Mean TTFT (ms):                          1080.76
Median TTFT (ms):                        1248.82
P99 TTFT (ms):                           1896.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          25.04
Median TPOT (ms):                        24.76
P99 TPOT (ms):                           32.09
---------------Inter-Token Latency----------------
Mean ITL (ms):                           25.41
Median ITL (ms):                         20.14
P95 ITL (ms):                            60.28
P99 ITL (ms):                            60.99
Max ITL (ms):                            61.49
==================================================
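The throughput lines in the result above are derived from the raw token counts and the benchmark duration. The short check below recomputes them from the figures copied out of the report; small differences in the last digit can come from the rounded duration.

```python
# Raw counts copied from the latency-sensitive result above
duration_s = 81.27
num_requests = 10
input_tokens = 1972
output_tokens = 2784

request_throughput = num_requests / duration_s
output_throughput = output_tokens / duration_s
total_throughput = (input_tokens + output_tokens) / duration_s

print(f"Request throughput: {request_throughput:.2f} req/s")
print(f"Output throughput:  {output_throughput:.2f} tok/s")
print(f"Total throughput:   {total_throughput:.2f} tok/s")

# Average request size actually served during the run
mean_input = input_tokens / num_requests
mean_output = output_tokens / num_requests
print(f"Mean input tokens per request:  {mean_input:.1f}")
print(f"Mean output tokens per request: {mean_output:.1f}")
```

The mean request in this run is roughly 197 input and 278 output tokens, well below the 1024-token lengths passed on the command line. One plausible explanation, stated here as an assumption, is that `--random-input-len` and `--random-output-len` only take effect when the random dataset is selected.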

5.1.2 Throughput-Sensitive Benchmark

  • Model Deployment Command:
Command
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --ep 8 \
  --dp 8 \
  --enable-dp-attention \
  --host 0.0.0.0 \
  --port 8000
  • Benchmark Command:
Command
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 8000 \
  --model deepseek-ai/DeepSeek-V3 \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 1000 \
  --max-concurrency 100
  • Test Results:
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  406.16
Total input tokens:                      301701
Total input text tokens:                 301701
Total input vision tokens:               0
Total generated tokens:                  188375
Total generated tokens (retokenized):    187542
Request throughput (req/s):              2.46
Input token throughput (tok/s):          742.81
Output token throughput (tok/s):         463.80
Peak output token throughput (tok/s):    1299.00
Peak concurrent requests:                109
Total token throughput (tok/s):          1206.61
Concurrency:                             87.53
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   35552.98
Median E2E Latency (ms):                 21466.07
---------------Time to First Token----------------
Mean TTFT (ms):                          1521.51
Median TTFT (ms):                        476.80
P99 TTFT (ms):                           8329.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          214.73
Median TPOT (ms):                        152.00
P99 TPOT (ms):                           1155.85
---------------Inter-Token Latency----------------
Mean ITL (ms):                           182.10
Median ITL (ms):                         79.18
P95 ITL (ms):                            398.60
P99 ITL (ms):                            1488.96
Max ITL (ms):                            43465.60
==================================================
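The "Concurrency" line in both benchmark reports can be cross-checked with Little's law: the average number of in-flight requests equals the request rate multiplied by the mean end-to-end latency. The helper name `implied_concurrency` is hypothetical; the inputs are copied from the two reports above.

```python
def implied_concurrency(num_requests, duration_s, mean_e2e_latency_ms):
    """Little's law: in-flight requests = request rate x time in system."""
    request_rate = num_requests / duration_s  # requests per second
    return request_rate * (mean_e2e_latency_ms / 1000.0)


# Latency-sensitive run: 10 requests, 81.27 s, mean E2E 8123.17 ms
latency_run = implied_concurrency(10, 81.27, 8123.17)
# Throughput-sensitive run: 1000 requests, 406.16 s, mean E2E 35552.98 ms
throughput_run = implied_concurrency(1000, 406.16, 35552.98)

print(f"Latency-sensitive run:    {latency_run:.2f}")
print(f"Throughput-sensitive run: {throughput_run:.2f}")
```

Both values agree with the reported concurrency (1.00 and 87.53). In the throughput run this shows the server averaged about 88 in-flight requests against a cap of 100.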

5.2 Accuracy Benchmark

5.2.1 GSM8K Benchmark

  • Benchmark Command:
Command
python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 8000
  • Test Results:
    • DeepSeek-V3
      Output
      Accuracy: 0.960
      Invalid: 0.000
      Latency: 32.450 s
      Output throughput: 614.211 token/s
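Because this run uses only 200 questions, the reported accuracy carries some sampling noise. The sketch below estimates it with the standard binomial formula, under the simplifying assumption that questions are independent.

```python
import math

accuracy = 0.960     # reported above
num_questions = 200  # from --num-questions 200

# Number of correct answers implied by the reported accuracy
correct = round(accuracy * num_questions)
# Binomial standard error of the accuracy estimate
std_error = math.sqrt(accuracy * (1 - accuracy) / num_questions)

print(f"Correct answers: {correct} / {num_questions}")
print(f"Standard error:  {std_error:.3f}")
```

A standard error of about 0.014 means that differences of a point or so between runs of this size are within noise.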
      

5.2.2 MMLU Benchmark

  • Benchmark Command:
Command
cd sglang
bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 10 --port 8000
  • Test Results:
    • DeepSeek-V3
      Output
      subject: abstract_algebra, #q:100, acc: 0.800
      subject: anatomy, #q:135, acc: 0.874
      subject: astronomy, #q:152, acc: 0.928
      subject: business_ethics, #q:100, acc: 0.880
      subject: clinical_knowledge, #q:265, acc: 0.928
      subject: college_biology, #q:144, acc: 0.965
      subject: college_chemistry, #q:100, acc: 0.670
      subject: college_computer_science, #q:100, acc: 0.840
      subject: college_mathematics, #q:100, acc: 0.800
      subject: college_medicine, #q:173, acc: 0.861
      Total latency: 58.339
      Average accuracy: 0.871
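The reported average can be reproduced from the per-subject lines. The check below, using values copied from the output above, suggests that the average is weighted by the number of questions per subject rather than being a plain mean over subjects.

```python
# (subject, number of questions, accuracy) copied from the output above
results = [
    ("abstract_algebra", 100, 0.800),
    ("anatomy", 135, 0.874),
    ("astronomy", 152, 0.928),
    ("business_ethics", 100, 0.880),
    ("clinical_knowledge", 265, 0.928),
    ("college_biology", 144, 0.965),
    ("college_chemistry", 100, 0.670),
    ("college_computer_science", 100, 0.840),
    ("college_mathematics", 100, 0.800),
    ("college_medicine", 173, 0.861),
]

total_questions = sum(n for _, n, _ in results)
# Weight each subject by its question count
weighted = sum(n * acc for _, n, acc in results) / total_questions
# Plain mean over the ten subjects, for comparison
unweighted = sum(acc for _, _, acc in results) / len(results)

print(f"Questions: {total_questions}")
print(f"Question-weighted accuracy: {weighted:.3f}")
print(f"Unweighted mean accuracy:   {unweighted:.3f}")
```

The question-weighted value matches the reported 0.871, while the unweighted mean is lower because the weakest subject (college_chemistry) has fewer questions than the strongest ones.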