1. Model Introduction
DeepSeek V3 is a large-scale Mixture-of-Experts (MoE) language model developed by DeepSeek, designed to deliver strong general-purpose reasoning, coding, and tool-augmented capabilities with high training and inference efficiency. As the latest generation in the DeepSeek model family, DeepSeek V3 introduces systematic architectural and training innovations that significantly improve performance across reasoning, mathematics, coding, and long-context understanding, while maintaining a competitive compute cost.
Key highlights include:
- Efficient MoE architecture: DeepSeek V3 adopts a fine-grained Mixture-of-Experts design with a large number of experts and sparse activation, enabling high model capacity while keeping inference and training costs manageable.
- Advanced reasoning and coding: The model demonstrates strong performance on mathematical reasoning, logical inference, and real-world coding benchmarks, benefiting from improved data curation and training strategies.
- Long-context capability: DeepSeek V3 supports extended context lengths, allowing it to handle long documents, complex multi-step reasoning, and agent-style workflows more effectively.
- Tool use and function calling: The model is trained to support structured outputs and tool invocation, enabling seamless integration with external tools and agent frameworks during inference.
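The sparse-activation idea behind the MoE bullet above can be illustrated with a toy router. This is a minimal sketch under stated assumptions, not DeepSeek V3's actual gating: the expert count (8), `k=2`, the softmax over the selected scores, and the names `top_k_route` and `moe_forward` are all illustrative choices.

```python
import math

def top_k_route(scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected scores only, so they sum to 1.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i] - scores[top[0]]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_scores, k=2):
    """Combine the outputs of only the k selected experts for one token."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_scores, k))

# Toy setup: 8 "experts", each just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_scores = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

selected = top_k_route(router_scores, k=2)
output = moe_forward(3.0, experts, router_scores, k=2)
print(selected, output)
```

Only two of the eight experts run for this token, which is why total capacity can grow with the number of experts while per-token compute stays roughly fixed.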
2. SGLang Installation
SGLang offers multiple installation methods; choose the one that best fits your hardware platform and requirements.
Refer to the official SGLang installation guide for instructions.
3. Model Deployment
This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels.
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities.
3.2 Configuration Tips
Recommended GPU configurations by weight type:
| Weight Type | Supported Hardware |
|---|---|
| FP8 (recommended) | 8× H200, 8× B200, 8× MI300X, 2×8× H100/H800/H20 |
| BF16 (upcast from FP8) | 2×8× H200, 2×8× MI300X, 4×8× H100/H800, 4×8× A100/A800 |
| INT8 | 16× A100/A800, 32× L40S, Xeon 6980P CPU, 4× Atlas 800I A3 |
| W4A8 / AWQ / MXFP4 / NVFP4 | 8× H20/H100, 4× H200; 8× H100/A100; 8/4× MI355X/MI350X; 8/4× B200 |
The official DeepSeek-V3 checkpoint is already in FP8 format — do not add --quantization fp8 when serving it.
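The node counts in the table follow roughly from weight memory alone. The sketch below computes a lower bound; the 671B total parameter count and the per-GPU memory sizes are assumptions not stated in this guide, and the helper names are illustrative.

```python
import math

PARAMS_BILLION = 671  # assumed total parameter count, not stated in this guide
GPU_MEMORY_GB = {"H100": 80, "H200": 141}  # assumed per-GPU memory in GB
BYTES_PER_PARAM = {"FP8": 1, "BF16": 2}

def weight_memory_gb(weight_type):
    """Approximate weight footprint in GB (billions of params x bytes per param)."""
    return PARAMS_BILLION * BYTES_PER_PARAM[weight_type]

def min_nodes(weight_type, gpu, gpus_per_node=8):
    """Smallest number of nodes whose combined GPU memory holds the weights.

    Ignores KV cache, activations, and runtime overhead, so real deployments
    need headroom beyond this lower bound.
    """
    node_memory = GPU_MEMORY_GB[gpu] * gpus_per_node
    return math.ceil(weight_memory_gb(weight_type) / node_memory)

for weight_type, gpu in [("FP8", "H200"), ("FP8", "H100"), ("BF16", "H200")]:
    print(weight_type, gpu, min_nodes(weight_type, gpu))
```

Under these assumptions FP8 fits on one 8× H200 node but needs two 8× H100 nodes, and BF16 needs two 8× H200 nodes, matching the table. For BF16 on H100 the bound is 3 nodes while the table lists 4×8, which this estimate does not explain on its own.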
DeepGEMM precompilation (NVIDIA Hopper / Blackwell): Precompile the GEMM kernels before the first server run to avoid JIT compilation overhead while serving. Precompilation takes about 10 minutes:
python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
DeepGEMM is enabled by default on Hopper/Blackwell and can be disabled with SGLANG_ENABLE_JIT_DEEPGEMM=0.
Data Parallelism Attention (--enable-dp-attention): Recommended for high-throughput scenarios with large batch sizes. Reduces KV-cache duplication across TP ranks. Use --enable-dp-attention --tp 8 --dp 8 on a single 8-GPU node. Not recommended for low-latency, small-batch workloads.
NCCL timeout: If model loading is slow and you hit an NCCL timeout, increase it: --dist-timeout 3600.
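The tips above can be combined into one launch command. The helper below is a sketch that only assembles flags already named in this guide; the function name `build_launch_command` and its parameters are hypothetical conveniences, not part of SGLang.

```python
import shlex

def build_launch_command(model="deepseek-ai/DeepSeek-V3", tp=8,
                         dp_attention=False, dist_timeout=None):
    """Assemble an sglang.launch_server command from the flags discussed above."""
    args = ["python3", "-m", "sglang.launch_server",
            "--model-path", model, "--tp", str(tp), "--trust-remote-code"]
    if dp_attention:
        # Single 8-GPU node recommendation from the tips: --dp equals --tp.
        args += ["--enable-dp-attention", "--dp", str(tp)]
    if dist_timeout is not None:
        args += ["--dist-timeout", str(dist_timeout)]
    # Deliberately no --quantization flag: the official checkpoint is already FP8.
    return args

command = build_launch_command(dp_attention=True, dist_timeout=3600)
print(shlex.join(command))
```

For low-latency, small-batch serving, call it with `dp_attention=False`, in line with the recommendation above.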
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:
4.2 Advanced Usage
4.2.1 Reasoning Parser
DeepSeek-V3 supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
python -m sglang.launch_server \
--model deepseek-ai/DeepSeek-V3 \
--reasoning-parser deepseek-v3 \
--tp 8
Streaming with Thinking Process:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)
# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.7,
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True
)
# Process the stream
has_thinking = False
has_answer = False
thinking_started = False
for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta
        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)
        # Print answer content
        if delta.content:
            # Close thinking section and add content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)
print()
Output Example:
=============== Thinking =================
To determine 15% of a number, follow these steps:
**Step 1: Understand the Problem**
You need to find 15% of a given number. Let's assume the number is 240 for this example.
**Step 2: Convert the Percentage to a Decimal**
To work with percentages in calculations, convert the percentage to its decimal form. To do this, divide the percentage by 100.
\[ 15\% = \frac{15}{100} = 0.15 \]
**Step 3: Multiply the Decimal by the Number**
Now, multiply the decimal form of the percentage by the number you want to find the percentage of.
\[ 0.15 \times 240 \]
**Step 4: Perform the Multiplication**
Calculate the product:
\[ 0.15 \times 240 = 36 \]
**Step 5: Conclusion**
Therefore, 15% of 240 is:
\boxed{36}
=============== Content =================
The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36.
Note: The reasoning parser captures the model’s step-by-step thinking process, allowing you to see how the model arrives at its conclusions.
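When the thinking and answer text are needed as values rather than printed output, the same stream can be accumulated into two strings. The sketch below assumes chunks shaped like the streamed deltas in the example above; `collect_stream` and `fake_chunk` are hypothetical helpers, and `fake_chunk` exists only so the logic can be tried without a running server.

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Accumulate streamed deltas into (thinking, answer) strings."""
    thinking, answer = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            thinking.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer.append(content)
    return "".join(thinking), "".join(answer)

def fake_chunk(reasoning=None, content=None):
    """Hypothetical stand-in for a streamed chunk, for offline testing only."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

chunks = [
    fake_chunk(reasoning="15% = 0.15; "),
    fake_chunk(reasoning="0.15 * 240 = 36"),
    fake_chunk(content="The answer "),
    fake_chunk(content="is 36."),
]
thinking, answer = collect_stream(chunks)
print(thinking)
print(answer)
```

With a live server, pass the `response` iterator from the streaming request directly to `collect_stream`.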
4.2.2 Tool Calling
DeepSeek-V3 supports tool calling. Enable the tool call parser:
Deployment Command:
python -m sglang.launch_server \
--model deepseek-ai/DeepSeek-V3 \
--tool-call-parser deepseekv3 \
--reasoning-parser deepseek-v3 \
--chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja \
--tp 8 \
--host 0.0.0.0 \
--port 30000
Quick Test (curl):
curl "http://127.0.0.1:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"temperature": 0,
"max_tokens": 100,
"model": "deepseek-ai/DeepSeek-V3",
"tools": [{"type": "function", "function": {"name": "query_weather", "description": "Get weather of a city", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}}],
"messages": [{"role": "user", "content": "How'\''s the weather in Beijing today?"}]
}'
Use a low temperature (e.g. 0) for more consistent tool call results. The --chat-template flag above provides an improved unified prompt for tool use.
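A non-streaming response such as the one returned by the curl request can be unpacked with a few lines of Python. This is a sketch that assumes the server replies in the OpenAI-compatible chat completion shape; `extract_tool_calls` is a hypothetical helper and `sample_response` is hand-written for illustration, not captured server output.

```python
import json

def extract_tool_calls(response):
    """Return [(name, arguments_dict), ...] from a chat completion response dict."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string and must be decoded.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls

# Hand-written sample in the assumed response shape (illustrative only).
sample_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "query_weather",
                             "arguments": '{"city": "Beijing"}'},
            }],
        },
        "finish_reason": "tool_calls",
    }]
}

calls = extract_tool_calls(sample_response)
print(calls)
```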
Python Example (with Thinking Process):
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)
# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]
# Make request with streaming to see thinking process
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    temperature=0.7,
    stream=True
)
# Process streaming response
thinking_started = False
has_thinking = False
tool_calls_accumulator = {}
for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta
        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)
        # Accumulate tool calls
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            # Close thinking section if needed
            if has_thinking and thinking_started:
                print("\n=============== Content =================\n", flush=True)
                thinking_started = False
            for tool_call in delta.tool_calls:
                index = tool_call.index
                if index not in tool_calls_accumulator:
                    tool_calls_accumulator[index] = {
                        'name': None,
                        'arguments': ''
                    }
                if tool_call.function:
                    if tool_call.function.name:
                        tool_calls_accumulator[index]['name'] = tool_call.function.name
                    if tool_call.function.arguments:
                        tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments
        # Print content
        if delta.content:
            print(delta.content, end="", flush=True)
# Print accumulated tool calls
for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"🔧 Tool Call: {tool_call['name']}")
    print(f" Arguments: {tool_call['arguments']}")
print()
Output Example:
🔧 Tool Call: get_weather
Arguments: {"location": "Beijing", "unit": "celsius"}
Note:
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
Handling Tool Call Results:
Append the code below to the previous Python script.
# After getting the tool call, execute the function
def get_weather(location, unit="celsius"):
    # Your actual weather API call here
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."
# Send tool result back to the model
messages = [
    {"role": "user", "content": "What's the weather in Beijing?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Beijing", "unit": "celsius"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": get_weather("Beijing", "celsius")
    }
]
final_response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=messages,
    temperature=0.7
)
print(final_response.choices[0].message.content)
# Example output (exact wording may vary): "The weather in Beijing is currently 22°C and sunny."
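The example above hard-codes the tool call and its result. A small dispatcher can instead build the follow-up messages from the accumulated streamed calls. This is a sketch under assumptions: `TOOL_REGISTRY` and `build_followup_messages` are hypothetical names, and the `call_<index>` ids are illustrative placeholders because the accumulator shown earlier does not store the ids sent by the server.

```python
import json

def get_weather(location, unit="celsius"):
    """Placeholder tool with the same behavior as the example above."""
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# Hypothetical registry mapping tool names to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}

def build_followup_messages(accumulator, registry=TOOL_REGISTRY):
    """Turn accumulated tool calls into one assistant message plus tool results."""
    assistant_calls, tool_messages = [], []
    for index, call in sorted(accumulator.items()):
        call_id = f"call_{index}"  # illustrative id, see the note above
        assistant_calls.append({
            "id": call_id,
            "type": "function",
            "function": {"name": call["name"], "arguments": call["arguments"]},
        })
        func = registry.get(call["name"])
        if func is None:
            result = f"Error: unknown tool {call['name']}"
        else:
            result = func(**json.loads(call["arguments"]))
        tool_messages.append(
            {"role": "tool", "tool_call_id": call_id, "content": result}
        )
    assistant = {"role": "assistant", "content": None, "tool_calls": assistant_calls}
    return [assistant] + tool_messages

accumulator = {0: {"name": "get_weather",
                   "arguments": '{"location": "Beijing", "unit": "celsius"}'}}
followup = build_followup_messages(accumulator)
print(followup[1]["content"])
```

Append `followup` to the original user message and send the combined list as `messages` in the next request.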
4.2.3 Multi-Token Prediction (EAGLE Speculative Decoding)
SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) based on EAGLE speculative decoding. With this optimization, decoding speed improves by up to 1.8× at batch size 1 and 1.5× at batch size 32 on H200 TP8.
Enable with:
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--speculative-algorithm EAGLE \
--trust-remote-code \
--tp 8
The default configuration is --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4. Find the best values for your workload with bench_speculative.py. The minimum viable config is --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2.
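The draft-then-verify idea behind these settings can be sketched with a toy greedy verifier. This is not SGLang's EAGLE implementation: the function names and the toy target model are assumptions, and a real system verifies all draft tokens in a single forward pass rather than one call per token.

```python
def verify_draft(draft_tokens, target_next_token, context):
    """Greedy verification: accept the longest draft prefix the target agrees with.

    target_next_token(tokens) returns the token the target model would emit next.
    The result always ends with one token chosen by the target itself, so each
    step yields at least one token.
    """
    accepted = []
    for token in draft_tokens:
        expected = target_next_token(context + accepted)
        if token != expected:
            accepted.append(expected)  # the target's correction ends the step
            return accepted
        accepted.append(token)
    # Every draft token matched, so the target contributes one bonus token.
    accepted.append(target_next_token(context + accepted))
    return accepted

def toy_target(tokens):
    """Toy target model: always continues the sequence with last token + 1."""
    return tokens[-1] + 1

partial = verify_draft([3, 4, 9], toy_target, context=[1, 2])
full = verify_draft([3, 4, 5], toy_target, context=[1, 2])
print(partial, full)
```

With three draft tokens a step yields between one and four tokens, which is consistent with the 3-step, 4-draft-token default, although that mapping onto the SGLang flags is an assumption here.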
For large batch sizes (>48), increase --max-running-requests beyond the default of 48 for MTP. Also set --cuda-graph-bs to include your target batch sizes (default captured sizes for speculative decoding: 48).
The spec-v2 overlap scheduler is enabled by default (SGLANG_ENABLE_SPEC_V2=True). It improves performance by overlapping draft and verification stages. Set SGLANG_ENABLE_SPEC_V2=0 to disable.
4.2.4 MLA Optimizations
DeepSeek V3 uses Multi-head Latent Attention (MLA), an attention mechanism that improves inference efficiency. SGLang implements several optimizations:
- Weight Absorption: Reorders matrix multiplications to improve decoding phase efficiency.
- MLA Attention Backends: FA3, FlashInfer, FlashMLA, CutlassMLA, TRTLLM MLA (Blackwell), and Triton. FA3 is the default.
- FP8 Quantization: W8A8 FP8 and KV Cache FP8, with BMM operators for weight-absorbed MLA in FP8.
- CUDA Graph & torch.compile: Both MLA and MoE support CUDA Graph and torch.compile for reduced decoding latency.
- Chunked Prefix Cache: Increases throughput for long-sequence chunked prefill (FlashAttention3 backend only).
Overall, these optimizations achieve up to 7× output throughput improvement vs. the baseline.
Reference: See SGLang v0.3 blog and Slides for details.
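The weight absorption item relies on matrix multiplication being associative: the key up-projection can be folded into the query once instead of being applied to every cached latent. The toy example below illustrates only that identity; the shapes, values, and variable names are assumptions and do not reflect SGLang's kernels.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(row) for row in zip(*m)]

# Toy shapes: head_dim = 2, latent_dim = 3, 2 cached tokens. Integers keep it exact.
q = [[1, 2]]                            # one query, shape (1, head_dim)
w_uk = [[1, 0, 2], [0, 1, 1]]           # latent -> key projection, (head_dim, latent_dim)
latent_cache = [[1, 1, 0], [2, 0, 1]]   # compressed KV cache, (tokens, latent_dim)

# Naive order: expand every cached latent into a full key, then score.
keys = matmul(latent_cache, transpose(w_uk))    # (tokens, head_dim)
scores_naive = matmul(q, transpose(keys))       # (1, tokens)

# Absorbed order: fold w_uk into the query once, then score against latents.
q_absorbed = matmul(q, w_uk)                    # (1, latent_dim)
scores_absorbed = matmul(q_absorbed, transpose(latent_cache))

print(scores_naive, scores_absorbed)
```

Both orders give the same attention scores, but the absorbed order never materializes the per-token keys, which matters during decoding when the cache holds many tokens.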
4.2.5 Multi-Node Deployment
For multi-node serving and hardware-specific examples:
Blog references for large-scale deployment:
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: AMD MI300X GPU (8x)
- Model: DeepSeek-V3
- Tensor Parallelism: 8
- sglang version: 0.5.7
We use SGLang's built-in benchmarking tool, sglang.bench_serving, to evaluate performance on the ShareGPT_Vicuna_unfiltered dataset, whose real conversation data better reflects actual usage. Each request is configured with 1024 input tokens and 1024 output tokens, representing typical medium-length conversations with detailed responses.
5.1.1 Latency-Sensitive Benchmark
- Model Deployment Command:
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3 \
--tp 8 \
--dp 8 \
--enable-dp-attention \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--host 0.0.0.0 \
--port 8000
- Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 8000 \
--model deepseek-ai/DeepSeek-V3 \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 10 \
--max-concurrency 1
- Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 81.27
Total input tokens: 1972
Total input text tokens: 1972
Total input vision tokens: 0
Total generated tokens: 2784
Total generated tokens (retokenized): 2774
Request throughput (req/s): 0.12
Input token throughput (tok/s): 24.27
Output token throughput (tok/s): 34.26
Peak output token throughput (tok/s): 65.00
Peak concurrent requests: 2
Total token throughput (tok/s): 58.52
Concurrency: 1.00
Accept length: 2.61
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 8123.17
Median E2E Latency (ms): 7982.65
---------------Time to First Token----------------
Mean TTFT (ms): 1080.76
Median TTFT (ms): 1248.82
P99 TTFT (ms): 1896.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 25.04
Median TPOT (ms): 24.76
P99 TPOT (ms): 32.09
---------------Inter-Token Latency----------------
Mean ITL (ms): 25.41
Median ITL (ms): 20.14
P95 ITL (ms): 60.28
P99 ITL (ms): 60.99
Max ITL (ms): 61.49
==================================================
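The throughput rows in this report can be recomputed from the token totals and the benchmark duration. The short check below uses only the numbers printed above; because the reported duration is itself rounded, the recomputed values agree to within about 0.01.

```python
# Values copied from the latency-sensitive benchmark report above.
duration_s = 81.27
successful_requests = 10
total_input_tokens = 1972
total_generated_tokens = 2784

request_throughput = successful_requests / duration_s
input_throughput = total_input_tokens / duration_s
output_throughput = total_generated_tokens / duration_s
total_throughput = (total_input_tokens + total_generated_tokens) / duration_s

print(f"req/s: {request_throughput:.2f}")
print(f"input tok/s: {input_throughput:.2f}")
print(f"output tok/s: {output_throughput:.2f}")
print(f"total tok/s: {total_throughput:.2f}")
```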
5.1.2 Throughput-Sensitive Benchmark
- Model Deployment Command:
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3 \
--tp 8 \
--ep 8 \
--dp 8 \
--enable-dp-attention \
--host 0.0.0.0 \
--port 8000
- Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 8000 \
--model deepseek-ai/DeepSeek-V3 \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 1000 \
--max-concurrency 100
- Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 406.16
Total input tokens: 301701
Total input text tokens: 301701
Total input vision tokens: 0
Total generated tokens: 188375
Total generated tokens (retokenized): 187542
Request throughput (req/s): 2.46
Input token throughput (tok/s): 742.81
Output token throughput (tok/s): 463.80
Peak output token throughput (tok/s): 1299.00
Peak concurrent requests: 109
Total token throughput (tok/s): 1206.61
Concurrency: 87.53
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 35552.98
Median E2E Latency (ms): 21466.07
---------------Time to First Token----------------
Mean TTFT (ms): 1521.51
Median TTFT (ms): 476.80
P99 TTFT (ms): 8329.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 214.73
Median TPOT (ms): 152.00
P99 TPOT (ms): 1155.85
---------------Inter-Token Latency----------------
Mean ITL (ms): 182.10
Median ITL (ms): 79.18
P95 ITL (ms): 398.60
P99 ITL (ms): 1488.96
Max ITL (ms): 43465.60
==================================================
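The Concurrency row in both reports is consistent with Little's law: average requests in flight equal request throughput times mean end-to-end latency. The check below uses only figures from the two reports above; the function name is illustrative.

```python
def littles_law_concurrency(successful_requests, duration_s, mean_e2e_latency_ms):
    """Average in-flight requests = request rate x mean time in the system."""
    throughput_rps = successful_requests / duration_s
    return throughput_rps * (mean_e2e_latency_ms / 1000.0)

# Inputs copied from the latency-sensitive and throughput-sensitive reports.
latency_run = littles_law_concurrency(10, 81.27, 8123.17)
throughput_run = littles_law_concurrency(1000, 406.16, 35552.98)

print(f"latency-sensitive run: {latency_run:.2f}")
print(f"throughput-sensitive run: {throughput_run:.2f}")
```

The second run averages about 87.5 requests in flight even though `--max-concurrency` is 100, because concurrency ramps up at the start and drains at the end of the benchmark.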
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 8000
- Test Results:
- DeepSeek-V3
Accuracy: 0.960
Invalid: 0.000
Latency: 32.450 s
Output throughput: 614.211 token/s
5.2.2 MMLU Benchmark
cd sglang
bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 10 --port 8000
- Test Results:
- DeepSeek-V3
subject: abstract_algebra, #q:100, acc: 0.800
subject: anatomy, #q:135, acc: 0.874
subject: astronomy, #q:152, acc: 0.928
subject: business_ethics, #q:100, acc: 0.880
subject: clinical_knowledge, #q:265, acc: 0.928
subject: college_biology, #q:144, acc: 0.965
subject: college_chemistry, #q:100, acc: 0.670
subject: college_computer_science, #q:100, acc: 0.840
subject: college_mathematics, #q:100, acc: 0.800
subject: college_medicine, #q:173, acc: 0.861
Total latency: 58.339
Average accuracy: 0.871
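The reported average accuracy weights each subject by its number of questions rather than averaging the per-subject scores. The check below recomputes both from the rows above; since the per-subject accuracies are rounded to three decimals, the result is approximate.

```python
# (subject, number of questions, accuracy) copied from the results above.
results = [
    ("abstract_algebra", 100, 0.800),
    ("anatomy", 135, 0.874),
    ("astronomy", 152, 0.928),
    ("business_ethics", 100, 0.880),
    ("clinical_knowledge", 265, 0.928),
    ("college_biology", 144, 0.965),
    ("college_chemistry", 100, 0.670),
    ("college_computer_science", 100, 0.840),
    ("college_mathematics", 100, 0.800),
    ("college_medicine", 173, 0.861),
]

total_questions = sum(n for _, n, _ in results)
# Question-weighted average: every question counts equally.
weighted_average = sum(n * acc for _, n, acc in results) / total_questions
# Unweighted average: every subject counts equally.
unweighted_average = sum(acc for _, _, acc in results) / len(results)

print(total_questions, round(weighted_average, 3), round(unweighted_average, 3))
```

The question-weighted value matches the reported 0.871, while the unweighted mean of the ten subjects is lower because the lowest-scoring subjects have fewer questions than the highest-scoring ones.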