DeepSeek-V4 - SGLang Documentation

1. Model Introduction

DeepSeek-V4 is the next-generation Mixture-of-Experts model from DeepSeek, released 2026-04-24 under an MIT License. It ships as two Instruct repos (one per variant) plus matching Base repos:

Variant	Total params	Active (MoE)	Use
DeepSeek-V4-Flash	284B	13B	single-node serving: B200 / GB200 / GB300 / H200 on 4 GPUs
DeepSeek-V4-Pro	1.6T	49B	high-capacity: B200 8 GPU / GB200 8 GPU (2 nodes) / GB300 4 GPU / H200 8 GPU (FP4) or 16 GPU (SGLang FP8)

The Instruct repos ship FP4 MoE experts with FP8 attention and dense layers (one mixed-precision checkpoint covers all GPUs that support FP4). The Base (pre-trained only) variants, DeepSeek-V4-Flash-Base and DeepSeek-V4-Pro-Base, ship as pure FP8 mixed-precision checkpoints and are not intended for chat or tool calling.

Key Features (per the official model card):

Hybrid Attention Architecture — combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) for long-context efficiency. At 1M-token context, DeepSeek-V4-Pro uses only ~27% of per-token inference FLOPs and ~10% of KV cache compared with DeepSeek-V3.2.
Manifold-Constrained Hyper-Connections (mHC) — strengthens residual connections, improving signal-propagation stability across layers while preserving expressivity.
Muon optimizer — faster convergence and greater training stability.
Context length: 1M tokens; pre-trained on 32T+ diverse, high-quality tokens.
Three reasoning modes: Non-think (fast, intuitive responses), Think High (conscious logical analysis, slower but more accurate), and Think Max (pushes reasoning to its fullest extent). A context window of at least 384K tokens is recommended when running Think Max.
Ships with a dedicated encoding_dsv4.encode_messages Python encoder + DSML tool-call grammar (<｜DSML｜tool_calls> / <｜DSML｜invoke> / <｜DSML｜parameter>).

Recommended Generation Parameters: temperature=1.0, top_p=1.0 (per the official model card).

License: MIT.

Resources:

HuggingFace: DeepSeek-V4-Flash, DeepSeek-V4-Pro
ModelScope: DeepSeek-V4-Flash, DeepSeek-V4-Pro

2. SGLang Installation

SGLang offers multiple installation methods; choose one based on your hardware platform and follow the official SGLang installation guide.

Docker Image: Use lmsysorg/sglang:latest for all supported hardware platforms (B300 / B200 / GB200 / GB300 / H200 / H100).

Command

docker pull lmsysorg/sglang:latest

For how to actually launch the image, see Install → Method 3: Using Docker. A minimal example (substitute the inner sglang serve ... with whatever the command generator below produces):

Command

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<your-hf-token>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    sglang serve <use args below>
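Large checkpoints can take a while to load, so a client usually has to wait before the first request succeeds. The sketch below polls the OpenAI-compatible /v1/models route until it answers. It assumes the server listens on port 30000 as in the docker run example. The function names, attempt count, and delay are illustrative choices, not SGLang defaults.

```python
import time
import urllib.request


def server_ready(base_url: str, timeout_s: float = 2.0) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, refused connections, and timeouts
        return False


def wait_until_ready(base_url, probe=server_ready, attempts=60, delay_s=10.0,
                     sleep=time.sleep):
    """Poll `probe` until it succeeds; return the attempt number, or None."""
    for attempt in range(1, attempts + 1):
        if probe(base_url):
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    return None


# Typical use once the container has been started:
#   wait_until_ready("http://localhost:30000")
```

The probe and sleep functions are injectable so the retry logic can be exercised without a running server.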

3. Model Deployment

SGLang supports three main serving recipes for DeepSeek-V4 with different latency/throughput trade-offs (low-latency, balanced, max-throughput), plus specialized recipes for long-context (cp, prefill context-parallel) and prefill/decode disaggregation (pd-disagg). The interactive generator below emits the exact launch command for any (hardware, variant, recipe) combination.

3.1 Basic Configuration

Interactive Command Generator: Use the selector below to generate the deployment command for your hardware + recipe combination.

3.2 Configuration Tips

Concurrency & DeepEP dispatch buffer

The following must hold: max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK. Violating it overflows DeepEP’s dispatch buffer at steady-state load (deep_ep.cpp:1105). When tuning, change --cuda-graph-max-bs, --max-running-requests, and the environment variable together.

The generator currently picks conservative values (mirroring an internal stress-test matrix). They run safely out of the box but likely leave throughput on the table. Please tune them up toward your workload’s peak concurrency and report findings back so the defaults can be revised.

MTP (Multi-Token Prediction, EAGLE)

low-latency: steps=3, draft-tokens=4 → largest win at bs=1.
balanced: steps=1, draft-tokens=2 → gentler MTP, reduces throughput hit at higher batch.
max-throughput: MTP disabled — at saturation the verify step costs more than it saves.
MTP currently requires SGLANG_ENABLE_SPEC_V2=1.
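The dispatch-buffer inequality and the per-recipe draft-token counts above can be checked before launch with a few lines of arithmetic. This is a minimal sketch: the helper names are hypothetical, the 1024-token buffer in the example is an arbitrary illustration, and counting one token per request when MTP is disabled is an assumption rather than something the recipe states.

```python
# Draft-token counts per recipe, taken from the MTP list above.
# Treating "MTP disabled" as 1 token per request is an assumption.
DRAFT_TOKENS = {"low-latency": 4, "balanced": 2, "max-throughput": 1}


def check_deepep_budget(recipe, max_running_requests, dispatch_tokens_per_rank):
    """Return (fits, tokens_needed) for the DeepEP dispatch-buffer inequality."""
    needed = max_running_requests * DRAFT_TOKENS[recipe]
    return needed <= dispatch_tokens_per_rank, needed


def max_safe_running_requests(recipe, dispatch_tokens_per_rank):
    """Largest --max-running-requests that still satisfies the inequality."""
    return dispatch_tokens_per_rank // DRAFT_TOKENS[recipe]


# Hypothetical buffer of 1024 tokens per rank.
print(check_deepep_budget("low-latency", 256, 1024))   # (True, 1024)
print(check_deepep_budget("low-latency", 257, 1024))   # (False, 1028)
print(max_safe_running_requests("balanced", 1024))     # 512
```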

Hopper (H200) note

We provide two options for running DeepSeek-V4 models on Hopper devices (H200):

Original FP4 checkpoints: Two w4a16 MoE kernels are available for the original FP4 checkpoints: Marlin (--moe-runner-backend marlin) and FlashInfer (--moe-runner-backend flashinfer_mxfp4). This option supports Tensor Parallelism only. The complete Pro model can run on a single H200 node with this option.
Converted FP8 checkpoints: We also provide pre-converted FP8 checkpoints (sgl-project/DeepSeek-V4-Flash-FP8, sgl-project/DeepSeek-V4-Pro-FP8), which support more parallelism and features.

PD-Disagg recipes on H200 may require docker run --privileged --ulimit memlock=-1 (or --device /dev/infiniband:/dev/infiniband --cap-add IPC_LOCK) so mooncake can discover the IB HCAs. Without IB exposure, mooncake silently falls back to TCP, which can lead to garbled KV transfer on large checkpoints.

MegaMoE

MegaMoE fuses expert dispatch + GEMM into a single kernel for higher throughput on MoE layers. To enable it, use the MegaMoE toggle in the command generator above. The generator swaps --moe-a2a-backend deepep for --moe-a2a-backend megamoe and adds the relevant env vars automatically. Two variants are exposed:

W4A8 — default MegaMoE kernel (FP4 weights, FP8 activations).
W4A4 — adds SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS=1 and SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND=1 to run the custom W4A4 kernel (FP4 activations). Higher throughput with negligible accuracy drop (~89.5 GPQA on Pro).

Notes:

MegaMoE is not supported on Hopper (H100 / H200) nor on the low-latency / balanced / cp settings — it is only wired into the max-throughput recipe on Blackwell. When running MegaMoE, don’t set --moe-runner-backend manually.
Adjust SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK based on your workload and memory usage; a higher token count requires more HBM (recommended: 8320 for max-throughput).

GB300 PD-Disagg cross-pod MNNVL

On some GB300 clusters with cross-pod KV transfer over NVLink, mooncake may fail with nvlink_transport.cpp:497 Requested address ... not found!. If this happens, prepend MC_FORCE_MNNVL=1 NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1 to both the prefill and decode sglang serve commands.

4. Model Invocation

4.1 Basic Usage

For basic API usage and request examples, see:

Basic API Usage

Once the server is running (for example via the command generator above), send a request:

Command

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "What is 15% of 240?"}]
  }'

PD-Disagg note: if you deployed with the pd-disagg recipe from the generator above, the prefill server is on port 30000, the decode server on 30001, and the router on port 8000 — client traffic should target http://localhost:8000, not :30000.
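The port rule in the PD-Disagg note is easy to get wrong in client code, so it can help to centralize it. The sketch below picks the base URL from the recipe name and builds a request body with the recommended sampling parameters (temperature=1.0, top_p=1.0). The helper names are hypothetical, and the ports are the ones stated above.

```python
import json

# Ports as stated in the PD-Disagg note; the helper names are illustrative.
ROUTER_PORT = 8000
DEFAULT_PORT = 30000


def client_base_url(recipe: str, host: str = "localhost") -> str:
    """Return the OpenAI-compatible base URL a client should target."""
    port = ROUTER_PORT if recipe == "pd-disagg" else DEFAULT_PORT
    return f"http://{host}:{port}/v1"


def chat_payload(prompt: str, model: str = "deepseek-ai/DeepSeek-V4-Flash") -> dict:
    """Build a chat-completions body using the recommended sampling parameters."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 1.0,
    }


print(client_base_url("pd-disagg"))    # http://localhost:8000/v1
print(client_base_url("low-latency"))  # http://localhost:30000/v1
print(json.dumps(chat_payload("What is 15% of 240?"), indent=2))
```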

4.2 Advanced Usage

4.2.1 Reasoning Parser

Enable the deepseek-v4 reasoning parser (check the box in the command panel above) to separate thinking from the final answer into reasoning_content vs content.

Streaming with Thinking Process (Python)

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
has_answer = False

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
        has_thinking = True
        print(delta.reasoning_content, end="", flush=True)

    if delta.content:
        if has_thinking and not has_answer:
            print("\n=============== Content =================", flush=True)
            has_answer = True
        print(delta.content, end="", flush=True)

print()

Example Output

Output

We are asked: "What is 15% of 240?" This is a simple percentage problem. I need to provide a step-by-step solution. The user wants the solution explained step by step. I'll calculate 15% of 240: 0.15 * 240 = 36. I'll break it down into steps: understand what percent means, convert percentage to decimal or fraction, then multiply. I'll present the answer clearly.</think>To find 15% of 240, follow these steps:

**Step 1: Understand the meaning of percent**
"Percent" means "per hundred," so 15% means 15 out of every100, or \( \frac{15}{100} \).

**Step2: Convert the percentage to a decimal or fraction**
\( 15\% = \frac{15}{100} = 0.15 \)

**Step3: Multiply by the given number**
Multiply the decimal form by 240:
\( 0.15 \times 240 \)

**Step4: Perform the multiplication**
\( 0.15 \times 240 = 36 \)

**Answer:** 15% of 240 is **36**.
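When the two parts are needed as strings rather than printed live, the same loop can accumulate them instead. The sketch below is one way to do that, assuming the stream yields chunks shaped like those in the example above (choices[0].delta with optional reasoning_content and content). The fake chunks exist only so the function can be tried without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate a chat-completions stream into (reasoning, answer) strings."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def _fake_chunk(**delta_fields):
    """Build a stand-in chunk with the same attribute layout (test helper)."""
    delta = SimpleNamespace(**delta_fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [
    _fake_chunk(reasoning_content="0.15 * 240 = 36.", content=None),
    SimpleNamespace(choices=[]),  # chunks without choices are skipped
    _fake_chunk(content="15% of 240 is "),
    _fake_chunk(content="36."),
]
reasoning, answer = collect_stream(demo)
print(reasoning)  # 0.15 * 240 = 36.
print(answer)     # 15% of 240 is 36.
```

With a live server, the `response` object from the streaming example can be passed to collect_stream directly.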

4.2.2 Tool Calling

Enable the deepseekv4 tool-call parser (check the box in the command panel above) to surface structured tool calls via message.tool_calls.

Python Example with Thinking Process

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
tool_calls_accumulator = {}

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
        has_thinking = True
        print(delta.reasoning_content, end="", flush=True)

    if getattr(delta, "tool_calls", None):
        if has_thinking and thinking_started:
            print("\n=============== Content =================\n", flush=True)
            thinking_started = False
        for tool_call in delta.tool_calls:
            index = tool_call.index
            if index not in tool_calls_accumulator:
                tool_calls_accumulator[index] = {"name": None, "arguments": ""}
            if tool_call.function:
                if tool_call.function.name:
                    tool_calls_accumulator[index]["name"] = tool_call.function.name
                if tool_call.function.arguments:
                    tool_calls_accumulator[index]["arguments"] += tool_call.function.arguments

    if delta.content:
        print(delta.content, end="", flush=True)

for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"Tool Call: {tool_call['name']}")
    print(f"   Arguments: {tool_call['arguments']}")

print()

Example Output

Output

The user wants to know the weather in Beijing. I'll use the get_weather function with Beijing as the location. I don't need to specify a unit, so I'll just use the default.</think>

<｜DSML｜tool_calls>
<｜DSML｜invoke name="get_weather">
<｜DSML｜parameter name="location" string="true">Beijing</｜DSML｜parameter>
</｜DSML｜invoke>
</｜DSML｜tool_calls>
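Once the stream ends, the accumulated arguments are still JSON strings that have to be parsed and routed to real functions. The following sketch shows one way to do that for the accumulator layout used above ({index: {"name", "arguments"}}). The get_weather body and the registry are hypothetical placeholders, not part of SGLang or the model.

```python
import json


def get_weather(location: str, unit: str = "celsius") -> dict:
    """Hypothetical stand-in; a real implementation would query a weather service."""
    return {"location": location, "unit": unit, "temperature": None}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(accumulator, registry=TOOL_REGISTRY):
    """Parse each accumulated call, invoke the matching function, collect results."""
    results = []
    for index, call in sorted(accumulator.items()):
        entry = {"index": index, "name": call["name"]}
        func = registry.get(call["name"])
        if func is None:
            entry["error"] = "unknown tool"
        else:
            try:
                arguments = json.loads(call["arguments"] or "{}")
                entry["content"] = json.dumps(func(**arguments))
            except (json.JSONDecodeError, TypeError) as exc:
                # Malformed JSON or arguments that do not match the signature.
                entry["error"] = str(exc)
        results.append(entry)
    return results


accumulated = {0: {"name": "get_weather", "arguments": '{"location": "Beijing"}'}}
print(run_tool_calls(accumulated))
```

To send the results back as role "tool" messages, the streaming loop would also need to capture each tool_call.id, which the accumulator above does not store.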

4.2.3 HiCache (Hierarchical KV Caching)

HiCache enables multi-tier KV cache offloading (GPU → CPU → Storage), significantly expanding effective context capacity for long-context and multi-turn scenarios. Combined with UnifiedRadixTree, it provides intelligent prefix caching across all tiers. To enable HiCache, use the HiCache toggle in the command generator above:

L2 (GPU + CPU): Offloads cold KV pages to CPU memory. Enables SGLANG_ENABLE_UNIFIED_RADIX_TREE=1 for intelligent hierarchical prefix caching.
L3 (GPU + CPU + Storage): Coming soon.

For more details, see the HiCache documentation.

5. Benchmark

5.1 Accuracy Benchmark

For accuracy benchmarking on DeepSeek-V4 models, please make sure that:

SGLANG_DEFAULT_THINKING=1 and SGLANG_REASONING_EFFORT=max are set when launching the model.
For GPQA and AIME25 benchmarks, run at least 16 repeats (--n-repeats 16 in the commands below) to reduce randomness.
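Repeating a benchmark helps because the standard error of the mean shrinks with the square root of the number of runs. The sketch below summarizes per-repeat accuracies that way. The accuracy values are made up for illustration and are not measured results.

```python
import statistics


def summarize_repeats(accuracies):
    """Return (mean, sample stdev, standard error of the mean)."""
    mean = statistics.fmean(accuracies)
    stdev = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    stderr = stdev / len(accuracies) ** 0.5
    return mean, stdev, stderr


# Hypothetical per-repeat accuracies for 16 repeats (not real measurements).
runs = [0.88, 0.90, 0.89, 0.91] * 4
mean, stdev, stderr = summarize_repeats(runs)
print(f"mean={mean:.4f} stdev={stdev:.4f} stderr={stderr:.4f} (n={len(runs)})")
```

With 16 repeats the standard error is one quarter of the per-run standard deviation, which is why a single run is a poor basis for comparing against the reference accuracies.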

5.1.1 GSM8K Benchmark

Benchmark Command:

Command

python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 30000

Test Results:
- DeepSeek-V4-Pro (FP4, B300, low-latency)
  Accuracy: 0.965 Invalid: 0.000
- DeepSeek-V4-Pro (FP4, H200, low-latency)
  Accuracy: 0.975 Invalid: 0.000

5.1.2 GPQA Diamond Benchmark

For the GPQA Diamond benchmark, we recommend using sgl-eval as the benchmark tool.

Command

# Install
pip install git+https://github.com/sgl-project/sgl-eval

# For Flash model, reference accuracy: 88.1%
sgl-eval run gpqa --model deepseek-ai/DeepSeek-V4-Flash --api-key <api-key> --n-repeats 16 --max-tokens 200000 --temperature 1.0 --top-p 1.0 --thinking --out-dir /sgl-workspace/logs --base-url http://localhost:30000/v1

# For Pro model, reference accuracy: 90.1%
sgl-eval run gpqa --model deepseek-ai/DeepSeek-V4-Pro --api-key <api-key> --n-repeats 16 --max-tokens 400000 --temperature 1.0 --top-p 1.0 --thinking --out-dir /sgl-workspace/logs --base-url http://localhost:30000/v1

5.1.3 AIME25 Benchmark

For the AIME25 benchmark, we recommend using sgl-eval as the benchmark tool.

Command

# Install
pip install git+https://github.com/sgl-project/sgl-eval

# For Flash model, reference accuracy: ~95%
sgl-eval run aime25 --model deepseek-ai/DeepSeek-V4-Flash --api-key <api-key> --n-repeats 16 --max-tokens 200000 --temperature 1.0 --top-p 1.0 --thinking --out-dir /sgl-workspace/logs --base-url http://localhost:30000/v1

# For Pro model, reference accuracy: ~97.5%
sgl-eval run aime25 --model deepseek-ai/DeepSeek-V4-Pro --api-key <api-key> --n-repeats 16 --max-tokens 400000 --temperature 1.0 --top-p 1.0 --thinking --out-dir /sgl-workspace/logs --base-url http://localhost:30000/v1

5.2 Speed Benchmark

We use SGLang’s built-in benchmarking tool (sglang.bench_serving) with its random dataset: real prompts sampled from ShareGPT_Vicuna_unfiltered, then truncated or padded to a controlled length. Because the prompts come from real conversations, the results better reflect performance in actual use. Each run sets --random-input-len 1024 and --random-output-len 1024 to represent medium-length conversations with detailed responses.

5.2.1 Hopper

Test Environment:

Hardware: NVIDIA H200 GPU (4x)
Model: DeepSeek-V4-Flash (FP4)
Tensor Parallelism: 4
sglang version: 0.5.12

Latency-Sensitive Benchmark

Model Deployment Command: H200 · DeepSeek-V4-Flash · FP4 · Low-Latency. See the command panel above.
Benchmark Command:

Command

python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 10 \
  --max-concurrency 1

Test Results:

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  15.98
Total input tokens:                      6101
Total input text tokens:                 6101
Total generated tokens:                  4220
Total generated tokens (retokenized):    4220
Request throughput (req/s):              0.63
Input token throughput (tok/s):          381.86
Output token throughput (tok/s):         264.13
Peak output token throughput (tok/s):    324.00
Peak concurrent requests:                3
Total token throughput (tok/s):          645.98
Concurrency:                             1.00
Accept length:                           2.96
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1596.65
Median E2E Latency (ms):                 1274.48
P90 E2E Latency (ms):                    2950.70
P99 E2E Latency (ms):                    3333.18
---------------Time to First Token----------------
Mean TTFT (ms):                          147.26
Median TTFT (ms):                        132.22
P99 TTFT (ms):                           181.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.50
Median TPOT (ms):                        3.48
P99 TPOT (ms):                           4.18
---------------Inter-Token Latency----------------
Mean ITL (ms):                           3.44
Median ITL (ms):                         3.36
P95 ITL (ms):                            5.06
P99 ITL (ms):                            5.15
Max ITL (ms):                            35.31
==================================================
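The throughput lines in the report follow from the token counts and the benchmark duration. The sketch below recomputes them from the figures printed above. Small differences from the reported values are expected because the printed duration is rounded to two decimals. The function name is illustrative.

```python
def derived_throughput(requests, input_tokens, output_tokens, duration_s):
    """Recompute the per-second rates from totals and wall-clock duration."""
    return {
        "request_throughput": requests / duration_s,
        "input_token_throughput": input_tokens / duration_s,
        "output_token_throughput": output_tokens / duration_s,
        "total_token_throughput": (input_tokens + output_tokens) / duration_s,
    }


# Totals copied from the H200 low-latency result block above.
h200_low_latency = derived_throughput(
    requests=10, input_tokens=6101, output_tokens=4220, duration_s=15.98
)
for name, value in h200_low_latency.items():
    print(f"{name}: {value:.2f}")
```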

Throughput-Sensitive Benchmark

Model Deployment Command: H200 · DeepSeek-V4-Flash · FP4 · Max-Throughput. See the command panel above.
Benchmark Command:

Command

python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 1000 \
  --max-concurrency 100

Test Results:

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  198.42
Total input tokens:                      512842
Total input text tokens:                 512842
Total generated tokens:                  510855
Total generated tokens (retokenized):    510765
Request throughput (req/s):              5.04
Input token throughput (tok/s):          2584.65
Output token throughput (tok/s):         2574.64
Peak output token throughput (tok/s):    4400.00
Peak concurrent requests:                110
Total token throughput (tok/s):          5159.28
Concurrency:                             96.21
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   19090.29
Median E2E Latency (ms):                 18328.71
P90 E2E Latency (ms):                    35698.68
P99 E2E Latency (ms):                    39161.43
---------------Time to First Token----------------
Mean TTFT (ms):                          302.41
Median TTFT (ms):                        131.35
P99 TTFT (ms):                           2172.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          37.46
Median TPOT (ms):                        37.72
P99 TPOT (ms):                           55.72
---------------Inter-Token Latency----------------
Mean ITL (ms):                           36.85
Median ITL (ms):                         21.75
P95 ITL (ms):                            107.64
P99 ITL (ms):                            134.58
Max ITL (ms):                            1930.74
==================================================

5.2.2 Blackwell

Test Environment:

Hardware: NVIDIA B200 GPU (4x)
Model: DeepSeek-V4-Flash (FP4)
Tensor Parallelism: 4
sglang version: 0.5.12

Latency-Sensitive Benchmark

Model Deployment Command: B200 · DeepSeek-V4-Flash · FP4 · Low-Latency. See the command panel above.
Benchmark Command:

Command

python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 10 \
  --max-concurrency 1

Test Results:

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  15.25
Total input tokens:                      6101
Total input text tokens:                 6101
Total generated tokens:                  4220
Total generated tokens (retokenized):    4220
Request throughput (req/s):              0.66
Input token throughput (tok/s):          400.06
Output token throughput (tok/s):         276.72
Peak output token throughput (tok/s):    308.00
Peak concurrent requests:                2
Total token throughput (tok/s):          676.78
Concurrency:                             1.00
Accept length:                           2.73
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1523.83
Median E2E Latency (ms):                 1173.50
P90 E2E Latency (ms):                    2770.33
P99 E2E Latency (ms):                    3233.82
---------------Time to First Token----------------
Mean TTFT (ms):                          102.72
Median TTFT (ms):                        85.94
P99 TTFT (ms):                           134.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.40
Median TPOT (ms):                        3.42
P99 TPOT (ms):                           4.00
---------------Inter-Token Latency----------------
Mean ITL (ms):                           3.38
Median ITL (ms):                         3.06
P95 ITL (ms):                            4.60
P99 ITL (ms):                            4.95
Max ITL (ms):                            34.64
==================================================

Throughput-Sensitive Benchmark

Model Deployment Command: B200 · DeepSeek-V4-Flash · FP4 · Max-Throughput (MegaMoE W4A4). See the command panel above — flip the MegaMoE toggle to W4A4 to reproduce these numbers; the default Max-Throughput recipe uses --moe-a2a-backend deepep and runs slower.
Benchmark Command:

Command

python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 1000 \
  --max-concurrency 100

Test Results:

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  105.10
Total input tokens:                      512842
Total input text tokens:                 512842
Total generated tokens:                  510855
Total generated tokens (retokenized):    510682
Request throughput (req/s):              9.51
Input token throughput (tok/s):          4879.44
Output token throughput (tok/s):         4860.54
Peak output token throughput (tok/s):    6600.00
Peak concurrent requests:                117
Total token throughput (tok/s):          9739.98
Concurrency:                             94.34
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   9915.50
Median E2E Latency (ms):                 9521.19
P90 E2E Latency (ms):                    17726.66
P99 E2E Latency (ms):                    24910.72
---------------Time to First Token----------------
Mean TTFT (ms):                          349.95
Median TTFT (ms):                        68.23
P99 TTFT (ms):                           4581.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.86
Median TPOT (ms):                        17.96
P99 TPOT (ms):                           61.58
---------------Inter-Token Latency----------------
Mean ITL (ms):                           18.76
Median ITL (ms):                         13.23
P95 ITL (ms):                            44.79
P99 ITL (ms):                            88.25
Max ITL (ms):                            2499.49
==================================================
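For a quick side-by-side reading of the two throughput-sensitive runs, the ratios can be computed directly from the reported figures. This is plain arithmetic on the numbers above, not a new measurement. Note that the B200 run uses the MegaMoE W4A4 variant while the H200 run does not, so the ratio mixes hardware and kernel differences.

```python
# Figures copied from the two Throughput-Sensitive result blocks (Flash, 4 GPUs).
H200 = {"duration_s": 198.42, "total_tok_s": 5159.28, "mean_tpot_ms": 37.46}
B200 = {"duration_s": 105.10, "total_tok_s": 9739.98, "mean_tpot_ms": 19.86}


def ratios(baseline, candidate):
    """Express the candidate run relative to the baseline (values > 1 favor it)."""
    return {
        "throughput_gain": candidate["total_tok_s"] / baseline["total_tok_s"],
        "duration_reduction": baseline["duration_s"] / candidate["duration_s"],
        "tpot_reduction": baseline["mean_tpot_ms"] / candidate["mean_tpot_ms"],
    }


for name, value in ratios(H200, B200).items():
    print(f"{name}: {value:.2f}x")
```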
