1. Model Introduction
Kimi-K2.5 is an open-source, native multimodal agentic model from Moonshot AI, built through continual pretraining on approximately 15 trillion mixed visual and text tokens on top of Kimi-K2-Base. It integrates vision and language understanding with advanced agentic capabilities and offers both instant and thinking modes.
Key Features:
- Native Multimodality: Pre-trained on vision-language tokens, K2.5 excels at visual knowledge, cross-modal reasoning, and agentic tool use grounded in visual inputs.
- Coding with Vision: K2.5 generates code from visual specifications (UI designs, video workflows) and autonomously orchestrates tools for visual data processing.
- Agent Swarm: K2.5 transitions from single-agent scaling to a self-directed, coordinated swarm-like execution scheme. It decomposes complex tasks into parallel sub-tasks executed by dynamically instantiated, domain-specific agents.
- Speculative Decoding: EAGLE-based speculative decoding support for lower latency.
Available model weights:
- INT4 (initial release): moonshotai/Kimi-K2.5
- NVFP4 (4-bit quantized): nvidia/Kimi-K2.5-NVFP4
2. SGLang Installation
Refer to the official SGLang installation guide.
3. Model Deployment
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, deployment strategy, and capabilities.
3.2 Configuration Tips
- Memory: Requires GPUs with >=140 GB memory each. Supported platforms: H200 (8x, TP=8), B300 (8x, TP=8), MI300X/MI325X (4x, TP=4), MI350X/MI355X (4x, TP=4). Use `--context-length 128000` to conserve memory.
- AMD GPU TP Constraint: On AMD GPUs, TP must be <= 4 (not 8). Kimi-K2.5 has 64 attention heads, and the AITER MLA kernel requires `heads_per_gpu % 16 == 0`. With TP=4, each GPU gets 16 heads (valid); with TP=8, each GPU gets 8 heads (invalid).
- AMD Docker Image: Use `lmsysorg/sglang:v0.5.9-rocm700-mi35x` for MI350X/MI355X and `lmsysorg/sglang:v0.5.9-rocm700-mi30x` for MI300X/MI325X. The ROCm 7.2 images (rocm720) have an AITER compatibility issue.
- DP Attention: Enable with `--dp <N> --enable-dp-attention` for production throughput. A common choice is to set `--dp` equal to `--tp`, but this is not required.
- Reasoning Parser: Add `--reasoning-parser kimi_k2` to separate thinking and content in model outputs.
- Tool Call Parser: Add `--tool-call-parser kimi_k2` for structured tool calls.
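Putting these tips together, an illustrative 8x H200 launch might look like the following; flag values are examples only and should be adapted to your hardware and the constraints above:

```shell
sglang serve \
  --model-path moonshotai/Kimi-K2.5 \
  --tp 8 \
  --dp 8 --enable-dp-attention \
  --context-length 128000 \
  --reasoning-parser kimi_k2 \
  --tool-call-parser kimi_k2 \
  --trust-remote-code \
  --host 0.0.0.0 --port 30000
```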
4. Model Invocation
4.1 Basic Usage
See Basic API Usage.
4.2 Advanced Usage
4.2.1 Multimodal (Vision + Text) Input
Kimi-K2.5 supports native multimodal input with images:
Example
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:30000/v1",
api_key="EMPTY"
)
response = client.chat.completions.create(
model="moonshotai/Kimi-K2.5",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
}
},
{
"type": "text",
"text": "What is in this image? Describe it in detail."
}
]
}
]
)
print(response.choices[0].message.content)
Output
This image shows a **receipt from Auntie Anne's** (a pretzel franchise restaurant).
## Key Details:
**Item Purchased:**
- **CINNAMON SUGAR** - 1 unit x 17,000 = **17,000**
**Payment Summary:**
- **SUB TOTAL:** 17,000
- **GRAND TOTAL:** 17,000
- **CASH IDR:** 20,000 (Indonesian Rupiah)
- **CHANGE DUE:** 3,000
## Context:
The receipt indicates a transaction in **Indonesian Rupiah (IDR)**. A customer purchased one Cinnamon Sugar pretzel for 17,000 IDR, paid with a 20,000 IDR note, and received 3,000 IDR in change.
The top of the receipt shows the Auntie Anne's logo (a heart-shaped pretzel with a halo), and some text appears blurred for privacy, likely obscuring the store location, date, and transaction number. The receipt is printed on white thermal paper.
4.2.2 Reasoning Output
Kimi-K2.5 supports both thinking mode (the default) and instant mode. In thinking mode, reasoning content is automatically separated from the final response:
Example
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:30000/v1",
api_key="EMPTY"
)
response = client.chat.completions.create(
model="moonshotai/Kimi-K2.5",
messages=[
{"role": "user", "content": "Which one is bigger, 9.11 or 9.9? Think carefully."}
]
)
print("====== Reasoning Content (Thinking Mode) ======")
print(response.choices[0].message.reasoning_content)
print("====== Response (Thinking Mode) ======")
print(response.choices[0].message.content)
Instant Mode: disable thinking via `chat_template_kwargs`:
Example
response = client.chat.completions.create(
model="moonshotai/Kimi-K2.5",
messages=[
{"role": "user", "content": "Which one is bigger, 9.11 or 9.9? Think carefully."}
],
extra_body={"chat_template_kwargs": {"thinking": False}}
)
print("====== Response (Instant Mode) ======")
print(response.choices[0].message.content)
Output
====== Reasoning Content (Thinking Mode) ======
The user is asking which number is bigger: 9.11 or 9.9.
At first glance, someone might think 9.11 is bigger because 11 > 9, but that's incorrect because we're dealing with decimal numbers, not whole numbers.
Let me compare them properly:
- 9.9 = 9.90
- 9.11
When comparing decimals, we look at each place value from left to right:
- Units place: 9 = 9 (tie)
- Tenths place: 9 vs 1
Since 9 > 1, we have 9.9 > 9.11.
Alternatively, we can think of it as:
- 9.9 = 9 + 9/10 = 9 + 0.9 = 9.90
- 9.11 = 9 + 11/100 = 9 + 0.11
Since 0.90 > 0.11, then 9.9 > 9.11.
So the answer is clearly 9.9 is bigger.
The "think carefully" hint suggests the user is trying to catch the common error where people compare 11 and 9 as whole numbers rather than understanding decimal place value (tenths vs hundredths).
I should explain this clearly to avoid confusion.
====== Response (Thinking Mode) ======
**9.9 is bigger.**
Here's why this can be tricky: Many people instinctively compare 11 and 9 and think "11 is bigger than 9," but that's comparing the wrong place values.
When comparing decimals, align them by place value:
- 9.9 = 9.**90**
- 9.11 = 9.**11**
After the decimal point:
- The first digit (tenths place): **9** vs **1**
- Since 9 > 1, we stop there. **9.9 is larger.**
Think of it as money:
- $9.90 (nine dollars and ninety cents)
- $9.11 (nine dollars and eleven cents)
$9.90 is clearly more than $9.11.
====== Response (Instant Mode) ======
Let me think through this carefully.
**9.9 is bigger than 9.11**
Here's why: When comparing decimals, we need to align them by their decimal places:
- 9.9 = 9.90
- 9.11 = 9.11
Now comparing:
- The whole number parts are equal (9 = 9)
- Comparing tenths: **9 > 1**
So 9.90 > 9.11
A common mistake is thinking 11 hundredths is larger than 9 tenths, but 9 tenths = 90 hundredths, which is clearly larger than 11 hundredths.
4.2.3 Tool Calling
Kimi-K2.5 supports tool calling for agentic tasks:
Example
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:30000/v1",
api_key="EMPTY"
)
# Define available tools
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city name"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location"]
}
}
}
]
response = client.chat.completions.create(
model="moonshotai/Kimi-K2.5",
messages=[
{"role": "user", "content": "What's the weather in Beijing?"}
],
tools=tools,
stream=True
)
# Process streaming response
tool_calls_accumulator = {}
for chunk in response:
if chunk.choices and len(chunk.choices) > 0:
delta = chunk.choices[0].delta
if hasattr(delta, 'tool_calls') and delta.tool_calls:
for tool_call in delta.tool_calls:
index = tool_call.index
if index not in tool_calls_accumulator:
tool_calls_accumulator[index] = {'name': None, 'arguments': ''}
if tool_call.function:
if tool_call.function.name:
tool_calls_accumulator[index]['name'] = tool_call.function.name
if tool_call.function.arguments:
tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments
if delta.content:
print(delta.content, end="", flush=True)
for index, tool_call in sorted(tool_calls_accumulator.items()):
print(f"Tool Call: {tool_call['name']}")
print(f" Arguments: {tool_call['arguments']}")
Output
Tool Call: get_weather
Arguments: {"location": "Beijing"}
Example
# Send tool result back to the model
messages = [
{"role": "user", "content": "What's the weather in Beijing?"},
{
"role": "assistant",
"content": None,
"tool_calls": [{
"id": "call_123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": '{"location": "Beijing", "unit": "celsius"}'
}
}]
},
{
"role": "tool",
"tool_call_id": "call_123",
"content": "The weather in Beijing is 22°C and sunny."
}
]
final_response = client.chat.completions.create(
model="moonshotai/Kimi-K2.5",
messages=messages
)
print(final_response.choices[0].message.content)
Output
The weather in Beijing is **22°C and sunny**. ☀️
It's a nice day there with comfortable temperatures and clear skies!
4.2.4 Multimodal + Tool Calling (Agentic Vision)
Combine vision understanding with tool calling for advanced agentic tasks:
Example
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:30000/v1",
api_key="EMPTY"
)
tools = [
{
"type": "function",
"function": {
"name": "search_product",
"description": "Search for a product by name or description",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The product name or description to search for"
}
},
"required": ["query"]
}
}
}
]
response = client.chat.completions.create(
model="moonshotai/Kimi-K2.5",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
}
},
{
"type": "text",
"text": "Can you identify this product and search for similar items?"
}
]
}
],
tools=tools
)
msg = response.choices[0].message
# Print reasoning process
if msg.reasoning_content:
print("=== Reasoning ===")
print(msg.reasoning_content)
# Print response content
if msg.content:
print("=== Content ===")
print(msg.content)
# Print tool calls
if msg.tool_calls:
print("=== Tool Calls ===")
for tc in msg.tool_calls:
print(f" Function: {tc.function.name}")
print(f" Arguments: {tc.function.arguments}")
Output
=== Reasoning ===
The user is asking me to identify a product from a receipt and search for similar items.
Looking at the receipt, I can see:
1. The store is "Auntie Anne's" - which is a popular pretzel chain
2. The product purchased is "CINNAMON SUGAR"
3. Price is 17,000 (likely Indonesian Rupiah based on "CASH IDR")
4. Quantity is 1
So the product is a Cinnamon Sugar pretzel from Auntie Anne's.
Now I need to search for this product or similar items using the search_product function.
=== Content ===
I can see from the receipt that the product is a **Cinnamon Sugar** item from **Auntie Anne's** (the famous pretzel chain). This appears to be a Cinnamon Sugar Pretzel purchased for 17,000 IDR (Indonesian Rupiah).
Let me search for this product and similar items:
=== Tool Calls ===
Function: search_product
Arguments: {"query": "Auntie Anne's Cinnamon Sugar Pretzel"}
4.2.5 Speculative Decoding
NVIDIA: Deploy Kimi-K2.5 with the following command (H200/B200, all features enabled):
Command
SGLANG_ENABLE_SPEC_V2=1 sglang serve \
--model-path moonshotai/Kimi-K2.5 \
--tp 8 \
--reasoning-parser kimi_k2 \
--tool-call-parser kimi_k2 \
--speculative-algorithm=EAGLE3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-draft-model-path lightseekorg/kimi-k2.5-eagle3 \
--trust-remote-code \
--host 0.0.0.0 \
--port 30000
For the NVFP4 quantized checkpoint (with FP8 KV cache):
Command
SGLANG_ENABLE_SPEC_V2=1 sglang serve \
--model-path nvidia/Kimi-K2.5-NVFP4 \
--tp 8 \
--reasoning-parser kimi_k2 \
--tool-call-parser kimi_k2 \
--kv-cache-dtype fp8_e4m3 \
--speculative-algorithm=EAGLE3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-draft-model-path lightseekorg/kimi-k2.5-eagle3 \
--trust-remote-code \
--host 0.0.0.0 \
--port 30000
5. Benchmark
5.1 Accuracy Benchmark
5.1.1 MMMU Benchmark
You can evaluate the model’s accuracy using the MMMU benchmark, which tests multimodal understanding and reasoning across various subjects.
- Benchmark Command:
Command
python3 benchmark/mmmu/bench_sglang.py \
--response-answer-regex "(?i)(?:answer|ans)[:\s]*(?:\*\*)?[\(\[]?([A-Za-z])[\)\]]?(?:\*\*)?" \
--port 30000 \
--concurrency 64
- Result:
Output
Benchmark time: 2785.4322692090645
answers saved to: ./answer_sglang.json
Evaluating...
answers saved to: ./answer_sglang.json
{'Accounting': {'acc': 0.667, 'num': 30},
'Agriculture': {'acc': 0.567, 'num': 30},
'Architecture_and_Engineering': {'acc': 0.733, 'num': 30},
'Art': {'acc': 0.833, 'num': 30},
'Art_Theory': {'acc': 0.8, 'num': 30},
'Basic_Medical_Science': {'acc': 0.833, 'num': 30},
'Biology': {'acc': 0.6, 'num': 30},
'Chemistry': {'acc': 0.633, 'num': 30},
'Clinical_Medicine': {'acc': 0.733, 'num': 30},
'Computer_Science': {'acc': 0.667, 'num': 30},
'Design': {'acc': 0.7, 'num': 30},
'Diagnostics_and_Laboratory_Medicine': {'acc': 0.5, 'num': 30},
'Economics': {'acc': 0.867, 'num': 30},
'Electronics': {'acc': 0.3, 'num': 30},
'Energy_and_Power': {'acc': 0.767, 'num': 30},
'Finance': {'acc': 0.833, 'num': 30},
'Geography': {'acc': 0.667, 'num': 30},
'History': {'acc': 0.767, 'num': 30},
'Literature': {'acc': 0.767, 'num': 30},
'Manage': {'acc': 0.733, 'num': 30},
'Marketing': {'acc': 0.833, 'num': 30},
'Materials': {'acc': 0.567, 'num': 30},
'Math': {'acc': 0.633, 'num': 30},
'Mechanical_Engineering': {'acc': 0.567, 'num': 30},
'Music': {'acc': 0.5, 'num': 30},
'Overall': {'acc': 0.698, 'num': 900},
'Overall-Art and Design': {'acc': 0.708, 'num': 120},
'Overall-Business': {'acc': 0.787, 'num': 150},
'Overall-Health and Medicine': {'acc': 0.74, 'num': 150},
'Overall-Humanities and Social Science': {'acc': 0.75, 'num': 120},
'Overall-Science': {'acc': 0.66, 'num': 150},
'Overall-Tech and Engineering': {'acc': 0.595, 'num': 210},
'Pharmacy': {'acc': 0.767, 'num': 30},
'Physics': {'acc': 0.767, 'num': 30},
'Psychology': {'acc': 0.667, 'num': 30},
'Public_Health': {'acc': 0.867, 'num': 30},
'Sociology': {'acc': 0.8, 'num': 30}}
eval out saved to ./val_sglang.json
Overall accuracy: 0.698
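The answer-extraction pattern passed via `--response-answer-regex` above can be sanity-checked offline. A small sketch (the sample strings are illustrative, not taken from benchmark output):

```python
import re

# Same pattern as --response-answer-regex in the benchmark command above.
ANSWER_RE = re.compile(
    r"(?i)(?:answer|ans)[:\s]*(?:\*\*)?[\(\[]?([A-Za-z])[\)\]]?(?:\*\*)?"
)

def extract_answer(text: str):
    """Return the single answer letter found in a model response, or None."""
    m = ANSWER_RE.search(text)
    return m.group(1) if m else None

# Handles plain, bolded, and bracketed answer styles.
for sample in ("Answer: B", "The final ans: **(C)**", "ANSWER [d]"):
    print(sample, "->", extract_answer(sample))
```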
5.2 Speed Benchmark
Test Environment:
- Hardware: NVIDIA H200 GPU (8x)
- Model: Kimi-K2.5
- Tensor Parallelism: 8
- SGLang Version: 0.5.6.post2
All runs use the random dataset for standardized performance evaluation.
5.2.1 Latency Benchmark
- Model Deployment:
Command
sglang serve \
--model-path moonshotai/Kimi-K2.5 \
--tp 8 \
--trust-remote-code \
--host 0.0.0.0 \
--port 30000
- Benchmark Command:
Command
python3 -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 10 \
--max-concurrency 1 \
--request-rate inf
- Results:
Output
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 39.77
Total input tokens: 6101
Total input text tokens: 6101
Total generated tokens: 4220
Total generated tokens (retokenized): 4221
Request throughput (req/s): 0.25
Input token throughput (tok/s): 153.40
Output token throughput (tok/s): 106.10
Peak output token throughput (tok/s): 156.00
Peak concurrent requests: 2
Total token throughput (tok/s): 259.50
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 3972.87
Median E2E Latency (ms): 4044.55
P90 E2E Latency (ms): 7046.30
P99 E2E Latency (ms): 7441.13
---------------Time to First Token----------------
Mean TTFT (ms): 176.89
Median TTFT (ms): 154.24
P99 TTFT (ms): 285.75
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 9.22
Median TPOT (ms): 9.32
P99 TPOT (ms): 12.72
---------------Inter-Token Latency----------------
Mean ITL (ms): 9.02
Median ITL (ms): 8.80
P95 ITL (ms): 13.23
P99 ITL (ms): 14.17
Max ITL (ms): 29.38
==================================================
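As a quick sanity check, the throughput figures above are just token counts divided by the benchmark duration. Recomputing them from the numbers in this run:

```python
# Recompute the throughput figures reported in the run above.
duration_s = 39.77     # Benchmark duration (s)
input_tokens = 6101    # Total input tokens
output_tokens = 4220   # Total generated tokens

input_tps = input_tokens / duration_s
output_tps = output_tokens / duration_s
total_tps = (input_tokens + output_tokens) / duration_s

# Agrees with the reported 153.40 / 106.10 / 259.50 tok/s to one decimal.
print(round(input_tps, 1), round(output_tps, 1), round(total_tps, 1))
```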
- Medium Concurrency (Balanced)
Command
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 80 \
--max-concurrency 16 \
--request-rate inf
Output
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 80
Benchmark duration (s): 158.05
Total input tokens: 39668
Total input text tokens: 39668
Total generated tokens: 40805
Total generated tokens (retokenized): 40775
Request throughput (req/s): 0.51
Input token throughput (tok/s): 250.99
Output token throughput (tok/s): 258.18
Peak output token throughput (tok/s): 1103.00
Peak concurrent requests: 19
Total token throughput (tok/s): 509.17
Concurrency: 14.09
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 27837.05
Median E2E Latency (ms): 23508.00
P90 E2E Latency (ms): 57126.31
P99 E2E Latency (ms): 66044.35
---------------Time to First Token----------------
Mean TTFT (ms): 374.30
Median TTFT (ms): 375.51
P99 TTFT (ms): 695.58
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 53.25
Median TPOT (ms): 57.93
P99 TPOT (ms): 85.45
---------------Inter-Token Latency----------------
Mean ITL (ms): 53.95
Median ITL (ms): 53.97
P95 ITL (ms): 84.74
P99 ITL (ms): 244.84
Max ITL (ms): 655.61
==================================================
- High Concurrency (Throughput-Optimized)
Command
python3 -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 500 \
--max-concurrency 100 \
--request-rate inf
- Results:
Output
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 500
Benchmark duration (s): 996.64
Total input tokens: 249831
Total input text tokens: 249831
Total generated tokens: 252662
Total generated tokens (retokenized): 252588
Request throughput (req/s): 0.50
Input token throughput (tok/s): 250.67
Output token throughput (tok/s): 253.51
Peak output token throughput (tok/s): 1199.00
Peak concurrent requests: 104
Total token throughput (tok/s): 504.18
Concurrency: 92.70
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 184773.75
Median E2E Latency (ms): 174183.65
P90 E2E Latency (ms): 343625.28
P99 E2E Latency (ms): 404284.53
---------------Time to First Token----------------
Mean TTFT (ms): 1289.59
Median TTFT (ms): 1313.35
P99 TTFT (ms): 2346.78
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 364.70
Median TPOT (ms): 403.32
P99 TPOT (ms): 452.34
---------------Inter-Token Latency----------------
Mean ITL (ms): 363.82
Median ITL (ms): 316.21
P95 ITL (ms): 745.91
P99 ITL (ms): 1345.88
Max ITL (ms): 3118.59
==================================================
The following runs use 1000 input tokens and 8000 output tokens:
- Low Concurrency
Command
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 8000 \
--num-prompts 10 \
--max-concurrency 1 \
--request-rate inf
Output
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 680.26
Total input tokens: 6101
Total input text tokens: 6101
Total generated tokens: 44462
Total generated tokens (retokenized): 44455
Request throughput (req/s): 0.01
Input token throughput (tok/s): 8.97
Output token throughput (tok/s): 65.36
Peak output token throughput (tok/s): 151.00
Peak concurrent requests: 2
Total token throughput (tok/s): 74.33
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 68019.29
Median E2E Latency (ms): 70568.85
P90 E2E Latency (ms): 113237.40
P99 E2E Latency (ms): 121682.34
---------------Time to First Token----------------
Mean TTFT (ms): 206.17
Median TTFT (ms): 177.28
P99 TTFT (ms): 445.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 14.36
Median TPOT (ms): 15.89
P99 TPOT (ms): 16.43
---------------Inter-Token Latency----------------
Mean ITL (ms): 15.26
Median ITL (ms): 15.85
P95 ITL (ms): 17.50
P99 ITL (ms): 23.21
Max ITL (ms): 45.22
==================================================
- Medium Concurrency
Command
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 8000 \
--num-prompts 80 \
--max-concurrency 16 \
--request-rate inf
Output
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 80
Benchmark duration (s): 2475.98
Total input tokens: 39668
Total input text tokens: 39668
Total generated tokens: 318306
Total generated tokens (retokenized): 318166
Request throughput (req/s): 0.03
Input token throughput (tok/s): 16.02
Output token throughput (tok/s): 128.56
Peak output token throughput (tok/s): 847.00
Peak concurrent requests: 18
Total token throughput (tok/s): 144.58
Concurrency: 14.62
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 452592.46
Median E2E Latency (ms): 486002.05
P90 E2E Latency (ms): 833197.57
P99 E2E Latency (ms): 957399.48
---------------Time to First Token----------------
Mean TTFT (ms): 359.38
Median TTFT (ms): 350.78
P99 TTFT (ms): 500.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 111.18
Median TPOT (ms): 122.76
P99 TPOT (ms): 145.90
---------------Inter-Token Latency----------------
Mean ITL (ms): 113.69
Median ITL (ms): 122.81
P95 ITL (ms): 147.87
P99 ITL (ms): 151.03
Max ITL (ms): 272.05
==================================================
- High Concurrency
Command
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 8000 \
--num-prompts 320 \
--max-concurrency 64 \
--request-rate inf
Output
Waiting for completion...
The following runs use 8000 input tokens and 1000 output tokens:
- Low Concurrency
Command
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--num-prompts 10 \
--max-concurrency 1 \
--request-rate inf
Output
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 120.73
Total input tokens: 41941
Total input text tokens: 41941
Total generated tokens: 4220
Total generated tokens (retokenized): 4220
Request throughput (req/s): 0.08
Input token throughput (tok/s): 347.41
Output token throughput (tok/s): 34.96
Peak output token throughput (tok/s): 73.00
Peak concurrent requests: 2
Total token throughput (tok/s): 382.36
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 12068.56
Median E2E Latency (ms): 10211.36
P90 E2E Latency (ms): 23203.32
P99 E2E Latency (ms): 30677.66
---------------Time to First Token----------------
Mean TTFT (ms): 1625.64
Median TTFT (ms): 1526.63
P99 TTFT (ms): 3743.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 24.95
Median TPOT (ms): 23.95
P99 TPOT (ms): 35.40
---------------Inter-Token Latency----------------
Mean ITL (ms): 24.80
Median ITL (ms): 21.73
P95 ITL (ms): 59.56
P99 ITL (ms): 61.10
Max ITL (ms): 62.70
==================================================
- Medium Concurrency
Command
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--num-prompts 80 \
--max-concurrency 16 \
--request-rate inf
Output
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 80
Benchmark duration (s): 389.96
Total input tokens: 300020
Total input text tokens: 300020
Total generated tokens: 41669
Total generated tokens (retokenized): 41670
Request throughput (req/s): 0.21
Input token throughput (tok/s): 769.36
Output token throughput (tok/s): 106.86
Peak output token throughput (tok/s): 304.00
Peak concurrent requests: 19
Total token throughput (tok/s): 876.22
Concurrency: 14.95
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 72870.97
Median E2E Latency (ms): 70495.88
P90 E2E Latency (ms): 121820.46
P99 E2E Latency (ms): 148933.09
---------------Time to First Token----------------
Mean TTFT (ms): 2460.45
Median TTFT (ms): 1976.29
P99 TTFT (ms): 7305.53
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 140.57
Median TPOT (ms): 142.31
P99 TPOT (ms): 273.40
---------------Inter-Token Latency----------------
Mean ITL (ms): 135.44
Median ITL (ms): 95.96
P95 ITL (ms): 152.93
P99 ITL (ms): 1488.37
Max ITL (ms): 6540.24
==================================================
- High Concurrency
Command
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--num-prompts 320 \
--max-concurrency 64 \
--request-rate inf
Output
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 1279.50
Total input tokens: 1273893
Total input text tokens: 1273893
Total generated tokens: 170000
Total generated tokens (retokenized): 169981
Request throughput (req/s): 0.25
Input token throughput (tok/s): 995.62
Output token throughput (tok/s): 132.86
Peak output token throughput (tok/s): 703.00
Peak concurrent requests: 67
Total token throughput (tok/s): 1128.49
Concurrency: 60.12
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 240385.63
Median E2E Latency (ms): 236266.30
P90 E2E Latency (ms): 429882.12
P99 E2E Latency (ms): 515158.36
---------------Time to First Token----------------
Mean TTFT (ms): 2710.44
Median TTFT (ms): 2345.63
P99 TTFT (ms): 7144.20
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 443.84
Median TPOT (ms): 493.29
P99 TPOT (ms): 606.19
---------------Inter-Token Latency----------------
Mean ITL (ms): 448.23
Median ITL (ms): 296.17
P95 ITL (ms): 1869.15
P99 ITL (ms): 2708.95
Max ITL (ms): 7778.47
==================================================
5.2.2 Speculative Decoding Benchmark
- Model Deployment:
Command
SGLANG_ENABLE_SPEC_V2=1 sglang serve \
--model-path moonshotai/Kimi-K2.5 \
--tp 8 \
--reasoning-parser kimi_k2 \
--tool-call-parser kimi_k2 \
--speculative-algorithm=EAGLE3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-draft-model-path lightseekorg/kimi-k2.5-eagle3 \
--trust-remote-code \
--host 0.0.0.0 \
--port 30000
- Benchmark Command:
Command
python3 -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 10 \
--max-concurrency 1 \
--request-rate inf
- Results:
Output
Pending update...
- Medium Concurrency (Balanced)
Command
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 80 \
--max-concurrency 16 \
--request-rate inf
Output
Pending update...
- High Concurrency (Throughput-Optimized)
Command
python3 -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 500 \
--max-concurrency 100 \
--request-rate inf
Output
Pending update...
5.3 Speed Benchmark (AMD MI350X)
Test Environment:
- Hardware: AMD Instinct MI350X GPU (4x)
- Model: Kimi-K2.5 (BF16)
- Tensor Parallelism: 4
- SGLang Version: 0.5.9
- Docker Image: `lmsysorg/sglang:v0.5.9-rocm700-mi35x`
- ROCm: 7.0
All runs use the random dataset for standardized performance evaluation.
:::info AMD GPU TP Constraint
Kimi-K2.5 requires TP <= 4 on AMD GPUs. The model has 64 attention heads, and the AITER MLA kernel requires heads_per_gpu % 16 == 0. With TP=4, each GPU gets 16 heads (valid). With TP=8, each GPU gets 8 heads (invalid).
:::
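The divisibility constraint in the note above can be verified with a few lines (64 attention heads, per the note):

```python
# Check the AITER MLA constraint: heads_per_gpu % 16 == 0.
NUM_ATTENTION_HEADS = 64  # Kimi-K2.5

def aiter_mla_tp_valid(tp: int) -> bool:
    """Return True if this TP degree yields a per-GPU head count AITER MLA accepts."""
    heads_per_gpu = NUM_ATTENTION_HEADS // tp
    return heads_per_gpu % 16 == 0

for tp in (2, 4, 8):
    print(f"TP={tp}: {NUM_ATTENTION_HEADS // tp} heads/GPU, valid={aiter_mla_tp_valid(tp)}")
```

TP=4 yields 16 heads per GPU (valid), while TP=8 yields 8 (invalid), matching the constraint stated above.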
5.3.1 Latency Benchmark
- Model Deployment:
Command
SGLANG_USE_AITER=1 SGLANG_ROCM_FUSED_DECODE_MLA=0 \
sglang serve \
--model-path moonshotai/Kimi-K2.5 \
--tp 4 \
--mem-fraction-static 0.8 \
--trust-remote-code \
--reasoning-parser kimi_k2 \
--host 0.0.0.0 \
--port 30000
- Benchmark Command:
Command
python3 -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 10 \
--max-concurrency 1 \
--request-rate inf
- Results:
Output
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 155.81
Total input tokens: 6101
Total input text tokens: 6101
Total generated tokens: 4220
Total generated tokens (retokenized): 4222
Request throughput (req/s): 0.06
Input token throughput (tok/s): 39.16
Output token throughput (tok/s): 27.09
Peak output token throughput (tok/s): 29.00
Peak concurrent requests: 2
Total token throughput (tok/s): 66.24
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 15576.22
Median E2E Latency (ms): 12539.80
P90 E2E Latency (ms): 28150.56
P99 E2E Latency (ms): 34873.51
---------------Time to First Token----------------
Mean TTFT (ms): 563.50
Median TTFT (ms): 594.92
P99 TTFT (ms): 830.31
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 35.61
Median TPOT (ms): 35.66
P99 TPOT (ms): 35.77
---------------Inter-Token Latency----------------
Mean ITL (ms): 35.66
Median ITL (ms): 35.69
P95 ITL (ms): 35.96
P99 ITL (ms): 36.13
Max ITL (ms): 36.92
==================================================
- Medium Concurrency (Balanced)
Command
python3 -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 80 \
--max-concurrency 16 \
--request-rate inf
Output
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 80
Benchmark duration (s): 526.66
Total input tokens: 39668
Total input text tokens: 39668
Total generated tokens: 40805
Total generated tokens (retokenized): 40798
Request throughput (req/s): 0.15
Input token throughput (tok/s): 75.32
Output token throughput (tok/s): 77.48
Peak output token throughput (tok/s): 96.00
Peak concurrent requests: 18
Total token throughput (tok/s): 152.80
Concurrency: 14.59
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 96023.27
Median E2E Latency (ms): 93940.20
P90 E2E Latency (ms): 159449.54
P99 E2E Latency (ms): 194706.61
---------------Time to First Token----------------
Mean TTFT (ms): 989.08
Median TTFT (ms): 886.42
P99 TTFT (ms): 1543.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 191.04
Median TPOT (ms): 195.20
P99 TPOT (ms): 238.84
---------------Inter-Token Latency----------------
Mean ITL (ms): 186.68
Median ITL (ms): 183.82
P95 ITL (ms): 189.90
P99 ITL (ms): 673.64
Max ITL (ms): 1633.20
==================================================
