Kimi-K2.7-Code - SGLang Documentation

1. Model Introduction

Kimi-K2.7-Code is a coding-focused agentic model by Moonshot AI, built on top of Kimi-K2.6. It improves real-world long-horizon coding task completion while reducing thinking-token usage by approximately 30% compared with Kimi-K2.6.

Key Features:

Coding-Focused Agentic Model: Optimized for end-to-end coding workflows and complex software engineering tasks.
Token Efficiency: Reduces thinking-token usage by approximately 30% versus Kimi-K2.6.
K2.6-Compatible Deployment: Shares the same architecture as Kimi-K2.5/Kimi-K2.6, so the SGLang deployment method can be reused with the new model ID.
Native Multimodality: Shares Kimi-K2.6’s native multimodal architecture with a MoonViT vision encoder (400M parameters) and supports image and video (experimental) input.

Benchmarks:

Benchmark	Kimi-K2.6	Kimi-K2.7-Code
Kimi Code Bench v2	50.9	62.0
Program Bench	48.3	53.6
MLS Bench Lite	26.7	35.1
Kimi Claw 24/7 Bench	42.9	46.9
MCP Atlas	69.4	76.0
MCP Mark Verified	72.8	81.1

Recommended Generation Parameters:

Thinking Mode: temperature=1.0, top_p=0.95
Kimi-K2.7-Code forces thinking and preserve-thinking behavior; instant mode is not supported.
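
As a minimal sketch, the recommended sampling settings can be passed per request through the OpenAI-compatible API. The helper names build_chat_request and send are hypothetical and exist only for illustration; the endpoint and model ID are the ones used in the examples later in this document.

```python
# Sampling parameters recommended above for thinking mode.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95}


def build_chat_request(prompt, model="moonshotai/Kimi-K2.7-Code"):
    """Return keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **RECOMMENDED_SAMPLING,
    }


def send(prompt, base_url="http://localhost:30000/v1"):
    """Send one request; assumes an SGLang server is running at base_url."""
    # Imported lazily so build_chat_request has no third-party dependency.
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    response = client.chat.completions.create(**build_chat_request(prompt))
    return response.choices[0].message.content
```

Keeping the parameters in one dictionary makes it harder for individual call sites to drift away from the recommended values.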

Available Models:

INT4 (native checkpoint): moonshotai/Kimi-K2.7-Code

License: Modified MIT for the native checkpoint. For details, see the official model card.

2. SGLang Installation

Refer to the official SGLang installation guide.

3. Model Deployment

3.1 Basic Configuration

Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, deployment strategy, and capabilities.

3.2 Configuration Tips

Memory: Requires GPUs with at least 140 GB of memory each. The native INT4 checkpoint supports H200 (8×, TP=8), B300 (8×, TP=8), GB300 (4×, TP=4), MI300X/MI325X (4×, TP=4), and MI350X/MI355X (4×, TP=4). Use --context-length 128000 to conserve memory.
Context Length: The model supports a 256K context length. Use a shorter --context-length when you need to reserve memory for larger batches.
Transformers Version: The model card requires transformers>=4.57.1,<5.0.0.
AMD GPU TP Constraint: On AMD GPUs, TP must be ≤ 4 (not 8). Kimi-K2.7-Code has 64 attention heads; the AITER MLA kernel requires heads_per_gpu % 16 == 0. With TP=4, each GPU gets 16 heads (valid). With TP=8, each GPU gets 8 heads (invalid).
AMD Docker Image: Use lmsysorg/sglang:v0.5.9-rocm700-mi35x for MI350X/MI355X and lmsysorg/sglang:v0.5.9-rocm700-mi30x for MI300X/MI325X.
DP Attention: Enable with --dp <N> --enable-dp-attention for production throughput. A common choice is to set --dp equal to --tp, but this is not required.
Reasoning Parser: Add --reasoning-parser kimi_k2 to separate thinking and content in model outputs.
Tool Call Parser: Add --tool-call-parser kimi_k2 for structured tool calls.
AMD FP8 KV Cache: On AMD platforms the generator adds --kv-cache-dtype fp8_e4m3 by default and sets --mem-fraction-static 0.8 to fit the INT4 weights plus KV cache. FP8 KV cache trades a small amount of accuracy for memory; omit the flag if you observe accuracy regressions on your workload.
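
The AMD TP constraint above comes down to simple arithmetic, sketched below. The function name aiter_mla_tp_ok is hypothetical, and the check covers only the head-divisibility rule stated in this section. It says nothing about whether a given TP size leaves enough memory for the weights.

```python
def aiter_mla_tp_ok(tp, num_heads=64, multiple=16):
    """Return True if heads_per_gpu is a whole number divisible by `multiple`."""
    if num_heads % tp != 0:
        return False  # heads cannot be split evenly across GPUs
    heads_per_gpu = num_heads // tp
    return heads_per_gpu % multiple == 0


for tp in (4, 8):
    print(f"TP={tp}: {64 // tp} heads per GPU, valid={aiter_mla_tp_ok(tp)}")
# TP=4: 16 heads per GPU, valid=True
# TP=8: 8 heads per GPU, valid=False
```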

4. Model Invocation

4.1 Basic Usage

See Basic API Usage.

4.2 Advanced Usage

4.2.1 Multimodal (Vision + Text) Input

Kimi-K2.7-Code supports native multimodal input with images:

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.7-Code",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                    }
                },
                {
                    "type": "text",
                    "text": "What is in this image? Describe it in detail."
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)

Output Example:

Output

This image shows a **paper receipt from Auntie Anne's**, the pretzel chain restaurant. Here's a detailed breakdown:

## Header
- At the top left is the Auntie Anne's logo (a pretzel with a halo)
- The store name "**Auntie Anne's**" is printed prominently at the top
- Some text below the store name appears blurred/redacted (likely store location, address, or transaction details)

## Purchase Details
- **Item**: CINNAMON SUGAR
- **Quantity & Price**: 1 × 17,000
- **Item Total**: 17,000

## Financial Summary
- **SUB TOTAL**: 17,000
- **GRAND TOTAL**: 17,000
- **CASH IDR**: 20,000 (customer paid 20,000 Indonesian Rupiah)
- **CHANGE DUE**: 3,000

## Physical Description
- The receipt is printed on white thermal paper
- Some information in the middle section and toward the bottom is intentionally blurred/obscured
- The paper appears slightly curved/wrinkled and is placed on a dark brown surface (likely a table or counter)

The transaction is in **Indonesian Rupiah (IDR)**, indicating this purchase was made at an Auntie Anne's location in Indonesia. The customer bought one Cinnamon Sugar pretzel for 17,000 IDR and received 3,000 IDR in change after paying with 20,000 IDR cash.
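
For images stored on disk rather than at a public URL, one common approach with OpenAI-compatible servers is to embed the file as a base64 data URL. The sketch below assumes the serving stack accepts data URLs in the image_url field; the helper names image_to_data_url and image_message are hypothetical.

```python
import base64
import mimetypes


def image_to_data_url(path):
    """Encode a local image file as a data URL for an image_url content part."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(path, question):
    """Build a user message with one local image followed by a text question."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
            {"type": "text", "text": question},
        ],
    }
```

The returned dictionary can be placed in the messages list in the same position as the URL-based message in the example above.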

4.2.2 Reasoning Output

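Kimi-K2.7-Code forces thinking mode and preserve-thinking behavior. Because thinking mode is the only supported mode, every response contains reasoning. When the server is launched with --reasoning-parser kimi_k2, the reasoning is returned separately from the final answer: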

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.7-Code",
    messages=[
        {"role": "user", "content": "Which one is bigger, 9.11 or 9.9? Think carefully."}
    ]
)

print("====== Reasoning Content (Thinking Mode) ======")
print(response.choices[0].message.reasoning_content)
print("====== Response (Thinking Mode) ======")
print(response.choices[0].message.content)

Output Example:

Output

====== Reasoning Content (Thinking Mode) ======
The user is asking which number is bigger: 9.11 or 9.9. This seems straightforward, but there's a viral internet debate about this due to decimal confusion.

Let me think carefully:
- 9.11 means 9 + 11/100 = 9.11
- 9.9 means 9 + 9/10 = 9.90

So 9.9 = 9.90, and 9.90 > 9.11 because 0.90 > 0.11.

The confusion often comes from people thinking of software versioning (where 9.11 comes after 9.9) or comparing the numbers after the decimal as whole numbers (11 vs 9, thinking 11 > 9).

So mathematically, 9.9 is clearly bigger. 9.9 - 9.11 = 0.79.

I should explain this clearly and address the common misconception.
====== Response (Thinking Mode) ======
Mathematically, **9.9 is bigger**.

Here's why:

**9.9 = 9.90**

When comparing decimals, you need to look at the same place values:
- 9.11 = 9 ones, 1 tenth, and 1 hundredth
- 9.9 = 9 ones, 9 tenths, and 0 hundredths (9.90)

Since **0.90 > 0.11**, it follows that **9.9 > 9.11**.

The difference is:
9.9 - 9.11 = 0.79

**Why people get confused:** Many mistakenly treat the decimals like whole numbers (thinking "11 is bigger than 9") or confuse this with software version numbering (where version 9.11 comes after version 9.9). But in standard mathematics, 9.9 is definitively larger.
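
When streaming, reasoning and answer text arrive as separate delta fields. The sketch below shows one way to accumulate both; it assumes the server emits reasoning_content on each delta, which depends on the serving stack and the reasoning parser being enabled. The function name collect_stream is hypothetical.

```python
def collect_stream(chunks):
    """Accumulate reasoning and answer text from streamed chat completion chunks."""
    reasoning_parts, content_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks (for example usage-only chunks) carry no choices
        delta = chunk.choices[0].delta
        # Assumed field name; some stacks expose `reasoning` instead.
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            content_parts.append(content)
    return "".join(reasoning_parts), "".join(content_parts)
```

Usage: pass the iterator returned by client.chat.completions.create(..., stream=True) and unpack the result as reasoning, answer = collect_stream(stream).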

4.2.3 Preserve Thinking

Kimi-K2.7-Code keeps reasoning content across multi-turn interactions. This behavior is enabled by default and cannot be disabled.

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

messages = [
    {
        "role": "user",
        "content": "Tell me three random numbers."
    },
    {
        "role": "assistant",
        "reasoning_content": "I'll start by listing five numbers: 473, 921, 235, 215, 222, and I'll tell you the first three.",
        "content": "473, 921, 235"
    },
    {
        "role": "user",
        "content": "What are the other two numbers you have in mind?"
    }
]

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.7-Code",
    messages=messages,
    stream=False,
    max_tokens=4096,
)

print(response.choices[0].message.content)

Some OpenAI-compatible deployments use reasoning instead of reasoning_content in assistant messages. Use the field your serving stack exposes.
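
In a live conversation the assistant turn comes from a previous response rather than being written by hand. A minimal sketch of carrying the reasoning forward is shown below; the helper name assistant_turn is hypothetical, and the reasoning field name is a parameter because it varies between serving stacks.

```python
def assistant_turn(message, reasoning_field="reasoning_content"):
    """Convert a response message into a history entry that keeps its reasoning."""
    entry = {"role": "assistant", "content": message.content}
    reasoning = getattr(message, reasoning_field, None)
    if reasoning:
        entry[reasoning_field] = reasoning
    return entry
```

Usage: after each call, run messages.append(assistant_turn(response.choices[0].message)) before appending the next user message, so the following request includes the earlier reasoning.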

4.2.4 Tool Calling

Kimi-K2.7-Code supports tool calling capabilities for agentic tasks:

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.7-Code",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    stream=True
)

# Process streaming response
tool_calls_accumulator = {}

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            for tool_call in delta.tool_calls:
                index = tool_call.index
                if index not in tool_calls_accumulator:
                    tool_calls_accumulator[index] = {'name': None, 'arguments': ''}
                if tool_call.function:
                    if tool_call.function.name:
                        tool_calls_accumulator[index]['name'] = tool_call.function.name
                    if tool_call.function.arguments:
                        tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments

        if delta.content:
            print(delta.content, end="", flush=True)

for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"Tool Call: {tool_call['name']}")
    print(f"  Arguments: {tool_call['arguments']}")

Output Example:

Output

Tool Call: get_weather
  Arguments: {"location": "Beijing"}

Handling Tool Call Results:

Example

# Continues from the previous example and reuses its `client`.
# Send tool result back to the model
messages = [
    {"role": "user", "content": "What's the weather in Beijing?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Beijing", "unit": "celsius"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": "The weather in Beijing is 22°C and sunny."
    }
]

final_response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.7-Code",
    messages=messages
)

print(final_response.choices[0].message.content)

Output Example:

Output

The weather in Beijing is currently **22°C and sunny**. ☀️

It's a nice, warm day there—great for being outdoors!
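
The two steps above (receiving a tool call, then returning its result) are usually connected by a small dispatch routine. The sketch below is one possible shape, not part of SGLang or the OpenAI SDK: TOOL_REGISTRY and run_tool_calls are hypothetical names, and get_weather is a placeholder that returns a fixed string instead of querying a real service.

```python
import json


def get_weather(location, unit="celsius"):
    """Placeholder implementation; a real tool would query a weather service."""
    return f"(placeholder) weather for {location} in {unit}"


# Maps tool names declared in `tools` to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each tool call and return the matching role="tool" messages."""
    results = []
    for call in tool_calls:
        func = registry.get(call.function.name)
        if func is None:
            output = f"Unknown tool: {call.function.name}"
        else:
            try:
                args = json.loads(call.function.arguments or "{}")
                output = func(**args)
            except (json.JSONDecodeError, TypeError) as exc:
                # Report malformed or mismatched arguments back to the model.
                output = f"Tool call failed: {exc}"
        results.append(
            {"role": "tool", "tool_call_id": call.id, "content": str(output)}
        )
    return results
```

Returning errors as tool messages, rather than raising, gives the model a chance to correct its arguments on the next turn.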

4.2.5 Multimodal + Tool Calling (Agentic Vision)

Combine vision understanding with tool calling for advanced agentic tasks:

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_product",
            "description": "Search for a product by name or description",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The product name or description to search for"
                    }
                },
                "required": ["query"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.7-Code",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Can you identify this product and search for similar items?"
                }
            ]
        }
    ],
    tools=tools
)

msg = response.choices[0].message

# Print reasoning process (the field may be absent if no reasoning parser is enabled)
reasoning = getattr(msg, "reasoning_content", None)
if reasoning:
    print("=== Reasoning ===")
    print(reasoning)

# Print response content
if msg.content:
    print("=== Content ===")
    print(msg.content)

# Print tool calls
if msg.tool_calls:
    print("=== Tool Calls ===")
    for tc in msg.tool_calls:
        print(f"  Function: {tc.function.name}")
        print(f"  Arguments: {tc.function.arguments}")

Output Example:

Output

=== Reasoning ===
The user wants me to identify the product from the receipt and search for similar items. Looking at the receipt, it's from Auntie Anne's and the item purchased is "CINNAMON SUGAR" for 17,000 IDR. This is likely a Cinnamon Sugar Pretzel from Auntie Anne's, which is a popular pretzel chain.

I should search for this product using the search_product function. The query should be something like "Auntie Anne's Cinnamon Sugar Pretzel" or just "Cinnamon Sugar Pretzel" to find similar items.
=== Content ===
Based on the receipt, the product is a **Cinnamon Sugar Pretzel** from **Auntie Anne's** (a popular pretzel bakery chain). The receipt shows it was purchased for 17,000 Indonesian Rupiah (IDR).

Let me search for this product and similar items for you.
=== Tool Calls ===
  Function: search_product
  Arguments: {"query":"Auntie Anne's Cinnamon Sugar Pretzel"}

4.2.6 Deployment Command Example

Deploy Kimi-K2.7-Code with the following command (H200/B300, reasoning and tool parsing enabled):

Command

sglang serve \
  --model-path moonshotai/Kimi-K2.7-Code \
  --tp 8 \
  --reasoning-parser kimi_k2 \
  --tool-call-parser kimi_k2 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000

For GB300, use --tp 4.
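
The per-platform differences described in this document can be collected into a small command builder, sketched below. It uses only flags that appear in this document; the interactive generator may emit additional options, and build_serve_command is a hypothetical helper, not an SGLang API.

```python
import shlex

# TP sizes listed under Configuration Tips for the native INT4 checkpoint.
TP_BY_PLATFORM = {
    "H200": 8, "B300": 8, "GB300": 4,
    "MI300X": 4, "MI325X": 4, "MI350X": 4, "MI355X": 4,
}
AMD_PLATFORMS = {"MI300X", "MI325X", "MI350X", "MI355X"}


def build_serve_command(platform, port=30000):
    """Return the launch command as an argument list for the given platform."""
    args = [
        "sglang", "serve",
        "--model-path", "moonshotai/Kimi-K2.7-Code",
        "--tp", str(TP_BY_PLATFORM[platform]),
        "--reasoning-parser", "kimi_k2",
        "--tool-call-parser", "kimi_k2",
        "--trust-remote-code",
        "--host", "0.0.0.0",
        "--port", str(port),
    ]
    if platform in AMD_PLATFORMS:
        # AMD defaults described under "AMD FP8 KV Cache".
        args += ["--kv-cache-dtype", "fp8_e4m3", "--mem-fraction-static", "0.8"]
    return args


print(shlex.join(build_serve_command("GB300")))
```

Building an argument list instead of a single string allows the result to be passed directly to subprocess.run without shell quoting concerns.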

5. Benchmark

The following results are from the official Kimi-K2.7-Code model card. They were evaluated with thinking mode enabled through Kimi Code CLI at temperature=1.0, top_p=0.95, and a 262,144-token context length unless otherwise stated.

Category	Benchmark	Kimi-K2.6	Kimi-K2.7-Code
Coding	Kimi Code Bench v2	50.9	62.0
Coding	Program Bench	48.3	53.6
Coding	MLS Bench Lite	26.7	35.1
Agentic	Kimi Claw 24/7 Bench	42.9	46.9
Agentic	MCP Atlas	69.4	76.0
Agentic	MCP Mark Verified	72.8	81.1
