1. Model Introduction
Kimi-K2.7-Code is a coding-focused agentic model by Moonshot AI, built on top of Kimi-K2.6. It improves real-world long-horizon coding task completion while reducing thinking-token usage by approximately 30% compared with Kimi-K2.6. Key Features:- Coding-Focused Agentic Model: Optimized for end-to-end coding workflows and complex software engineering tasks.
- Token Efficiency: Reduces thinking-token usage by approximately 30% versus Kimi-K2.6.
- K2.6-Compatible Deployment: Shares the same architecture as Kimi-K2.5/Kimi-K2.6, so the SGLang deployment method can be reused with the new model ID.
- Native Multimodality: Shares Kimi-K2.6’s native multimodal architecture with a MoonViT vision encoder (400M parameters) and supports image and video (experimental) input.
| Benchmark | Kimi-K2.6 | Kimi-K2.7-Code |
|---|---|---|
| Kimi Code Bench v2 | 50.9 | 62.0 |
| Program Bench | 48.3 | 53.6 |
| MLS Bench Lite | 26.7 | 35.1 |
| Kimi Claw 24/7 Bench | 42.9 | 46.9 |
| MCP Atlas | 69.4 | 76.0 |
| MCP Mark Verified | 72.8 | 81.1 |
- Thinking Mode:
temperature=1.0,top_p=0.95 - Kimi-K2.7-Code forces thinking and preserve-thinking behavior; instant mode is not supported.
- INT4 (native checkpoint): moonshotai/Kimi-K2.7-Code
2. SGLang Installation
Refer to the official SGLang installation guide.3. Model Deployment
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, deployment strategy, and capabilities.3.2 Configuration Tips
- Memory: Requires GPUs with ≥140GB each. The native INT4 checkpoint supports H200 (8×, TP=8), B300 (8×, TP=8), GB300 (4×, TP=4), MI300X/MI325X (4×, TP=4), and MI350X/MI355X (4×, TP=4). Use
--context-length 128000to conserve memory. - Context Length: The model supports a 256K context length. Use a shorter
--context-lengthwhen you need to reserve memory for larger batches. - Transformers Version: The model card requires
transformers>=4.57.1,<5.0.0. - AMD GPU TP Constraint: On AMD GPUs, TP must be ≤ 4 (not 8). Kimi-K2.7-Code has 64 attention heads; the AITER MLA kernel requires
heads_per_gpu % 16 == 0. With TP=4, each GPU gets 16 heads (valid). With TP=8, each GPU gets 8 heads (invalid). - AMD Docker Image: Use
lmsysorg/sglang:v0.5.9-rocm700-mi35xfor MI350X/MI355X andlmsysorg/sglang:v0.5.9-rocm700-mi30xfor MI300X/MI325X. - DP Attention: Enable with
--dp <N> --enable-dp-attentionfor production throughput. A common choice is to set--dpequal to--tp, but this is not required. - Reasoning Parser: Add
--reasoning-parser kimi_k2to separate thinking and content in model outputs. - Tool Call Parser: Add
--tool-call-parser kimi_k2for structured tool calls. - AMD FP8 KV Cache: On AMD platforms the generator adds
--kv-cache-dtype fp8_e4m3by default and sets--mem-fraction-static 0.8to fit the INT4 weights plus KV cache. FP8 KV cache trades a small amount of accuracy for memory; omit the flag if you observe accuracy regressions on your workload.
4. Model Invocation
4.1 Basic Usage
See Basic API Usage.4.2 Advanced Usage
4.2.1 Multimodal (Vision + Text) Input
Kimi-K2.7-Code supports native multimodal input with images:Example
Output
4.2.2 Reasoning Output
Kimi-K2.7-Code forces thinking mode and preserve-thinking behavior. Thinking Mode (default) — reasoning content is automatically separated:Example
Output
4.2.3 Preserve Thinking
Kimi-K2.7-Code keeps reasoning content across multi-turn interactions. This behavior is enabled by default and cannot be disabled.Example
reasoning instead of reasoning_content in assistant messages. Use the field your serving stack exposes.
4.2.4 Tool Calling
Kimi-K2.7-Code supports tool calling capabilities for agentic tasks:Example
Output
Example
Output
4.2.5 Multimodal + Tool Calling (Agentic Vision)
Combine vision understanding with tool calling for advanced agentic tasks:Example
Output
4.2.6 Deployment Command Example
Deploy Kimi-K2.7-Code with the following command (H200/B300, reasoning and tool parsing enabled):Command
--tp 4.
5. Benchmark
The following results are from the official Kimi-K2.7-Code model card. They were evaluated with thinking mode enabled through Kimi Code CLI attemperature=1.0, top_p=0.95, and a 262,144-token context length unless otherwise stated.
| Category | Benchmark | Kimi-K2.6 | Kimi-K2.7-Code |
|---|---|---|---|
| Coding | Kimi Code Bench v2 | 50.9 | 62.0 |
| Coding | Program Bench | 48.3 | 53.6 |
| Coding | MLS Bench Lite | 26.7 | 35.1 |
| Agentic | Kimi Claw 24/7 Bench | 42.9 | 46.9 |
| Agentic | MCP Atlas | 69.4 | 76.0 |
| Agentic | MCP Mark Verified | 72.8 | 81.1 |
