AMD GPU Support
1. Model Introduction
Kimi Linear is a hybrid linear attention architecture that outperforms traditional full attention methods across various contexts, including short, long, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA)—a refined version of Gated DeltaNet that introduces a more efficient gating mechanism to optimize the use of finite-state RNN memory. This generation delivers comprehensive upgrades across the board: Kimi Delta Attention (KDA): A linear attention mechanism that refines the gated delta rule with finegrained gating. Hybrid Architecture: A 3:1 KDA-to-global MLA ratio reduces memory usage while maintaining or surpassing the quality of full attention. Superior Performance: Outperforms full attention in a variety of tasks, including long-context and RL-style benchmarks on 1.4T token training runs with fair comparisons. High Throughput: Achieves up to 6× faster decoding and significantly reduces time per output token (TPOT). For more details, please refer to the [official Kimi Linear GitHub Repository]: https://github.com/MoonshotAI/Kimi-Linear2. SGLang Installation
SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. Please refer to the official SGLang installation guide for installation instructions.3. Model Deployment
This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels.3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities.4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:4.2 Advanced Usage
4.2.1 Launch the docker
Command
Command
4.2.2 pre-installation steps inside the docker
Command
4.2.3 Launch the server
Command
5. Benchmark
5.1 Speed Benchmark
Test Environment: Hardware: AMD MI300X GPU Model: Kimi-Linear-48B-A3B-Instruct Tensor Parallelism: 4 sglang version: 0.5.7- Model Deployment
Command
5.1.1 Low Concurrency (Latency-Optimized)
- Benchmark Command:
Command
- Test Results:
Output
5.1.2 Medium Concurrency (Balanced)
- Benchmark Command:
Command
- Test Results:
Output
5.1.3 High Concurrency (Throughput-Optimized)
- Benchmark Command:
Command
- Test Results:
Output
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Server Command
Command
- Benchmark Command
Command
- Result:
Output
