Kimi-K2.6 - SGLang Documentation

Introduction

Kimi-K2.6 is an open-source, native multimodal agentic model developed by Moonshot AI, built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It is a Mixture-of-Experts (MoE) model with Multi-head Latent Attention (MLA), 1T total parameters, and 32B active parameters. The model integrates vision and language understanding with agentic capabilities, and supports both instant and thinking modes as well as conversational and agentic paradigms.

This document describes how to deploy Kimi-K2.6 on Ascend NPUs using SGLang, covering single-node PD mixed mode, feature configuration, and performance optimization. It was written and validated against SGLang v0.5.13, which fully supports Kimi-K2.6. To use the latest features (e.g., speculative decoding, multimodal), use v0.5.13 or a later version.

Supported features

Feature	Example usage
Tensor Parallelism	`--tp-size 16`
Data Parallelism	`--dp-size 16`
Expert Parallelism	`--ep-size 16 --moe-a2a-backend deepep --deepep-mode auto`
PD Disaggregation	`--disaggregation-mode prefill --disaggregation-transfer-backend ascend`
Quantization	`--quantization modelslim`
Chunked Prefill	set automatically based on device memory, or set an explicit value, e.g. `--chunked-prefill-size 32768`; disable with `--chunked-prefill-size -1`
NPU Graph	enabled by default; disable with `--disable-cuda-graph`; control the captured batch sizes via `--cuda-graph-bs` or `--cuda-graph-max-bs`, e.g. `--cuda-graph-bs 1 2 4 8 12 16 24 32 48 64 96 120`
Speculative Decoding	`--speculative-algorithm EAGLE3 --speculative-draft-model-path /path/to/draft-model-weights --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 --speculative-draft-model-quantization unquant`
Overlap Schedule	`export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1`
DP LM Head	`--enable-dp-lm-head`
MLAPO	`export SGLANG_NPU_USE_MLAPO=1`
Multistream MoE	`export SGLANG_NPU_USE_MULTI_STREAM=1`

The values in the Example usage column are for illustration only. Adjust them according to your hardware, deployment mode, and workload. For parameter details, see Feature descriptions; for recommended configurations for each deployment scenario, see Best practices.

For compatibility and conflicts between features, see Feature Compatibility.
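To illustrate how the table entries combine, here is a minimal sketch that sets the environment-variable features and wraps a launch command using a few of the listed flags. The wrapper name `launch_kimi_k26` and the `MODEL_PATH`, `HOST`, and `PORT` defaults are placeholders chosen for this sketch, not values from this guide's recommended configurations. It assumes the standard `python3 -m sglang.launch_server` entry point and has not been checked against Feature Compatibility; a real Ascend launch will likely need additional options from the Best Practice page.

```shell
# Environment-variable features from the table (illustrative values).
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1   # Overlap Schedule
export SGLANG_NPU_USE_MLAPO=1                # MLAPO
export SGLANG_NPU_USE_MULTI_STREAM=1         # Multistream MoE

# Hypothetical wrapper; defining it does not start the server.
# MODEL_PATH, HOST, and PORT are placeholders chosen for this sketch.
launch_kimi_k26() {
    python3 -m sglang.launch_server \
        --model-path "${MODEL_PATH:-/models/Kimi-K2.6-w4a8}" \
        --host "${HOST:-0.0.0.0}" \
        --port "${PORT:-6689}" \
        --tp-size 16 \
        --quantization modelslim \
        --chunked-prefill-size 32768
}
```

Calling `launch_kimi_k26` inside the container would start the server with these settings; treat the flag values as starting points only.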

Prerequisites

Model weights

If you need to download model weights, check the model size before downloading to reserve enough space.

Kimi-K2.6 (BF16)
Kimi-K2.6-w4a8 (W4A8 quantized version)
kimi-k2.6-eagle3 (EAGLE3 draft model for speculative decoding)

You can produce Kimi-K2.6-w4a8 by quantizing Kimi-K2.6 with msmodelslim. Before deployment, ensure that the available device memory exceeds the model weight size. For optimal throughput and latency, refer to the best practice configurations, which may require additional cards. We recommend downloading the model weights to a directory shared across nodes.
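As a rough back-of-the-envelope check of the weight footprint, the sketch below multiplies the parameter count by the bits per weight. The helper name `estimate_weight_gb` is hypothetical, and the estimate assumes every parameter is stored at the stated width. It ignores layers kept in higher precision, KV cache, and activations, so the actual checkpoint size shown on the download page takes precedence.

```shell
# Rough weight footprint in GB (decimal): params (billions) * bits per weight / 8.
# Assumption: every parameter is stored at the same width, which real
# quantized checkpoints usually do not satisfy exactly.
estimate_weight_gb() {
    echo $(( $1 * $2 / 8 ))
}

estimate_weight_gb 1000 16   # 1T parameters in BF16 (16 bits per weight)
estimate_weight_gb 1000 4    # 1T parameters with 4-bit weights (W4A8)
```

Under these assumptions the two calls print 2000 and 500, that is, roughly 2 TB for BF16 and 0.5 TB for 4-bit weights, before any runtime memory is counted.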

Installation

Ensure sufficient disk space before pulling images. The Docker image requires at least 30 GB of free space.
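The free-space requirement can be checked before pulling, as in the following sketch. The helper name `has_free_space` is hypothetical, and `/var/lib/docker` in the usage line is Docker's default data root on Linux, which is an assumption about your host; substitute your own path if Docker is configured differently.

```shell
# Succeed if the filesystem holding DIR has at least REQUIRED_GB GiB available.
has_free_space() {
    _dir="$1"
    _required_gb="$2"
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # column 4 of the second line is the available space.
    _avail_kb=$(df -Pk "$_dir" | awk 'NR == 2 { print $4 }')
    [ "$_avail_kb" -ge $((_required_gb * 1024 * 1024)) ]
}

# Example (assumed data root):
# has_free_space /var/lib/docker 30 || echo "Less than 30 GB free for the image"
```

The same helper can be pointed at the model directory to confirm there is room for the weights.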

The dependencies required for the NPU runtime environment are bundled into a Docker image published at quay.io/ascend/sglang, which you can pull directly. Both stable releases and daily builds are available. The commands below use the stable release tag. For details, see Docker image versions.

Atlas 800I A3

docker pull quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-a3

docker run -itd --shm-size=16g --name ${NAME} \
--privileged=true --net=host \
-v /var/queue_schedule:/var/queue_schedule \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--device=/dev/davinci0:/dev/davinci0  \
--device=/dev/davinci1:/dev/davinci1  \
--device=/dev/davinci2:/dev/davinci2  \
--device=/dev/davinci3:/dev/davinci3  \
--device=/dev/davinci4:/dev/davinci4  \
--device=/dev/davinci5:/dev/davinci5  \
--device=/dev/davinci6:/dev/davinci6  \
--device=/dev/davinci7:/dev/davinci7  \
--device=/dev/davinci8:/dev/davinci8  \
--device=/dev/davinci9:/dev/davinci9  \
--device=/dev/davinci10:/dev/davinci10  \
--device=/dev/davinci11:/dev/davinci11  \
--device=/dev/davinci12:/dev/davinci12  \
--device=/dev/davinci13:/dev/davinci13  \
--device=/dev/davinci14:/dev/davinci14  \
--device=/dev/davinci15:/dev/davinci15  \
--device=/dev/davinci_manager:/dev/davinci_manager \
--device=/dev/hisi_hdc:/dev/hisi_hdc \
--entrypoint=bash \
quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-a3

Atlas 800I A2

docker pull quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-910b

docker run -itd --shm-size=16g --name ${NAME} \
--privileged=true --net=host \
-v /var/queue_schedule:/var/queue_schedule \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--device=/dev/davinci0:/dev/davinci0  \
--device=/dev/davinci1:/dev/davinci1  \
--device=/dev/davinci2:/dev/davinci2  \
--device=/dev/davinci3:/dev/davinci3  \
--device=/dev/davinci4:/dev/davinci4  \
--device=/dev/davinci5:/dev/davinci5  \
--device=/dev/davinci6:/dev/davinci6  \
--device=/dev/davinci7:/dev/davinci7  \
--device=/dev/davinci_manager:/dev/davinci_manager \
--device=/dev/hisi_hdc:/dev/hisi_hdc \
--entrypoint=bash \
quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-910b

If the model weights have already been downloaded to a shared directory, use -v to mount the model path into the container, for example: -v /path/to/models:/models.
Replace ${NAME} with your own container name, or remove --name to use the default name.
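The repeated --device lines in the commands above follow a fixed pattern, so they can be generated instead of typed out. The sketch below is one way to do that; the helper name `davinci_device_flags` is hypothetical, and it assumes the NPU device nodes are numbered contiguously from /dev/davinci0 as in the commands above.

```shell
# Print one --device flag per NPU for device indices 0..COUNT-1,
# separated by spaces, matching the pattern used in the docker run commands.
davinci_device_flags() {
    _count="$1"
    _i=0
    while [ "$_i" -lt "$_count" ]; do
        printf '%s ' "--device=/dev/davinci${_i}:/dev/davinci${_i}"
        _i=$((_i + 1))
    done
}

# Example (not executed here): use 16 for the A3 command and 8 for the A2 command.
# The substitution is left unquoted on purpose so each flag becomes its own argument.
# docker run -itd --shm-size=16g --name ${NAME} ... $(davinci_device_flags 16) ...
```

The remaining options (volume mounts, /dev/davinci_manager, /dev/hisi_hdc, entrypoint, and image) stay as written in the full commands.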

Online service deployment

Single-node online deployment

Single-node deployment completes both prefill and decode within the same node (PD mixed mode), suitable for scenarios with limited hardware resources. This scenario is already covered in the best practice. For the complete, optimized deployment commands and benchmark data, see Kimi K2.6 Best Practice — PD Mixed On A3.

Multi-node online deployment

Multi-node deployment distributes the model across multiple Atlas 800I A3 nodes using tensor parallelism while keeping prefill and decode on the same nodes (PD mixed mode), suitable for scenarios that need more device memory than a single node can provide. This scenario is already covered in the best practice. For the complete, optimized deployment commands and benchmark data, see Kimi-K2.6 Best Practice — Multi-node On A3.

Multi-node PD disaggregation deployment

PD disaggregation splits the prefill and decode stages onto separate nodes, reducing interference and improving throughput for high-concurrency scenarios. This scenario is already covered in the best practice. For the complete, optimized deployment commands and benchmark data, see Kimi-K2.6 Best Practice — PD Disaggregation On A3.

Functional verification

After the service is started, you can invoke the model by sending a prompt:

# ============================================================
# Before running, update the following variables:
#   HOST: the server host address (e.g., localhost)
#   PORT: the server port number (e.g., 6689)
# ============================================================

curl http://${HOST}:${PORT}/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "What is the capital of France?",
        "sampling_params": {
            "max_new_tokens": 64,
            "temperature": 0
        }
    }'

The server is ready to accept requests once it prints “The server is fired up and ready to roll!” in the logs. Expected result: an HTTP 200 response whose generated text contains “Paris”. For more testing examples (Health Check, Generate, Chat Completions, and port usage guidance), see Testing the Service.
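Because loading the model can take a while, a script may want to wait for the server before sending the first prompt. The following sketch retries an arbitrary probe command until it succeeds. The helper name `retry_until_ok` is hypothetical, and the `/health` path in the usage comment is an assumption about the health check endpoint; confirm it in Testing the Service.

```shell
# Run a probe command up to ATTEMPTS times, sleeping DELAY seconds between
# failed tries. Returns 0 as soon as the probe succeeds, 1 if it never does.
retry_until_ok() {
    _attempts="$1"
    _delay="$2"
    shift 2
    _try=1
    while [ "$_try" -le "$_attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Sleep only when another attempt remains.
        if [ "$_try" -lt "$_attempts" ]; then
            sleep "$_delay"
        fi
        _try=$((_try + 1))
    done
    return 1
}

# Example (not executed here): poll every 5 seconds, up to 120 times.
# retry_until_ok 120 5 curl -sf "http://${HOST}:${PORT}/health" \
#     || echo "Server did not become ready in time"
```

Once the probe succeeds, the curl request shown earlier can be sent as is.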

Accuracy evaluation

For accuracy evaluation methods and datasets, see Accuracy Evaluation on Ascend NPU.

Performance

For performance data and benchmark commands, see Performance Testing on Ascend NPU.

Best practices

Best practice configuration reference

For complete optimal configurations with deployment scripts and benchmark commands, see the Kimi-K2.6 Best Practice page.

Performance tuning

For the full list of supported features, see Supported features. For detailed optimization guidance, see Optimization on Ascend NPU.

FAQ

For common environment, installation, and general parameter issues, please refer to the Ascend NPU FAQ. This section only covers model-specific issues.
