This guide presents best-practice deployment and benchmark configurations for Kimi-K2.6 on Ascend NPUs.

Low Latency

Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration
Kimi-K2.6 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 20ms | W4A8 INT8 | Optimal Configuration

High Throughput

Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration
Kimi-K2.6 | Atlas 800I A3 | 16 | PD Mixed | 64K+1K | 100ms | W4A8 INT8 | Optimal Configuration
Kimi-K2.6 | Atlas 800I A3 | 16 | PD Disaggregation | 128K+1K | 100ms | W4A8 INT8 | Optimal Configuration
Kimi-K2.6 | Atlas 800I A3 | 16 | PD Disaggregation | 128K+1K (90% prefix cache hit rate) | 100ms | W4A8 INT8 | Optimal Configuration
Kimi-K2.6 | Atlas 800I A3 | 16 | PD Disaggregation | 64K+1.5K | 100ms | W4A8 INT8 | Optimal Configuration
Kimi-K2.6 | Atlas 800I A3 | 16 | PD Disaggregation | 64K+1.5K (90% prefix cache hit rate) | 100ms | W4A8 INT8 | Optimal Configuration
Kimi-K2.6 | Atlas 800I A3 | 24 | PD Disaggregation | 128K+1K (90% prefix cache hit rate) | 100ms | W4A8 INT8 | Optimal Configuration
Kimi-K2.6 | Atlas 800I A3 | 24 | PD Disaggregation | 64K+1.5K (90% prefix cache hit rate) | 100ms | W4A8 INT8 | Optimal Configuration
Kimi-K2.6 | Atlas 800I A3 | 8 | PD Mixed | 1024x1024 (30)+1024 | 50ms | W4A8 INT8 | Optimal Configuration
Kimi-K2.6 | Atlas 800I A3 | 8 | PD Mixed | 1080p_30+256 | 50ms | W4A8 INT8 | Optimal Configuration
Kimi-K2.6 | Atlas 800I A3 | 8 | PD Mixed | 3.5K+1.5K | 50ms | W4A8 INT8 | Optimal Configuration
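
The TPOT column above is the per-token decode latency target for each configuration. As a minimal sketch, assuming the common definition of TPOT (time spent after the first token, divided by the remaining output tokens), the value for a single request can be computed as follows. The function name and the sample numbers are hypothetical and are not measurements from this guide.

```python
def tpot_ms(e2e_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Mean time per output token in milliseconds, excluding the first token.

    Assumes the common definition: (end-to-end latency - TTFT) / (tokens - 1).
    """
    if output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (e2e_latency_s - ttft_s) / (output_tokens - 1) * 1000.0


# Hypothetical request: 1.2 s to first token, 21.66 s in total, 1024 output tokens.
print(round(tpot_ms(21.66, 1.2, 1024), 1))
```

A request meets a 20ms or 100ms target when this value stays at or below the figure in the TPOT column.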

Optimal Configuration

Kimi-K2.6 W4A8 16P IN64K OUT1K 100ms

Model: Kimi-K2.6 | Hardware: Atlas 800I A3 | Cards: 16 | Deploy Mode: PD Mixed | Quantization: W4A8 INT8 | Dataset: 64K+1K | TPOT: 100ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   DRAFT_MODEL_PATH: path to the draft model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
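#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# Note: the launch command below sets --nnodes 2 but does not show the
#   per-node --node-rank and --dist-init-addr flags (see the 24-card
#   decode command in this guide); set them for your cluster.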
# ============================================================

MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=4400
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --trust-remote-code \
    --attention-backend ascend \
    --device npu \
    --quantization modelslim \
    --dtype bfloat16 \
    --tp-size 32 \
    --nnodes 2 \
    --mem-fraction-static 0.55 \
    --max-running-requests 32 \
    --chunked-prefill-size 262144 \
    --context-length 75000 \
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --sampling-backend ascend \
    --enable-dp-attention \
    --dp-size 32 \
    --moe-a2a-backend deepep \
    --deepep-mode auto \
    --cuda-graph-bs 1 \
    --disable-radix-cache \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path $DRAFT_MODEL_PATH \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --speculative-draft-model-quantization unquant
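
The speculative decoding flags above use --speculative-eagle-topk 1 with --speculative-num-steps 3 and --speculative-num-draft-tokens 4, and the 24-card decode command in this guide pairs 1 step with 2 draft tokens. Both follow the pattern draft tokens = steps + 1. The sketch below encodes that observed pattern for a top-k 1 (linear chain) draft; it is an assumption drawn from the values in this guide, not a statement of SGLang's internal rule, and the function name is hypothetical.

```python
def chain_draft_tokens(num_steps: int, topk: int = 1) -> int:
    """Draft-token count matching the top-k 1 configurations in this guide.

    Assumption: with a linear draft chain (top-k 1), the draft-token count
    equals the number of draft steps plus one.
    """
    if topk != 1:
        raise ValueError("this sketch only covers top-k 1 (linear chain) drafts")
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return num_steps + 1


# Values taken from the launch commands in this guide.
print(chain_draft_tokens(3))  # PD Mixed, 16 cards
print(chain_draft_tokens(1))  # PD Disaggregation decode, 24 cards
```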

Benchmark

This configuration was benchmarked with the random dataset.
Command
python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 32 \
    --random-input-len 64000 \
    --random-output-len 1000 \
    --num-prompts 32 \
    --random-range-ratio 1

Kimi-K2.6 W4A8 1P1D 16P IN128K OUT1K 100ms

Model: Kimi-K2.6 | Hardware: Atlas 800I A3 | Cards: 16 | Deploy Mode: PD Disaggregation | Quantization: W4A8 INT8 | Dataset: 128K+1K | TPOT: 100ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   P_IP: prefill node IP address
#   D_IP: decode node IP address
#   ASCEND_MF_STORE_URL: prefill node IP with port
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================


echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=60
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

P_IP=('<your prefill ip>')
D_IP=('<your decode ip>')

export ASCEND_MF_STORE_URL="tcp://<your prefill ip>:24670"

MODEL_PATH=/path/to/model-weights

LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=8
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
        export SGLANG_ZBAL_BOOTSTRAP_URL=tcp://127.0.0.1:24699
        export SGLANG_ZBAL_LOCAL_MEM_SIZE=61184
        export ZBAL_ENABLE_GRAPH=1
        export ZBAL_HCCL_OP=send,recv
        export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode prefill \
        --host ${P_IP[$i]} \
        --port 8000 \
        --disaggregation-bootstrap-port 8998 \
        --node-rank 0 \
        --quantization modelslim \
        --dtype bfloat16 \
        --disaggregation-transfer-backend ascend \
        --nnodes 1 \
        --trust-remote-code \
        --attention-backend ascend \
        --device npu \
        --tp-size 16 \
        --disable-radix-cache \
        --disable-cuda-graph \
        --mem-fraction-static 0.78 \
        --max-running-requests 1 \
        --moe-a2a-backend deepep \
        --deepep-mode auto \
        --chunked-prefill-size 16384 \
        --prefill-max-requests 1 \
        --max-prefill-tokens 131072 \
        --enable-multimodal \
        --mm-attention-backend ascend_attn \
        --sampling-backend ascend
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export SGLANG_NPU_USE_MLAPO=1
        export SGLANG_NPU_USE_MULTI_STREAM=1

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode decode \
        --host ${D_IP[$i]} \
        --port 8001 \
        --quantization modelslim \
        --dtype bfloat16 \
        --disaggregation-transfer-backend ascend \
        --nnodes 1 \
        --trust-remote-code \
        --attention-backend ascend \
        --device npu \
        --tp-size 16 \
        --mem-fraction-static 0.82 \
        --max-running-requests 1 \
        --enable-dp-attention \
        --dp-size 1 \
        --enable-dp-lm-head \
        --disable-radix-cache \
        --enable-multimodal \
        --mm-attention-backend ascend_attn \
        --sampling-backend ascend \
        --moe-a2a-backend deepep \
        --deepep-mode auto \
        --cuda-graph-bs 1 2 4 6 8 12 16
        NODE_RANK=$i
        break
    fi
done
Command
# ============================================================
# Before running, replace the following placeholders:
#   <your prefill ip>: prefill node IP address
#   <your decode ip>: decode node IP address
# ============================================================

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://<your prefill ip>:8000 8998 \
    --decode http://<your decode ip>:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --policy cache_aware
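
The deployment script above decides which server to start by comparing the host's own addresses with the P_IP and D_IP arrays. As a minimal sketch of that matching logic, the hypothetical helper below returns the role and the index within the matching array. The IP addresses are documentation placeholders, and the helper only illustrates the selection step; it does not launch anything.

```python
from typing import Optional, Sequence, Tuple


def select_role(
    local_ips: Sequence[str],
    prefill_ips: Sequence[str],
    decode_ips: Sequence[str],
) -> Optional[Tuple[str, int]]:
    """Return (role, index) for the first configured IP owned by this host.

    local_ips plays the part of LOCAL_HOST1 and LOCAL_HOST2 in the script;
    prefill_ips and decode_ips correspond to P_IP and D_IP.
    """
    local = set(local_ips)
    for index, ip in enumerate(prefill_ips):
        if ip in local:
            return ("prefill", index)
    for index, ip in enumerate(decode_ips):
        if ip in local:
            return ("decode", index)
    return None  # this host is not listed in either array


# Placeholder addresses from the documentation range 192.0.2.0/24.
print(select_role(["192.0.2.11", "10.0.0.5"], ["192.0.2.10"], ["192.0.2.11"]))
```

A host that appears in neither array gets None, which mirrors the script starting no server on that machine.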

Benchmark

This configuration was benchmarked with the random dataset.
Command
python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 1 \
    --random-input-len 131072 \
    --random-output-len 1024 \
    --num-prompts 1 \
    --random-range-ratio 1 \
    --request-rate inf

Kimi-K2.6 W4A8 1P1D 16P IN128K OUT1K PREFIX90 100ms

Model: Kimi-K2.6 | Hardware: Atlas 800I A3 | Cards: 16 | Deploy Mode: PD Disaggregation | Quantization: W4A8 INT8 | Dataset: 128K+1K (90% prefix cache hit rate) | TPOT: 100ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   P_IP: prefill node IP address
#   D_IP: decode node IP address
#   ASCEND_MF_STORE_URL: prefill node IP with port
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================


echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=60
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

P_IP=('<your prefill ip>')
D_IP=('<your decode ip>')

export ASCEND_MF_STORE_URL="tcp://<your prefill ip>:24670"

MODEL_PATH=/path/to/model-weights

LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=8
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
        export SGLANG_ZBAL_BOOTSTRAP_URL=tcp://127.0.0.1:24699
        export SGLANG_ZBAL_LOCAL_MEM_SIZE=61184
        export ZBAL_ENABLE_GRAPH=1
        export ZBAL_HCCL_OP=send,recv
        export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode prefill \
        --host ${P_IP[$i]} \
        --port 8000 \
        --disaggregation-bootstrap-port 8998 \
        --node-rank 0 \
        --quantization modelslim \
        --dtype bfloat16 \
        --disaggregation-transfer-backend ascend \
        --nnodes 1 \
        --trust-remote-code \
        --attention-backend ascend \
        --device npu \
        --tp-size 16 \
        --mem-fraction-static 0.78 \
        --max-running-requests 2 \
        --moe-a2a-backend deepep \
        --deepep-mode auto \
        --chunked-prefill-size 16384 \
        --prefill-max-requests 2 \
        --max-prefill-tokens 65536 \
        --enable-multimodal \
        --mm-attention-backend ascend_attn \
        --sampling-backend ascend
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export SGLANG_NPU_USE_MLAPO=1
        export SGLANG_NPU_USE_MULTI_STREAM=1

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode decode \
        --host ${D_IP[$i]} \
        --port 8001 \
        --quantization modelslim \
        --dtype bfloat16 \
        --disaggregation-transfer-backend ascend \
        --nnodes 1 \
        --trust-remote-code \
        --attention-backend ascend \
        --device npu \
        --tp-size 16 \
        --mem-fraction-static 0.82 \
        --max-running-requests 2 \
        --enable-dp-attention \
        --dp-size 2 \
        --enable-dp-lm-head \
        --disable-radix-cache \
        --enable-multimodal \
        --mm-attention-backend ascend_attn \
        --sampling-backend ascend \
        --moe-a2a-backend deepep \
        --deepep-mode auto \
        --cuda-graph-bs 1 2 4 6 8 12
        NODE_RANK=$i
        break
    fi
done
Command
# ============================================================
# Before running, replace the following placeholders:
#   <your prefill ip>: prefill node IP address
#   <your decode ip>: decode node IP address
# ============================================================

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://<your prefill ip>:8000 8998 \
    --decode http://<your decode ip>:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --policy cache_aware

Benchmark

This configuration was benchmarked with the generated-shared-prefix dataset at a 90% prefix cache hit rate (repeat_rate = 0.9). The flags are derived from the 131072-token input as follows:
--gsp-system-prompt-len 117964 = int(131072 * 0.9) is the length of the shared prefix.
--gsp-question-len 13107 = int(131072 * (1 - 0.9)) is the length of the unique per-request suffix.
--gsp-num-groups 1 keeps all requests in one prefix group for maximum cache reuse.
Command
python -m sglang.bench_serving \
    --dataset-name generated-shared-prefix \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --gsp-num-groups 1 \
    --gsp-prompts-per-group 8 \
    --gsp-system-prompt-len 117964 \
    --gsp-question-len 13107 \
    --gsp-output-len 1024 \
    --max-concurrency 2 \
    --num-prompts 8 \
    --request-rate inf
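
The two length flags in the command above come from splitting the total input length by the repeat rate. The following sketch reproduces that arithmetic using the int(...) expressions quoted in the benchmark description; the function name is hypothetical.

```python
from typing import Tuple


def split_shared_prefix(total_input_len: int, repeat_rate: float) -> Tuple[int, int]:
    """Split an input length into (shared prefix, unique suffix) token counts.

    Both parts are truncated with int(), as in the benchmark description.
    """
    system_len = int(total_input_len * repeat_rate)
    question_len = int(total_input_len * (1 - repeat_rate))
    return system_len, question_len


system_len, question_len = split_shared_prefix(131072, 0.9)
print(system_len, question_len)   # 117964 13107
print(system_len + question_len)  # 131071
```

Because both parts are truncated, the two lengths add up to 131071, one token short of the nominal 131072.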

Kimi-K2.6 W4A8 1P1D 16P IN64K OUT1K5 100ms

Model: Kimi-K2.6 | Hardware: Atlas 800I A3 | Cards: 16 | Deploy Mode: PD Disaggregation | Quantization: W4A8 INT8 | Dataset: 64K+1.5K | TPOT: 100ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   P_IP: prefill node IP address
#   D_IP: decode node IP address
#   ASCEND_MF_STORE_URL: prefill node IP with port
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================


echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=60
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

P_IP=('<your prefill ip>')
D_IP=('<your decode ip>')

export ASCEND_MF_STORE_URL="tcp://<your prefill ip>:24670"

MODEL_PATH=/path/to/model-weights

LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=8
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
        export SGLANG_ZBAL_BOOTSTRAP_URL=tcp://127.0.0.1:24699
        export SGLANG_ZBAL_LOCAL_MEM_SIZE=61184
        export ZBAL_ENABLE_GRAPH=1
        export ZBAL_HCCL_OP=send,recv
        export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode prefill \
        --host ${P_IP[$i]} \
        --port 8000 \
        --disaggregation-bootstrap-port 8998 \
        --node-rank 0 \
        --quantization modelslim \
        --dtype bfloat16 \
        --disaggregation-transfer-backend ascend \
        --nnodes 1 \
        --trust-remote-code \
        --attention-backend ascend \
        --device npu \
        --tp-size 16 \
        --disable-radix-cache \
        --disable-cuda-graph \
        --mem-fraction-static 0.78 \
        --max-running-requests 1 \
        --moe-a2a-backend deepep \
        --deepep-mode auto \
        --chunked-prefill-size 16384 \
        --prefill-max-requests 1 \
        --max-prefill-tokens 65536 \
        --enable-multimodal \
        --mm-attention-backend ascend_attn \
        --sampling-backend ascend
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export SGLANG_NPU_USE_MLAPO=1
        export SGLANG_NPU_USE_MULTI_STREAM=1

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode decode \
        --host ${D_IP[$i]} \
        --port 8001 \
        --quantization modelslim \
        --dtype bfloat16 \
        --disaggregation-transfer-backend ascend \
        --nnodes 1 \
        --trust-remote-code \
        --attention-backend ascend \
        --device npu \
        --tp-size 16 \
        --mem-fraction-static 0.82 \
        --max-running-requests 16 \
        --enable-dp-attention \
        --dp-size 1 \
        --enable-dp-lm-head \
        --disable-radix-cache \
        --enable-multimodal \
        --mm-attention-backend ascend_attn \
        --sampling-backend ascend \
        --moe-a2a-backend deepep \
        --deepep-mode auto \
        --cuda-graph-bs 16
        NODE_RANK=$i
        break
    fi
done
Command
# ============================================================
# Before running, replace the following placeholders:
#   <your prefill ip>: prefill node IP address
#   <your decode ip>: decode node IP address
# ============================================================

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://<your prefill ip>:8000 8998 \
    --decode http://<your decode ip>:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --policy cache_aware

Benchmark

This configuration was benchmarked with the random dataset.
Command
python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 1 \
    --random-input-len 65536 \
    --random-output-len 1536 \
    --num-prompts 1 \
    --random-range-ratio 1 \
    --request-rate inf

Kimi-K2.6 W4A8 1P1D 16P IN64K OUT1K5 PREFIX90 100ms

Model: Kimi-K2.6 | Hardware: Atlas 800I A3 | Cards: 16 | Deploy Mode: PD Disaggregation | Quantization: W4A8 INT8 | Dataset: 64K+1.5K (90% prefix cache hit rate) | TPOT: 100ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   P_IP: prefill node IP address
#   D_IP: decode node IP address
#   ASCEND_MF_STORE_URL: prefill node IP with port
#   MODEL_PATH: path to the model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================


echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=60
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

P_IP=('<your prefill ip>')
D_IP=('<your decode ip>')

export ASCEND_MF_STORE_URL="tcp://<your prefill ip>:24670"

MODEL_PATH=/path/to/model-weights

LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=8
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=0
        export SGLANG_ZBAL_BOOTSTRAP_URL=tcp://127.0.0.1:24699
        export SGLANG_ZBAL_LOCAL_MEM_SIZE=61184
        export ZBAL_ENABLE_GRAPH=1
        export ZBAL_HCCL_OP=send,recv
        export ZBAL_NPU_ALLOC_CONF=use_vmm_for_static_memory:True

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode prefill \
        --host ${P_IP[$i]} \
        --port 8000 \
        --disaggregation-bootstrap-port 8998 \
        --node-rank 0 \
        --quantization modelslim \
        --dtype bfloat16 \
        --disaggregation-transfer-backend ascend \
        --nnodes 1 \
        --trust-remote-code \
        --attention-backend ascend \
        --device npu \
        --tp-size 16 \
        --mem-fraction-static 0.78 \
        --max-running-requests 2 \
        --moe-a2a-backend deepep \
        --deepep-mode auto \
        --chunked-prefill-size 16384 \
        --prefill-max-requests 2 \
        --max-prefill-tokens 65536 \
        --enable-multimodal \
        --mm-attention-backend ascend_attn \
        --sampling-backend ascend
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export SGLANG_NPU_USE_MLAPO=1
        export SGLANG_NPU_USE_MULTI_STREAM=1

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode decode \
        --host ${D_IP[$i]} \
        --port 8001 \
        --quantization modelslim \
        --dtype bfloat16 \
        --disaggregation-transfer-backend ascend \
        --nnodes 1 \
        --trust-remote-code \
        --attention-backend ascend \
        --device npu \
        --tp-size 16 \
        --mem-fraction-static 0.82 \
        --max-running-requests 2 \
        --enable-dp-attention \
        --dp-size 2 \
        --enable-dp-lm-head \
        --disable-radix-cache \
        --enable-multimodal \
        --mm-attention-backend ascend_attn \
        --sampling-backend ascend \
        --moe-a2a-backend deepep \
        --deepep-mode auto \
        --cuda-graph-bs 1 2 4 6 8 12
        NODE_RANK=$i
        break
    fi
done
Command
# ============================================================
# Before running, replace the following placeholders:
#   <your prefill ip>: prefill node IP address
#   <your decode ip>: decode node IP address
# ============================================================

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://<your prefill ip>:8000 8998 \
    --decode http://<your decode ip>:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --policy cache_aware

Benchmark

This configuration was benchmarked with the generated-shared-prefix dataset at a 90% prefix cache hit rate (repeat_rate = 0.9). The flags are derived from the 65536-token input as follows:
--gsp-system-prompt-len 58982 = int(65536 * 0.9) is the length of the shared prefix.
--gsp-question-len 6553 = int(65536 * (1 - 0.9)) is the length of the unique per-request suffix.
--gsp-num-groups 1 keeps all requests in one prefix group for maximum cache reuse.
Command
python -m sglang.bench_serving \
    --dataset-name generated-shared-prefix \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --gsp-num-groups 1 \
    --gsp-prompts-per-group 16 \
    --gsp-system-prompt-len 58982 \
    --gsp-question-len 6553 \
    --gsp-output-len 1536 \
    --max-concurrency 2 \
    --num-prompts 16 \
    --request-rate inf
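
In the shared-prefix benchmark commands in this guide, --num-prompts equals --gsp-num-groups times --gsp-prompts-per-group (1 x 16 here). As a sketch of how such a command line could be assembled from its parameters, the hypothetical helper below builds the argument list using only the flags shown above and derives --num-prompts from the group settings. It only prints the command and does not run the benchmark.

```python
import shlex
from typing import List


def gsp_bench_args(
    system_len: int,
    question_len: int,
    output_len: int,
    num_groups: int,
    prompts_per_group: int,
    max_concurrency: int,
    host: str = "127.0.0.1",
    port: int = 6688,
) -> List[str]:
    """Build the bench_serving argument list for the shared-prefix dataset."""
    # Assumption taken from the commands in this guide: every generated prompt is sent once.
    num_prompts = num_groups * prompts_per_group
    return [
        "python", "-m", "sglang.bench_serving",
        "--dataset-name", "generated-shared-prefix",
        "--backend", "sglang",
        "--host", host,
        "--port", str(port),
        "--gsp-num-groups", str(num_groups),
        "--gsp-prompts-per-group", str(prompts_per_group),
        "--gsp-system-prompt-len", str(system_len),
        "--gsp-question-len", str(question_len),
        "--gsp-output-len", str(output_len),
        "--max-concurrency", str(max_concurrency),
        "--num-prompts", str(num_prompts),
        "--request-rate", "inf",
    ]


args = gsp_bench_args(58982, 6553, 1536, num_groups=1, prompts_per_group=16, max_concurrency=2)
print(shlex.join(args))
```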

Kimi-K2.6 W4A8 1P1D 24P IN128K OUT1K PREFIX90 100ms

Model: Kimi-K2.6 | Hardware: Atlas 800I A3 | Cards: 24 | Deploy Mode: PD Disaggregation | Quantization: W4A8 INT8 | Dataset: 128K+1K (90% prefix cache hit rate) | TPOT: 100ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   P_IP: prefill node IP address
#   D_IP: decode node IP address
#   ASCEND_MF_STORE_URL: prefill node IP with port
#   MODEL_PATH: path to the model weights directory
#   DRAFT_MODEL_PATH: path to the draft model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================


echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=60
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

P_IP=('<your prefill ip>')
D_IP=('<your decode ip1>' '<your decode ip2>')

export ASCEND_MF_STORE_URL="tcp://<your prefill ip>:24670"

MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights

LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1800
        export HCCL_SOCKET_IFNAME=<network-interface>

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode prefill \
        --host ${P_IP[$i]} \
        --port 8000 \
        --disaggregation-bootstrap-port 8998 \
        --node-rank 0 \
        --quantization modelslim \
        --dtype bfloat16 \
        --nnodes 1 \
        --trust-remote-code \
        --attention-backend ascend \
        --device npu \
        --tp-size 16 \
        --mem-fraction-static 0.78 \
        --max-running-requests 8 \
        --chunked-prefill-size 16384 \
        --enable-multimodal \
        --mm-attention-backend ascend_attn \
        --sampling-backend ascend \
        --moe-a2a-backend deepep \
        --deepep-mode auto
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export SGLANG_NPU_USE_MLAPO=1
        export SGLANG_NPU_USE_MULTI_STREAM=1

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode decode \
        --host ${D_IP[$i]} \
        --port 8001 \
        --dist-init-addr ${D_IP[0]}:5000 \
        --node-rank $i \
        --quantization modelslim \
        --dtype bfloat16 \
        --nnodes 2 \
        --trust-remote-code \
        --attention-backend ascend \
        --device npu \
        --tp-size 32 \
        --mem-fraction-static 0.82 \
        --max-running-requests 32 \
        --enable-multimodal \
        --mm-attention-backend ascend_attn \
        --sampling-backend ascend \
        --enable-dp-attention \
        --dp-size 4 \
        --disable-radix-cache \
        --moe-a2a-backend deepep \
        --deepep-mode auto \
        --cuda-graph-bs 8 \
        --speculative-algorithm EAGLE3 \
        --speculative-draft-model-path $DRAFT_MODEL_PATH \
        --speculative-num-steps 1 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 2 \
        --speculative-draft-model-quantization unquant
        NODE_RANK=$i
        break
    fi
done
Command
# ============================================================
# Before running, replace the following placeholders:
#   <your prefill ip>: prefill node IP address
#   <your decode ip1>: IP address of the first decode node (the decode instance spans two nodes)
# ============================================================

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://<your prefill ip>:8000 8998 \
    --decode http://<your decode ip1>:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --policy cache_aware
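
The decode flags in the deployment script above combine --tp-size 32, --dp-size 4, --max-running-requests 32 and --cuda-graph-bs 8. The sketch below shows one way to read those numbers. It assumes that, with DP attention enabled, running requests are divided evenly across the DP ranks and that each DP group spans tp-size / dp-size ranks; both are assumptions made for illustration, and the function names are hypothetical.

```python
def per_dp_rank_batch(max_running_requests: int, dp_size: int) -> int:
    """Requests per DP rank, assuming an even split across ranks."""
    if dp_size < 1:
        raise ValueError("dp_size must be at least 1")
    return max_running_requests // dp_size


def ranks_per_dp_group(tp_size: int, dp_size: int) -> int:
    """Ranks inside each DP attention group, assuming tp_size divides evenly."""
    if dp_size < 1 or tp_size % dp_size != 0:
        raise ValueError("tp_size must be a positive multiple of dp_size")
    return tp_size // dp_size


# Values from the 24-card decode command in this guide.
print(per_dp_rank_batch(32, 4))   # 8
print(ranks_per_dp_group(32, 4))  # 8
```

Under these assumptions the per-rank batch of 8 lines up with the single captured graph batch size, --cuda-graph-bs 8. The other decode commands in this guide show the same pattern: the per-rank batch is always one of the listed graph batch sizes.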

Benchmark

This configuration was benchmarked with the generated-shared-prefix dataset at a 90% prefix cache hit rate (repeat_rate = 0.9). The flags are derived from the 131072-token input as follows:
--gsp-system-prompt-len 117964 = int(131072 * 0.9) is the length of the shared prefix.
--gsp-question-len 13107 = int(131072 * (1 - 0.9)) is the length of the unique per-request suffix.
--gsp-num-groups 1 keeps all requests in one prefix group for maximum cache reuse.
Command
python -m sglang.bench_serving \
    --dataset-name generated-shared-prefix \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --gsp-num-groups 1 \
    --gsp-prompts-per-group 8 \
    --gsp-system-prompt-len 117964 \
    --gsp-question-len 13107 \
    --gsp-output-len 1024 \
    --max-concurrency 8 \
    --num-prompts 8 \
    --request-rate inf

Kimi-K2.6 W4A8 1P1D 24P IN64K OUT1K5 PREFIX90 100ms

Model: Kimi-K2.6 | Hardware: Atlas 800I A3 | Cards: 24 | Deploy Mode: PD Disaggregation | Quantization: W4A8 INT8 | Dataset: 64K+1.5K (90% prefix cache hit rate) | TPOT: 100ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   P_IP: prefill node IP address
#   D_IP: decode node IP addresses (one per decode node)
#   ASCEND_MF_STORE_URL: prefill node IP with port
#   MODEL_PATH: path to the model weights directory
#   DRAFT_MODEL_PATH: path to the draft model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================


echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=60
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

P_IP=('<your prefill ip>')
D_IP=('<your decode ip1>' '<your decode ip2>')

export ASCEND_MF_STORE_URL="tcp://<your prefill ip>:24670"

MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights

LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
# prefill
for i in "${!P_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
    then
        echo "${P_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1800
        export HCCL_SOCKET_IFNAME=<network-interface>

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode prefill \
        --host ${P_IP[$i]} \
        --port 8000 \
        --disaggregation-bootstrap-port 8998 \
        --node-rank 0 \
        --quantization modelslim \
        --dtype bfloat16 \
        --nnodes 1 \
        --trust-remote-code \
        --attention-backend ascend \
        --device npu \
        --tp-size 16 \
        --mem-fraction-static 0.78 \
        --max-running-requests 8 \
        --chunked-prefill-size 16384 \
        --enable-multimodal \
        --mm-attention-backend ascend_attn \
        --sampling-backend ascend \
        --moe-a2a-backend deepep \
        --deepep-mode auto
        NODE_RANK=$i
        break
    fi
done

# decode
for i in "${!D_IP[@]}";
do
    if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
    then
        echo "${D_IP[$i]}"
        export GLOO_SOCKET_IFNAME=<network-interface>
        export HCCL_BUFFSIZE=1200
        export HCCL_SOCKET_IFNAME=<network-interface>
        export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
        export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
        export SGLANG_ENABLE_SPEC_V2=1
        export SGLANG_NPU_USE_MLAPO=1
        export SGLANG_NPU_USE_MULTI_STREAM=1

        python3 -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --disaggregation-mode decode \
        --host ${D_IP[$i]} \
        --port 8001 \
        --dist-init-addr ${D_IP[0]}:5000 \
        --node-rank $i \
        --quantization modelslim \
        --dtype bfloat16 \
        --nnodes 2 \
        --trust-remote-code \
        --attention-backend ascend \
        --device npu \
        --tp-size 32 \
        --mem-fraction-static 0.82 \
        --max-running-requests 32 \
        --enable-multimodal \
        --mm-attention-backend ascend_attn \
        --sampling-backend ascend \
        --enable-dp-attention \
        --dp-size 4 \
        --disable-radix-cache \
        --moe-a2a-backend deepep \
        --deepep-mode auto \
        --cuda-graph-bs 8 \
        --speculative-algorithm EAGLE3 \
        --speculative-draft-model-path $DRAFT_MODEL_PATH \
        --speculative-num-steps 1 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 2 \
        --speculative-draft-model-quantization unquant
        NODE_RANK=$i
        break
    fi
done
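
The `<network-interface>` placeholders for HCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME must name the interface that carries the node IP. One possible way to look it up is sketched below. The helper name `ifname_for_ip` is hypothetical, and the sketch assumes the one-line-per-address output of `ip -o -4 addr show`, where the second field is the interface and the fourth is `address/prefix`.

```shell
# Hypothetical helper: print the interface whose IPv4 address equals $1.
# Reads `ip -o -4 addr show` style lines from stdin so it can be tried offline.
ifname_for_ip() {
    awk -v ip="$1" '{
        split($4, addr, "/")          # strip the /prefix length
        if (addr[1] == ip) { print $2; exit }
    }'
}

# Usage on a real node (LOCAL_HOST1 as set in the script above):
#   ip -o -4 addr show | ifname_for_ip "$LOCAL_HOST1"
```

If nothing is printed, the address is not configured on this host, which usually means P_IP or D_IP does not match the machine the script runs on.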
Command
# ============================================================
# Before running, replace the following placeholders:
#   <your prefill ip>: prefill node IP address
#   <your decode ip1>: first decode node IP address (decode may have distributed nodes)
# ============================================================

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://<your prefill ip>:8000 8998 \
    --decode http://<your decode ip1>:8001 \
    --host 127.0.0.1 \
    --port 6688 \
    --policy cache_aware

Benchmark

The benchmark uses the generated-shared-prefix dataset with a 90% prefix cache hit rate (repeat_rate = 0.9) over a 64K (65536-token) input. --gsp-system-prompt-len 58982 = int(65536 * 0.9) is the shared prefix portion. --gsp-question-len 6553 = int(65536 * (1 - 0.9)) is the unique per-request suffix. --gsp-num-groups 1 keeps all requests in one prefix group for maximum cache reuse.
Command
python -m sglang.bench_serving \
    --dataset-name generated-shared-prefix \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --gsp-num-groups 1 \
    --gsp-prompts-per-group 16 \
    --gsp-system-prompt-len 58982 \
    --gsp-question-len 6553 \
    --gsp-output-len 1536 \
    --max-concurrency 16 \
    --num-prompts 16 \
    --request-rate inf
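
The shared-prefix and suffix lengths passed to the benchmark follow from the total input length and the hit rate. The arithmetic can be reproduced with a small sketch like the one below. The function name `gsp_lengths` is hypothetical and not part of SGLang. It takes the hit rate as an integer percentage so that shell integer division gives the same truncation as `int(...)`.

```shell
# Hypothetical helper: split a total input length into the shared-prefix
# length and the per-request suffix length for a hit rate given in percent.
gsp_lengths() {
    local total=$1 hit_pct=$2
    # Integer division truncates, matching int(total * rate).
    local prefix=$(( total * hit_pct / 100 ))
    local suffix=$(( total * (100 - hit_pct) / 100 ))
    echo "$prefix $suffix"
}

gsp_lengths 65536 90     # 64K input  -> 58982 6553
gsp_lengths 131072 90    # 128K input -> 117964 13107
```

Because both values are truncated independently, they can sum to one token less than the nominal input length (58982 + 6553 = 65535).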

Kimi-K2.6 W4A8 8P IN1024X1024 30 OUT1024 50ms

Model: Kimi-K2.6 Hardware: Atlas 800I A3 Cards: 8 Deploy Mode: PD Mixed Quantization: W4A8 INT8 Dataset: 1024x1024 (30)+1024 (format: resolution (input tokens) + output tokens) TPOT: 50ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   DRAFT_MODEL_PATH: path to the draft model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export HCCL_BUFFSIZE=1500
export HCCL_OP_EXPANSION_MODE=AIV
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=112
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --quantization modelslim \
    --dtype bfloat16 \
    --model-loader-extra-config '{"enable_multithread_load": true}' \
    --trust-remote-code \
    --device npu \
    --attention-backend ascend \
    --tp-size 16 \
    --mem-fraction-static 0.76 \
    --max-running-requests 176 \
    --chunked-prefill-size 32768 \
    --context-length 8192 \
    --max-prefill-tokens 16384 \
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --sampling-backend ascend \
    --enable-dp-attention \
    --dp-size 16 \
    --moe-a2a-backend deepep \
    --deepep-mode auto \
    --cuda-graph-bs 1 2 4 8 9 10 11 \
    --disable-radix-cache \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path $DRAFT_MODEL_PATH \
    --speculative-num-steps 2 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 3 \
    --speculative-draft-model-quantization unquant \
    --prefill-delayer-max-delay-passes 200 \
    --enable-prefill-delayer
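
Since MODEL_PATH and DRAFT_MODEL_PATH ship as `/path/to/...` placeholders, a pre-flight check run before the launch command can catch a forgotten edit early. The sketch below is one way to do it. `require_dir` is a hypothetical helper, not an SGLang utility.

```shell
# Hypothetical pre-flight helper: fail early if a path variable still holds
# the /path/to/... placeholder or does not point at an existing directory.
require_dir() {
    local name=$1 path=$2
    case "$path" in
        /path/to/*)
            echo "$name is still a placeholder: $path" >&2
            return 1
            ;;
    esac
    [ -d "$path" ] || { echo "$name is not a directory: $path" >&2; return 1; }
}

# Usage before python3 -m sglang.launch_server:
#   require_dir MODEL_PATH "$MODEL_PATH" || exit 1
#   require_dir DRAFT_MODEL_PATH "$DRAFT_MODEL_PATH" || exit 1
```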

Benchmark

The benchmark uses the image dataset with one 1024x1024 image per request.
Command
python -m sglang.bench_serving \
    --dataset-name image \
    --backend sglang-oai-chat \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 160 \
    --random-input-len 30 \
    --random-output-len 1024 \
    --num-prompts 640 \
    --random-range-ratio 1 \
    --request-rate inf \
    --warmup-requests 16 \
    --image-count 1 \
    --image-resolution 1024x1024

Kimi-K2.6 W4A8 8P IN1080P 30 OUT256 50ms

Model: Kimi-K2.6 Hardware: Atlas 800I A3 Cards: 8 Deploy Mode: PD Mixed Quantization: W4A8 INT8 Dataset: 1080p_30+256 TPOT: 50ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   DRAFT_MODEL_PATH: path to the draft model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export HCCL_BUFFSIZE=1800
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --quantization modelslim \
    --dtype bfloat16 \
    --model-loader-extra-config '{"enable_multithread_load": true}' \
    --trust-remote-code \
    --device npu \
    --attention-backend ascend \
    --tp-size 16 \
    --mem-fraction-static 0.7 \
    --max-running-requests 80 \
    --chunked-prefill-size -1 \
    --context-length 8192 \
    --prefill-max-requests 1 \
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --sampling-backend ascend \
    --moe-a2a-backend deepep \
    --deepep-mode auto \
    --enable-dp-attention \
    --dp-size 16 \
    --cuda-graph-bs 1 2 4 6 8 10 \
    --disable-radix-cache \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path $DRAFT_MODEL_PATH \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 5 \
    --speculative-draft-model-quantization unquant

Benchmark

The benchmark uses the image dataset with one 1920x1080 image per request.
Command
python -m sglang.bench_serving \
    --dataset-name image \
    --backend sglang-oai-chat \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 20 \
    --random-input-len 30 \
    --random-output-len 256 \
    --num-prompts 20 \
    --random-range-ratio 1 \
    --request-rate inf \
    --image-count 1 \
    --image-resolution 1920x1080

Kimi-K2.6 W4A8 8P IN3K5 OUT1K5 20ms

Model: Kimi-K2.6 Hardware: Atlas 800I A3 Cards: 8 Deploy Mode: PD Mixed Quantization: W4A8 INT8 Dataset: 3.5K+1.5K TPOT: 20ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   DRAFT_MODEL_PATH: path to the draft model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=1200
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=96
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --trust-remote-code \
    --attention-backend ascend \
    --device npu \
    --quantization modelslim \
    --dtype bfloat16 \
    --tp-size 16 \
    --mem-fraction-static 0.753 \
    --max-running-requests 80 \
    --chunked-prefill-size 32768 \
    --context-length 6144 \
    --max-prefill-tokens 65536 \
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --sampling-backend ascend \
    --enable-dp-attention \
    --dp-size 16 \
    --moe-a2a-backend deepep \
    --deepep-mode auto \
    --cuda-graph-bs 1 2 3 4 5 \
    --disable-radix-cache \
    --model-loader-extra-config '{"enable_multithread_load": true}' \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path $DRAFT_MODEL_PATH \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 5 \
    --speculative-draft-model-quantization unquant \
    --prefill-delayer-max-delay-passes 200 \
    --enable-prefill-delayer

Benchmark

The benchmark uses the random dataset with fixed lengths of 3500 input tokens and 1500 output tokens per request.
Command
python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 64 \
    --random-input-len 3500 \
    --random-output-len 1500 \
    --num-prompts 256 \
    --random-range-ratio 1 \
    --warmup-requests 0

Kimi-K2.6 W4A8 8P IN3K5 OUT1K5 50ms

Model: Kimi-K2.6 Hardware: Atlas 800I A3 Cards: 8 Deploy Mode: PD Mixed Quantization: W4A8 INT8 Dataset: 3.5K+1.5K TPOT: 50ms

Model Deployment

Command
# ============================================================
# Before running, update the following variables:
#   MODEL_PATH: path to the model weights directory
#   DRAFT_MODEL_PATH: path to the draft model weights directory
#   HCCL_SOCKET_IFNAME: network interface name for HCCL
#   GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================

MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_BUFFSIZE=1200
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=96
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_SET_CPU_AFFINITY=1
export STREAMS_PER_DEVICE=32

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 127.0.0.1 --port 6688 \
    --trust-remote-code \
    --attention-backend ascend \
    --device npu \
    --quantization modelslim \
    --dtype bfloat16 \
    --tp-size 16 \
    --mem-fraction-static 0.783 \
    --max-running-requests 208 \
    --chunked-prefill-size 32768 \
    --context-length 6144 \
    --max-prefill-tokens 16384 \
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --sampling-backend ascend \
    --enable-dp-attention \
    --dp-size 16 \
    --moe-a2a-backend deepep \
    --deepep-mode auto \
    --cuda-graph-bs 1 2 4 8 12 13 \
    --disable-radix-cache \
    --model-loader-extra-config '{"enable_multithread_load": true}' \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path $DRAFT_MODEL_PATH \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 5 \
    --speculative-draft-model-quantization unquant \
    --prefill-delayer-max-delay-passes 200 \
    --enable-prefill-delayer

Benchmark

The benchmark uses the random dataset with fixed lengths of 3500 input tokens and 1500 output tokens per request.
Command
python -m sglang.bench_serving \
    --dataset-name random \
    --backend sglang \
    --host 127.0.0.1 \
    --port 6688 \
    --max-concurrency 192 \
    --random-input-len 3500 \
    --random-output-len 1500 \
    --num-prompts 768 \
    --random-range-ratio 1 \
    --warmup-requests 0
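
Each configuration targets a TPOT budget (50ms for this one), so it can be convenient to check a saved benchmark log against that budget. The sketch below assumes the log contains a line of the form `Mean TPOT (ms): <value>`. That line format is an assumption about the benchmark output and should be checked against the installed version. `tpot_within_target` is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the mean TPOT found in a benchmark log
# is at or below the target. $1: log file, $2: target TPOT in milliseconds.
# Assumes a log line such as "Mean TPOT (ms):      18.52".
tpot_within_target() {
    awk -v target="$2" -F':' '
        /Mean TPOT \(ms\)/ { value = $2 + 0; found = 1 }
        END {
            if (!found) exit 2          # no TPOT line in the log
            exit !(value <= target)     # 0 when within target, 1 otherwise
        }
    ' "$1"
}

# Usage, with the benchmark output saved to bench.log:
#   tpot_within_target bench.log 50 && echo "TPOT target met"
```

An exit status of 2 means the TPOT line was not found, which keeps a parsing failure distinct from a missed target.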