Environment Preparation
Installation
Ensure sufficient disk space before pulling images. The Docker image requires at least 30 GB of free space. If you need to download model weights, check the model size at ModelScope to reserve enough space.
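A quick pre-flight check for the 30 GB requirement can be sketched as follows. The helper name `has_free_gb` is hypothetical, and the path you check is an assumption: Docker stores images under /var/lib/docker by default, but your data root may be configured differently.

```shell
# Hypothetical helper (not part of Docker or SGLang): succeeds when the
# filesystem that holds PATH has at least MIN_GB gibibytes available.
has_free_gb() {
    local path min_gb avail_kb
    path="$1"
    min_gb="$2"
    # -P: POSIX single-line output, -k: 1 KiB blocks; field 4 is "Available".
    avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}
```

For example, `has_free_gb /var/lib/docker 30 || echo "less than 30 GB free" >&2` warns before the pull starts. Model weights need their own headroom on whichever filesystem holds them.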
- Atlas 800I A3
- Atlas 800I A2
Command (Atlas 800I A3)
docker pull quay.io/ascend/sglang:v0.5.10-npu.rc1-a3
docker run -itd --shm-size=16g --privileged=true --name ${NAME} \
--net=host \
-v /var/queue_schedule:/var/queue_schedule \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--device=/dev/davinci0:/dev/davinci0 \
--device=/dev/davinci1:/dev/davinci1 \
--device=/dev/davinci2:/dev/davinci2 \
--device=/dev/davinci3:/dev/davinci3 \
--device=/dev/davinci4:/dev/davinci4 \
--device=/dev/davinci5:/dev/davinci5 \
--device=/dev/davinci6:/dev/davinci6 \
--device=/dev/davinci7:/dev/davinci7 \
--device=/dev/davinci8:/dev/davinci8 \
--device=/dev/davinci9:/dev/davinci9 \
--device=/dev/davinci10:/dev/davinci10 \
--device=/dev/davinci11:/dev/davinci11 \
--device=/dev/davinci12:/dev/davinci12 \
--device=/dev/davinci13:/dev/davinci13 \
--device=/dev/davinci14:/dev/davinci14 \
--device=/dev/davinci15:/dev/davinci15 \
--device=/dev/davinci_manager:/dev/davinci_manager \
--device=/dev/hisi_hdc:/dev/hisi_hdc \
--entrypoint=bash \
quay.io/ascend/sglang:v0.5.10-npu.rc1-a3
Command (Atlas 800I A2)
docker pull quay.io/ascend/sglang:v0.5.10-npu.rc1-910b
docker run -itd --shm-size=16g --privileged=true --name ${NAME} \
--net=host \
-v /var/queue_schedule:/var/queue_schedule \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--device=/dev/davinci0:/dev/davinci0 \
--device=/dev/davinci1:/dev/davinci1 \
--device=/dev/davinci2:/dev/davinci2 \
--device=/dev/davinci3:/dev/davinci3 \
--device=/dev/davinci4:/dev/davinci4 \
--device=/dev/davinci5:/dev/davinci5 \
--device=/dev/davinci6:/dev/davinci6 \
--device=/dev/davinci7:/dev/davinci7 \
--device=/dev/davinci_manager:/dev/davinci_manager \
--device=/dev/hisi_hdc:/dev/hisi_hdc \
--entrypoint=bash \
quay.io/ascend/sglang:v0.5.10-npu.rc1-910b
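The two docker run commands above differ only in the image tag and in the number of /dev/davinciN devices passed through (16 on the A3, 8 on the A2). As a sketch, the repeated --device flags can be generated rather than typed out. `davinci_device_flags` is a hypothetical helper name, not something shipped with the image.

```shell
# Hypothetical helper: print one --device flag per NPU, indices 0..COUNT-1,
# one flag per line so the output can be word-split into a docker command.
davinci_device_flags() {
    local count i
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s\n' "--device=/dev/davinci${i}:/dev/davinci${i}"
        i=$((i + 1))
    done
}
```

Used unquoted, as in `docker run ... $(davinci_device_flags 8) ...`, the output stands in for the davinci0 through davinci7 lines. The davinci_manager and hisi_hdc devices and the remaining flags stay as written in the commands above.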
Deployment
Single-node Deployment
Run the following script to execute online inference.
Qwen3.5 397B
Recommended model:
Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp
Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1000
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--tp-size 16 --nnodes 1 --node-rank 0 \
--chunked-prefill-size 4096 --max-prefill-tokens 280000 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 \
--mem-fraction-static 0.7 \
--port 8000 \
--cuda-graph-bs 16 \
--enable-multimodal \
--mm-attention-backend ascend_attn \
--dtype bfloat16
Qwen3.5 122B
Recommended model:
Eco-Tech/Qwen3.5-122B-A10B-w8a8-mtp
Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1000
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--tp-size 8 --nnodes 1 --node-rank 0 \
--chunked-prefill-size 4096 --max-prefill-tokens 280000 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 \
--mem-fraction-static 0.7 \
--port 8000 \
--cuda-graph-bs 16 \
--enable-multimodal \
--mm-attention-backend ascend_attn \
--dtype bfloat16
Qwen3.5 35B
Recommended model:
Eco-Tech/Qwen3.5-35B-A3B-w8a8-mtp
Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1000
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--tp-size 2 --nnodes 1 --node-rank 0 \
--chunked-prefill-size 4096 --max-prefill-tokens 280000 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 \
--mem-fraction-static 0.7 \
--port 8000 \
--cuda-graph-bs 16 \
--enable-multimodal \
--mm-attention-backend ascend_attn \
--dtype bfloat16
Qwen3.5 27B
Recommended model:
Eco-Tech/Qwen3.5-27B-w8a8-mtp
Command
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1000
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--tp-size 2 \
--chunked-prefill-size -1 --max-prefill-tokens 120000 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 \
--mem-fraction-static 0.8 \
--port 8000 \
--cuda-graph-bs 32 \
--enable-multimodal \
--mm-attention-backend ascend_attn
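The four single-node commands share the same environment setup and differ mainly in --tp-size. The lookup below only restates the values used in the commands above. The function name and the short model keys are hypothetical labels chosen for illustration.

```shell
# Hypothetical lookup: print the --tp-size used in this guide's single-node
# command for a given Qwen3.5 model size (values copied from the commands above).
single_node_tp_size() {
    case "$1" in
        397B) echo 16 ;;
        122B) echo 8 ;;
        35B | 27B) echo 2 ;;
        *) echo "unknown model size: $1" >&2; return 1 ;;
    esac
}
```

A launch script could then pass `--tp-size "$(single_node_tp_size 122B)"`. The 27B command also differs in other flags (--chunked-prefill-size -1, --max-prefill-tokens 120000, --mem-fraction-static 0.8, --cuda-graph-bs 32, and no --dtype), so those would still need to be set separately.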
Multi-node Deployment
Recommended model:
Qwen/Qwen3.5-35B-A3B
Other Qwen3.5 series models can also be deployed in multi-node configurations following this workflow. Simply change --model-path to the corresponding model, and adjust parameters like --tp-size, --nnodes, and --mem-fraction-static according to the model size and available resources.
Command
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export HCCL_BUFFSIZE=1000
# Run ifconfig on both nodes and find the interface whose inet address matches the node IP. That is the public interface; set its name below in place of lo.
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
P_IP=('your ip1' 'your ip2')
P_MASTER="${P_IP[0]}:your port"
LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--tp-size 8 --nnodes 2 --node-rank $i --dist-init-addr $P_MASTER \
--chunked-prefill-size 16384 --max-prefill-tokens 131072 \
--trust-remote-code \
--host 127.0.0.1 \
--mem-fraction-static 0.8 \
--port 8000 \
--served-model-name qwen3.5 \
--cuda-graph-max-bs 16 \
--disable-radix-cache
NODE_RANK=$i
break
fi
done
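The loop above derives --node-rank from the position of the local IP in P_IP. The same matching logic can be isolated in a small function, which makes it easy to verify on each node before launching anything. This is a sketch only: `node_rank_for` is a hypothetical name, and the addresses in the usage line are placeholder values.

```shell
# Hypothetical helper: print the zero-based position of LOCAL_IP within the
# remaining arguments (the ordered node list); fail if it is not listed.
node_rank_for() {
    local local_ip rank ip
    local_ip="$1"
    shift
    rank=0
    for ip in "$@"; do
        if [ "$ip" = "$local_ip" ]; then
            echo "$rank"
            return 0
        fi
        rank=$((rank + 1))
    done
    return 1
}
```

For example, `node_rank_for 192.0.2.11 192.0.2.10 192.0.2.11` prints 1, and the call returns a non-zero status when the local address is missing from the list, which is a useful signal that P_IP or the chosen interface is wrong.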
Prefill-Decode Disaggregation
Not tested yet.
Testing the Service
Once the server prints "The server is fired up and ready to roll!" in the logs, it is ready to accept requests. For testing examples (Health Check, Generate, Chat Completions, Multimodal Chat Completions, and port usage guidance), see Testing the Service.
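Instead of watching the logs, a script can poll the server until it answers. The sketch below assumes that curl is installed and that the server exposes a /health endpoint, which SGLang's HTTP server provides in the versions we are aware of; confirm this against your version. `wait_for_server` is a hypothetical helper name.

```shell
# Hypothetical helper: poll BASE_URL/health until it answers with HTTP success,
# trying at most ATTEMPTS times and sleeping DELAY seconds between tries.
wait_for_server() {
    local base_url attempts delay n
    base_url="$1"
    attempts="$2"
    delay="$3"
    n=0
    while [ "$n" -lt "$attempts" ]; do
        # -f: treat HTTP errors as failure, -s: quiet, -o /dev/null: drop the body.
        if curl -fs -o /dev/null --max-time 5 "$base_url/health"; then
            return 0
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 1
}
```

With the launch commands above, which bind to 127.0.0.1 on port 8000, a call such as `wait_for_server http://127.0.0.1:8000 60 10` would wait up to roughly ten minutes. Large models can take longer to load, so the attempt count is a value to tune rather than a recommendation.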
