SGLang installation with NPU support

You can install SGLang using any of the methods below. Before deploying, go through the System Settings section to make sure the cluster runs at optimal performance. If you run into any problems, open an issue in the sglang repository.

Component Version Mapping For SGLang

Component	Version	How to Obtain
HDK	25.5.2	link
CANN	9.0.0	Obtain Images
TorchNPU	26.0.0	link
MemFabric	1.0.8	`pip install memfabric-hybrid==1.0.8`
Triton	3.2.1.dev20260530	`pip install triton-ascend==3.2.1.dev20260530 --extra-index-url=https://mirrors.huaweicloud.com/ascend/repos/pypi/nightly --trusted-host mirrors.huaweicloud.com`
SGLang NPU Kernel	2026.05.01.post3	link
MemFabric-zbal	1.1.1	`pip install memfabric-zbal==1.1.1`

Obtain CANN Image

Ensure sufficient disk space before pulling images. Each Docker image requires at least 30GB of free space.
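
The free-space requirement above can be checked before pulling. The following is a minimal sketch, assuming a POSIX `df`; `has_free_space` is a hypothetical helper name, and `/var/lib/docker` is only Docker's default data root, so your installation may store images elsewhere.

```shell
# has_free_space PATH MIN_GB
# Hypothetical helper: succeeds if the filesystem holding PATH has at least
# MIN_GB gibibytes available.
has_free_space() {
    local path="$1"
    local min_gb="$2"
    local avail_kb
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
    # column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}
```

For example, `has_free_space /var/lib/docker 30 || echo "less than 30GB free" >&2` prints a warning when the threshold is not met.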

You can obtain the dependency of a specified version of CANN through an image.

Atlas 800I A3
Atlas 800I A2

Command (Atlas 800I A3)

docker pull quay.io/ascend/cann:9.0.0-a3-ubuntu22.04-py3.11

Command (Atlas 800I A2)

docker pull quay.io/ascend/cann:9.0.0-910b-ubuntu22.04-py3.11

Preparing the Running Environment

Method 1: Installing from source with prerequisites

Python Version

Only Python 3.11 is currently supported. To avoid breaking the system's pre-installed Python, install it in a conda environment.
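
A quick guard for the version requirement might look like the sketch below. `is_supported_python` is a hypothetical helper, not part of SGLang; it only compares a version string against the 3.11 series.

```shell
# is_supported_python VERSION
# Hypothetical helper: succeeds only for "3.11" or "3.11.x".
is_supported_python() {
    case "$1" in
        3.11|3.11.*) return 0 ;;
        *) return 1 ;;
    esac
}
```

To test the active interpreter, pass it the output of `python3 -c 'import platform; print(platform.python_version())'`.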

Command

conda create --name sglang_npu python=3.11
conda activate sglang_npu

Note on Anaconda repository restrictions: If you encounter an error like “Terms of Service have not been accepted” during the conda create step, the default Anaconda repository is blocking package downloads. To resolve this, configure a mirror (e.g., Tsinghua Open Source Mirror):

Command

# Add Tsinghua mirrors
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
conda config --set show_channel_urls yes
conda config --remove channels defaults

Edit the system-level conda config to remove any hardcoded defaults (e.g., vi ~/miniconda3/.condarc). Then remove the failed environment and recreate it:

Command

conda clean -i
conda env remove -n sglang_npu
conda create --name sglang_npu python=3.11
conda activate sglang_npu

CANN

Before working with SGLang on Ascend, install the CANN Toolkit, the Kernels operator package, and NNAL, all at version 9.0.0. See the installation guide.

MemFabric-Hybrid

If you want to use PD disaggregation mode, you need to install MemFabric-Hybrid. MemFabric-Hybrid is a drop-in replacement for the Mooncake Transfer Engine that enables KV cache transfer on Ascend NPU clusters.

Command

pip install memfabric-hybrid==1.0.8

MemFabric-zbal

MemFabric-zbal is a Zero Buffer Acceleration Library of high-performance operators for LLM inference and training on Ascend. It accelerates computation by eliminating intermediate memory buffers. It is required only on aarch64 clusters and is installed in addition to MemFabric-Hybrid.

Command

# Only needed on aarch64 (arm64) hosts
pip install memfabric-zbal==1.1.1

PyTorch and PyTorch Framework Adaptor on Ascend

Command

PYTORCH_VERSION=2.10.0
TORCHVISION_VERSION=0.25.0
TORCH_NPU_VERSION=2.10.0
pip install torch==$PYTORCH_VERSION torchvision==$TORCHVISION_VERSION --index-url https://download.pytorch.org/whl/cpu
pip install torch_npu==$TORCH_NPU_VERSION

If you use a different version of torch and need a matching torch_npu, check the installation guide.

Triton on Ascend

We provide our own implementation of Triton for Ascend.

Command

pip install triton-ascend==3.2.1.dev20260530 \
  --extra-index-url=https://mirrors.huaweicloud.com/ascend/repos/pypi/nightly \
  --trusted-host mirrors.huaweicloud.com

To install Triton on Ascend from nightly builds or from source, follow the installation guide.

SGLang Kernels NPU

We provide SGL kernels for Ascend NPU. See the installation guide.

DeepEP-compatible Library

We provide a DeepEP-compatible library as a drop-in replacement for deepseek-ai’s DeepEP library. See the installation guide.

Some other dependencies

Command

# libGL
apt update
apt install libgl1 libglib2.0-0

# ensure setuptools contains pkg_resources module
pip install "setuptools<80"

Installing SGLang from source

Command

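# Use the latest release branch. A plain clone checks out the default branch,
# so switch with `git checkout <release_branch>` if needed.
git clone https://github.com/sgl-project/sglang.git
cd sglang
mv python/pyproject_npu.toml python/pyproject.toml
# Quote the extras so shells such as zsh do not expand the brackets
pip install -e "python[all_npu]"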

Method 2: Using Docker Image

Obtain Image

To obtain the Ascend NPU image, either download the prebuilt SGLang image or build one from the Dockerfile.

Ensure sufficient disk space before pulling images. Each Docker image requires at least 30GB of free space. If you need to download model weights, check the model size at ModelScope to reserve enough space.

Download SGLang image

We publish both stable releases and daily builds. Choose a stable release tag (e.g., v0.5.13.post1-cann9.0.0-a3) if you prefer a validated version, or a daily build tag (e.g., main-cann9.0.0-a3) if you need the latest development changes.

Atlas 800I A3
Atlas 800I A2

Command (Atlas 800I A3)

# Stable release
docker pull quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-a3

# Daily build
docker pull quay.io/ascend/sglang:main-cann9.0.0-a3

Command (Atlas 800I A2)

# Stable release
docker pull quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-910b

# Daily build
docker pull quay.io/ascend/sglang:main-cann9.0.0-910b
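
The tags above follow the pattern `<release>-cann<cann_version>-<device>`. As an illustrative sketch, the image reference can be composed from those parts. `sglang_image` is a hypothetical helper, and it only knows the two device suffixes that appear in this guide (`a3` and `910b`).

```shell
# sglang_image RELEASE DEVICE [CANN_VERSION]
# Hypothetical helper: prints the image reference for a release tag
# (e.g. v0.5.13.post1 or main) and a device suffix (a3 or 910b).
sglang_image() {
    local release="$1"
    local device="$2"
    local cann="${3:-9.0.0}"
    case "$device" in
        a3|910b) ;;
        *) echo "unknown device suffix: $device" >&2; return 1 ;;
    esac
    echo "quay.io/ascend/sglang:${release}-cann${cann}-${device}"
}
```

For example, `docker pull "$(sglang_image main 910b)"` pulls the daily build for Atlas 800I A2.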

Build an image based on Dockerfile

Command

# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker

# Build the docker image
# Replace <arch_tag> with the target architecture, e.g., amd64, arm64.
# Optional build arguments:
#   --build-arg DEVICE_TYPE=910b          # Required for Atlas 800I A2
#   --build-arg APTMIRROR=<mirror_url>    # Use a custom APT mirror to improve download speed
# If there are network errors, please modify the Dockerfile to add ARG HTTP_PROXY/HTTPS_PROXY and set them as ENV.
docker build --build-arg TARGETARCH=<arch_tag> -t <image_name> -f npu.Dockerfile .

Create Docker

Notice: --privileged and --network=host are required by RDMA, which is typically needed by Ascend NPU clusters.

Atlas 800I A3
Atlas 800I A2

Command (Atlas 800I A3)

# Create a shortcut 'drun' to launch a privileged Docker container
alias drun='docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
    --device=/dev/davinci8 --device=/dev/davinci9 --device=/dev/davinci10 --device=/dev/davinci11 \
    --device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \
    --device=/dev/davinci_manager --device=/dev/hisi_hdc \
    --volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --volume /etc/ascend_install.info:/etc/ascend_install.info \
    --volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/'

# Set HF_TOKEN so that SGLang can download the model.
# The container runs with the '--rm' flag, so it is removed automatically after the command finishes (including on Ctrl+C)
drun --env "HF_TOKEN=<secret>" \
    <image_name> \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --attention-backend ascend

Command (Atlas 800I A2)

# Create a shortcut 'drun' to launch a privileged Docker container
alias drun='docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
    --device=/dev/davinci_manager --device=/dev/hisi_hdc \
    --volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --volume /etc/ascend_install.info:/etc/ascend_install.info \
    --volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/'

# Set HF_TOKEN so that SGLang can download the model.
# The container runs with the '--rm' flag, so it is removed automatically after the command finishes (including on Ctrl+C)
drun --env "HF_TOKEN=<secret>" \
    <image_name> \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --attention-backend ascend

SGLang serves on http://127.0.0.1:30000 by default. You can change the host and port with the --host and --port parameters.
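
The `drun` aliases above list one `--device=/dev/davinciN` flag per NPU (16 in the Atlas 800I A3 example, 8 in the Atlas 800I A2 example). A minimal sketch for generating that list instead of typing it by hand follows; `davinci_device_flags` is a hypothetical helper, and it assumes the device nodes are numbered consecutively from 0.

```shell
# davinci_device_flags COUNT
# Hypothetical helper: prints "--device=/dev/davinci0 ... --device=/dev/davinci<COUNT-1>"
# on a single line.
davinci_device_flags() {
    local count="$1"
    local i=0
    local flags=""
    while [ "$i" -lt "$count" ]; do
        flags="$flags --device=/dev/davinci$i"
        i=$((i + 1))
    done
    # Drop the leading space added by the first iteration.
    echo "${flags# }"
}
```

The output is meant to be word-split by the shell, so use it unquoted, for example `docker run -it --rm $(davinci_device_flags 8) --device=/dev/davinci_manager ...`.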

System Settings

CPU performance power scheme

The default power scheme on Ascend hardware is ondemand, which can reduce performance. Changing it to performance is recommended.

Command

echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Make sure changes are applied successfully
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor # shows performance
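
The check above reads only cpu0. To confirm the setting on every core, a sketch like the following counts the CPUs that still report another governor. `count_non_performance` is a hypothetical helper; the base directory is a parameter so the function can be pointed at a different sysfs path.

```shell
# count_non_performance [BASE_DIR]
# Hypothetical helper: prints the number of CPUs whose scaling governor is
# not "performance". BASE_DIR defaults to /sys/devices/system/cpu.
count_non_performance() {
    local base="${1:-/sys/devices/system/cpu}"
    local n=0
    local f
    for f in "$base"/cpu*/cpufreq/scaling_governor; do
        # Skip unmatched globs and unreadable entries.
        [ -r "$f" ] || continue
        [ "$(cat "$f")" = "performance" ] || n=$((n + 1))
    done
    echo "$n"
}
```

A result of `0` means every CPU that exposes a governor is set to performance.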

Disable NUMA balancing

Command

sudo sysctl -w kernel.numa_balancing=0
# Check
cat /proc/sys/kernel/numa_balancing # shows 0

Prevent swapping out system memory

Command

sudo sysctl -w vm.swappiness=10

# Check
cat /proc/sys/vm/swappiness # shows 10
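
Values set with `sysctl -w` do not survive a reboot. One common way to persist them is a drop-in file under `/etc/sysctl.d/`. The sketch below writes the two settings from this section to a file; `write_npu_sysctl` and the file name in the usage note are assumptions chosen for illustration.

```shell
# write_npu_sysctl TARGET_FILE
# Hypothetical helper: writes the NUMA balancing and swappiness settings
# from this guide to TARGET_FILE in sysctl.conf syntax.
write_npu_sysctl() {
    local target="$1"
    cat > "$target" <<'EOF'
kernel.numa_balancing = 0
vm.swappiness = 10
EOF
}
```

Run as root, `write_npu_sysctl /etc/sysctl.d/99-sglang-npu.conf` followed by `sysctl --system` reloads all sysctl configuration files, including the new one.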

Running SGLang Service

Running Service For Large Language Models

PD Mixed Scene

Command

# Enabling CPU Affinity
export SGLANG_SET_CPU_AFFINITY=1
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attention-backend ascend \
  --host 127.0.0.1 \
  --port 30000

PD Disaggregation Scene

Launch Prefill Server

Atlas 800I A3
Atlas 800I A2

Command (Atlas 800I A3)

# Enabling CPU Affinity
export SGLANG_SET_CPU_AFFINITY=1

# PREFILL_IP: IP address of the first Prefill Server
# FREE_PORT: any available port
# all SGLang servers need to be configured with the same PREFILL_IP and FREE_PORT
export ASCEND_MF_STORE_URL="tcp://PREFILL_IP:FREE_PORT"
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --disaggregation-mode prefill \
    --disaggregation-transfer-backend ascend \
    --disaggregation-bootstrap-port 8995 \
    --attention-backend ascend \
    --device npu \
    --base-gpu-id 0 \
    --tp-size 1 \
    --host 127.0.0.1 \
    --port 30001

Command (Atlas 800I A2)

# Enabling CPU Affinity
export SGLANG_SET_CPU_AFFINITY=1

# PREFILL_IP: IP address of the first Prefill Server
# FREE_PORT: any available port
# all SGLang servers need to be configured with the same PREFILL_IP and FREE_PORT
export ASCEND_MF_STORE_URL="tcp://PREFILL_IP:FREE_PORT"
export ASCEND_MF_TRANSFER_PROTOCOL="device_rdma"
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --disaggregation-mode prefill \
    --disaggregation-transfer-backend ascend \
    --disaggregation-bootstrap-port 8995 \
    --attention-backend ascend \
    --device npu \
    --base-gpu-id 0 \
    --tp-size 1 \
    --host 127.0.0.1 \
    --port 30001

Launch Decode Server

Atlas 800I A3
Atlas 800I A2

Command (Atlas 800I A3)

# PREFILL_IP: IP address of the first Prefill Server
# FREE_PORT: any available port
# all SGLang servers need to be configured with the same PREFILL_IP and FREE_PORT
export ASCEND_MF_STORE_URL="tcp://PREFILL_IP:FREE_PORT"
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --disaggregation-mode decode \
    --disaggregation-transfer-backend ascend \
    --attention-backend ascend \
    --device npu \
    --base-gpu-id 1 \
    --tp-size 1 \
    --host 127.0.0.1 \
    --port 30002

Command (Atlas 800I A2)

# PREFILL_IP: IP address of the first Prefill Server
# FREE_PORT: any available port
# all SGLang servers need to be configured with the same PREFILL_IP and FREE_PORT
export ASCEND_MF_STORE_URL="tcp://PREFILL_IP:FREE_PORT"
export ASCEND_MF_TRANSFER_PROTOCOL="device_rdma"
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --disaggregation-mode decode \
    --disaggregation-transfer-backend ascend \
    --attention-backend ascend \
    --device npu \
    --base-gpu-id 1 \
    --tp-size 1 \
    --host 127.0.0.1 \
    --port 30002
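
`PREFILL_IP` and `FREE_PORT` in the prefill and decode commands are placeholders, and every server must end up with the same `ASCEND_MF_STORE_URL`. A small sketch for building that value from two variables is shown below. `mf_store_url` is a hypothetical helper that only validates the port range; it does not check that the port is actually free.

```shell
# mf_store_url IP PORT
# Hypothetical helper: prints "tcp://IP:PORT" after checking that PORT is a
# number between 1 and 65535.
mf_store_url() {
    local ip="$1"
    local port="$2"
    case "$port" in
        ''|*[!0-9]*) echo "port must be numeric: $port" >&2; return 1 ;;
    esac
    if [ "$port" -lt 1 ] || [ "$port" -gt 65535 ]; then
        echo "port out of range: $port" >&2
        return 1
    fi
    echo "tcp://${ip}:${port}"
}
```

On each node, `export ASCEND_MF_STORE_URL="$(mf_store_url "$PREFILL_IP" "$FREE_PORT")"` then yields an identical value as long as the two variables are set identically.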

Launch Router

Command

python3 -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://127.0.0.1:30001 8995 \
    --decode http://127.0.0.1:30002 \
    --host 127.0.0.1 \
    --port 30000

The 8995 in the router command is the disaggregation bootstrap port. It must match the --disaggregation-bootstrap-port value set on the prefill server.

Running Service For Multimodal Language Models

PD Mixed Scene

Command

python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-VL-30B-A3B-Instruct \
    --host 127.0.0.1 \
    --port 30000 \
    --tp 4 \
    --device npu \
    --attention-backend ascend \
    --mm-attention-backend ascend_attn \
    --disable-radix-cache \
    --trust-remote-code \
    --enable-multimodal \
    --sampling-backend ascend

Testing the Service

Once the server prints “The server is fired up and ready to roll!” in the logs, it is ready to accept requests.

Which port to send requests to

The port you use depends on your deployment mode:

Scenario	Where to send requests
Non-PD (single server)	The server’s `--port` (e.g., `30000` in the examples above)
Non-PD (multi-node)	The primary node’s (`--node-rank 0`) `--port`; do not send requests to worker nodes
PD disaggregation	The router’s `--port` (e.g., `30000` in the examples above); do not send requests directly to prefill or decode servers

SGLang defaults to port 30000 when --port is not specified. The examples in this guide use explicit ports for clarity.

Health Check

Command

curl http://127.0.0.1:30000/health

A successful response returns HTTP 200 with an empty body.
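
In scripts, the health check is usually polled until the server comes up. The following is a minimal retry sketch; `wait_until_ready` is a hypothetical helper that takes the probe command as arguments, so the retry logic is independent of the endpoint.

```shell
# wait_until_ready MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND up to MAX_ATTEMPTS times and sleeps
# SLEEP_SECONDS after each failed attempt. Succeeds as soon as COMMAND does.
wait_until_ready() {
    local max_attempts="$1"
    local sleep_seconds="$2"
    local attempt=1
    shift 2
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$sleep_seconds"
    done
    return 1
}
```

For example, `wait_until_ready 60 5 curl -sf http://127.0.0.1:30000/health` retries for up to about five minutes; `curl -f` makes HTTP error statuses count as failures. The attempt count and interval are arbitrary and should be sized to the model's load time.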

Generate (Native Endpoint)

Command

curl http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "What is the capital of France?",
    "sampling_params": {"temperature": 0, "max_new_tokens": 128}
  }'

The expected output should contain “Paris”.
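
To turn that expectation into a scripted smoke test, the response body can be searched for the expected string. This sketch makes no assumption about the JSON layout of the response; `body_contains` is a hypothetical helper that matches a fixed string anywhere in the body read from standard input.

```shell
# body_contains NEEDLE
# Hypothetical helper: reads a response body on stdin and succeeds if it
# contains NEEDLE as a fixed (non-regex) string.
body_contains() {
    grep -qF -- "$1"
}
```

Piping the output of the `curl` command above through `body_contains Paris` returns a zero exit status when the answer mentions Paris.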

Chat Completions (OpenAI-Compatible)

Command

curl http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

Some models include their thinking process in the response. To disable this output, configure the request as follows:

Command

curl http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Eco-Tech/Qwen3.5-27B-w8a8-mtp",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'

The expected output should contain “Paris”.

Multimodal Chat Completions

The image URL in the example below references an external resource (raw.githubusercontent.com). Make sure the server has internet access so the image can be downloaded at inference time. Alternatively, you can use a locally accessible URL or base64-encoded image data.

Command

curl http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png"}},
        {"type": "text", "text": "Describe this image."}
      ]
    }]
  }'
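
For servers without internet access, the base64 alternative mentioned above can be prepared from a local file. The sketch below builds a data URL; `image_data_url` is a hypothetical helper, and it assumes the server accepts a `data:<mime>;base64,<data>` URL in the `image_url.url` field, as OpenAI-compatible APIs commonly do.

```shell
# image_data_url FILE [MIME_TYPE]
# Hypothetical helper: prints a data URL for FILE. MIME_TYPE defaults to
# image/png.
image_data_url() {
    local file="$1"
    local mime="${2:-image/png}"
    # tr removes the line wraps that some base64 implementations insert.
    printf 'data:%s;base64,%s\n' "$mime" "$(base64 < "$file" | tr -d '\n')"
}
```

The printed value replaces the `https://...` URL in the request body. For large images, write the JSON payload to a file and send it with `curl -d @payload.json` to stay clear of command-line length limits.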
