# SGLang installation with NPU support

You can install SGLang using any of the methods below. Go through the `System Settings` section to make sure your cluster runs at maximum performance. If you run into any problems, open an issue [at sglang](https://github.com/sgl-project/sglang/issues).

## Component Version Mapping For SGLang

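| Component | Version | How to obtain |
| ----------------- | ------- | ------------------------------------- |
| HDK | 25.5.2 | link |
| CANN | 8.5.0 | Obtain Images |
| Pytorch Adapter | 7.3.0 | link |
| MemFabric | 1.0.5 | `pip install memfabric-hybrid==1.0.5` |
| Triton | 3.2.0 | `pip install triton-ascend` |
| SGLang NPU Kernel | NA | link |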

### Obtain CANN Image

Ensure sufficient disk space before pulling images. Each Docker image requires at least **30 GB** of free space.

You can obtain the dependencies for a specific CANN version through an image. Two tags are published, `a3` and `910b`:

```bash
docker pull quay.io/ascend/cann:8.5.0-a3-ubuntu22.04-py3.11
```

```bash
docker pull quay.io/ascend/cann:8.5.0-910b-ubuntu22.04-py3.11
```

## Preparing the Running Environment

### Method 1: Installing from source with prerequisites

#### Python Version

**Only `python==3.11` is currently supported.** If you don't want to break the system's pre-installed Python, install it with [conda](https://github.com/conda/conda).

```bash
conda create --name sglang_npu python=3.11
conda activate sglang_npu
```

**Note on Anaconda repository restrictions**

If you encounter an error like “Terms of Service have not been accepted” during the `conda create` step, the default Anaconda repository is blocking package downloads. To resolve this, configure a mirror (e.g., Tsinghua Open Source Mirror):

```bash
# Add Tsinghua mirrors
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
conda config --set show_channel_urls yes

# Edit the system-level conda config to remove any hardcoded defaults
vi /root/miniconda3/.condarc
```

Inside `/root/miniconda3/.condarc`, delete or comment out any lines containing `defaults` or official Anaconda URLs.
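If you would rather not edit the file by hand, the same cleanup can be sketched with `sed`. This is only an illustration: the function name `comment_out_default_channels` is hypothetical, and the two patterns it matches (`defaults` and `repo.anaconda.com`) are assumptions about what your `.condarc` contains, so review the result before relying on it.

```bash
# Comment out lines in a condarc file that mention the `defaults` channel
# or the official Anaconda host (assumed here to be repo.anaconda.com).
# A backup of the original file is kept next to it with a .bak suffix.
comment_out_default_channels() {
  local condarc="$1"
  # Only lines that do not already start with '#' get a '# ' prefix.
  sed -i.bak -E '/defaults|repo\.anaconda\.com/ s/^([^#])/# \1/' "$condarc"
}
```

Calling `comment_out_default_channels /root/miniconda3/.condarc` leaves mirror entries untouched and writes the previous contents to `/root/miniconda3/.condarc.bak`.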
Then remove the failed environment and recreate it:

```bash
conda clean -i
conda env remove -n sglang_npu
conda create --name sglang_npu python=3.11
conda activate sglang_npu
```

#### CANN

Before you start working with SGLang on Ascend, install version 8.5.0 of the CANN Toolkit, the Kernels operator package, and NNAL. See the [installation guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/850/softwareinst/instg/instg_0008.html?Mode=PmIns\&InstallType=local\&OS=openEuler\&Software=cannToolKit).

#### MemFabric-Hybrid

If you want to use PD disaggregation mode, you need to install MemFabric-Hybrid. MemFabric-Hybrid is a drop-in replacement for the Mooncake Transfer Engine that enables KV cache transfer on Ascend NPU clusters.

```bash
pip install memfabric-hybrid==1.0.5
```

#### Pytorch and Pytorch Framework Adaptor on Ascend

```bash
PYTORCH_VERSION=2.8.0
TORCHVISION_VERSION=0.23.0
TORCH_NPU_VERSION=2.8.0.post2
pip install torch==$PYTORCH_VERSION torchvision==$TORCHVISION_VERSION --index-url https://download.pytorch.org/whl/cpu
pip install torch_npu==$TORCH_NPU_VERSION
```

If you use a different version of `torch`, install the matching `torch_npu` by following the [installation guide](https://github.com/Ascend/pytorch/blob/master/README.md).

#### Triton on Ascend

We provide our own implementation of Triton for Ascend.

```bash
pip install triton-ascend
```

To install Triton on Ascend from nightly builds or from source, follow the [installation guide](https://gitcode.com/Ascend/triton-ascend/blob/master/docs/sources/getting-started/installation.md).

#### SGLang Kernels NPU

We provide SGL kernels for Ascend NPU. See the [installation guide](https://github.com/sgl-project/sgl-kernel-npu/blob/main/python/sgl_kernel_npu/README.md).
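Before moving on, it can help to confirm that the packages installed so far match the versions pinned in this guide. The sketch below is one way to do that with `pip show`; the helper names `installed_version` and `report_pin` are hypothetical, and the check only compares version strings, so it says nothing about whether the NPU itself is usable.

```bash
# Print the installed version of a pip package (empty output if it is missing).
installed_version() {
  python3 -m pip show "$1" 2>/dev/null | awk -F': ' '$1 == "Version" {print $2}'
}

# Compare a found version with the expected pin and print one status line.
# Returns non-zero on a mismatch so it can be used in scripts.
report_pin() {
  local name="$1" expected="$2" actual="$3"
  if [ "$actual" = "$expected" ]; then
    echo "OK $name==$actual"
  else
    echo "MISMATCH $name: expected $expected, found ${actual:-none}"
    return 1
  fi
}

# Example (run inside the sglang_npu environment):
#   report_pin torch 2.8.0 "$(installed_version torch)"
#   report_pin torchvision 0.23.0 "$(installed_version torchvision)"
#   report_pin torch_npu 2.8.0.post2 "$(installed_version torch_npu)"
#   report_pin memfabric-hybrid 1.0.5 "$(installed_version memfabric-hybrid)"
```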
#### DeepEP-compatible Library

We provide a DeepEP-compatible library as a drop-in replacement for deepseek-ai's DeepEP library. See the [installation guide](https://github.com/sgl-project/sgl-kernel-npu/blob/main/python/deep_ep/README.md).

#### Some other dependencies

```bash
# libGL
apt update
apt install libgl1 libglib2.0-0

# ensure setuptools contains pkg_resources module
pip install "setuptools<80"
```

#### Installing SGLang from source

```bash
# Use the last release branch
git clone https://github.com/sgl-project/sglang.git
cd sglang
mv python/pyproject_npu.toml python/pyproject.toml
pip install -e "python[all_npu]"
```

### Method 2: Using Docker Image

#### Obtain Image

You can either download the SGLang image or build one from the Dockerfile to obtain the Ascend NPU image.

Ensure sufficient disk space before pulling images. Each Docker image requires at least **30 GB** of free space. If you need to download model weights, check the model size at [ModelScope](https://www.modelscope.cn/models) to reserve enough space.

1. Download SGLang image

We publish both **stable releases** and **daily builds**. Choose a stable release tag (e.g., `v0.5.10-npu.rc1-a3`) if you prefer a validated version, or a daily build tag (e.g., `main-cann8.5.0-a3`) if you need the latest development changes.

```bash
# Stable release
docker pull quay.io/ascend/sglang:v0.5.10-npu.rc1-a3

# Daily build
docker pull quay.io/ascend/sglang:main-cann8.5.0-a3
```

```bash
# Stable release
docker pull quay.io/ascend/sglang:v0.5.10-npu.rc1-910b

# Daily build
docker pull quay.io/ascend/sglang:main-cann8.5.0-910b
```

2. Build an image based on Dockerfile

```bash
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker

# Build the docker image
# If there are network errors, please modify the Dockerfile to use offline dependencies or use a proxy
# <arch> is the target architecture of the image, e.g. amd64, arm64
# <image_name> is a placeholder for the tag to give the built image
docker build --build-arg TARGETARCH=<arch> -t <image_name> -f npu.Dockerfile .
```

#### Create Docker

**Notice:** `--privileged` and `--network=host` are required by RDMA, which is typically needed by Ascend NPU clusters.

On hosts that expose 16 NPU devices (`/dev/davinci0` to `/dev/davinci15`):

```bash
alias drun='docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
    --device=/dev/davinci8 --device=/dev/davinci9 --device=/dev/davinci10 --device=/dev/davinci11 \
    --device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \
    --device=/dev/davinci_manager --device=/dev/hisi_hdc \
    --volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --volume /etc/ascend_install.info:/etc/ascend_install.info \
    --volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/'

# Set HF_TOKEN so that SGLang can download the model.
# <secret> and <image_name> are placeholders for your token and the image to run.
drun --env "HF_TOKEN=<secret>" \
    <image_name> \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --attention-backend ascend
```

On hosts that expose 8 NPU devices (`/dev/davinci0` to `/dev/davinci7`):

```bash
alias drun='docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
    --device=/dev/davinci_manager --device=/dev/hisi_hdc \
    --volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --volume /etc/ascend_install.info:/etc/ascend_install.info \
    --volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/'

# Set HF_TOKEN so that SGLang can download the model.
# <secret> and <image_name> are placeholders for your token and the image to run.
drun --env "HF_TOKEN=<secret>" \
    <image_name> \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --attention-backend ascend
```

SGLang serves on `http://127.0.0.1:30000` by default. You can change the host and port with the `--host` and `--port` parameters.

## System Settings

### CPU performance power scheme

The default power scheme on Ascend hardware is `ondemand`, which can reduce performance. Changing it to `performance` is recommended.
```bash
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Make sure changes are applied successfully
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor # shows performance
```

### Disable NUMA balancing

```bash
sudo sysctl -w kernel.numa_balancing=0

# Check
cat /proc/sys/kernel/numa_balancing # shows 0
```

### Prevent swapping out system memory

```bash
sudo sysctl -w vm.swappiness=10

# Check
cat /proc/sys/vm/swappiness # shows 10
```

## Running SGLang Service

### Running Service For Large Language Models

#### PD Mixed Scene

```bash
# Enabling CPU Affinity
export SGLANG_SET_CPU_AFFINITY=1
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --attention-backend ascend \
    --host 127.0.0.1 \
    --port 8000
```

#### PD Disaggregation Scene

1. Launch Prefill Server

Without `ASCEND_MF_TRANSFER_PROTOCOL`:

```bash
# Enabling CPU Affinity
export SGLANG_SET_CPU_AFFINITY=1

# PREFILL_IP: IP address of the first Prefill Server
# FREE_PORT: any available port
# all SGLang servers need to be configured with the same PREFILL_IP and FREE_PORT
export ASCEND_MF_STORE_URL="tcp://PREFILL_IP:FREE_PORT"

python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --disaggregation-mode prefill \
    --disaggregation-transfer-backend ascend \
    --disaggregation-bootstrap-port 8995 \
    --attention-backend ascend \
    --device npu \
    --base-gpu-id 0 \
    --tp-size 1 \
    --host 127.0.0.1 \
    --port 8000
```

With `ASCEND_MF_TRANSFER_PROTOCOL` set to `device_rdma`:

```bash
# Enabling CPU Affinity
export SGLANG_SET_CPU_AFFINITY=1

# PREFILL_IP: IP address of the first Prefill Server
# FREE_PORT: any available port
# all SGLang servers need to be configured with the same PREFILL_IP and FREE_PORT
export ASCEND_MF_STORE_URL="tcp://PREFILL_IP:FREE_PORT"
export ASCEND_MF_TRANSFER_PROTOCOL="device_rdma"

python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --disaggregation-mode prefill \
    --disaggregation-transfer-backend ascend \
    --disaggregation-bootstrap-port 8995 \
    --attention-backend ascend \
    --device npu \
    --base-gpu-id 0 \
    --tp-size 1 \
    --host 127.0.0.1 \
    --port 8000
```

2. Launch Decode Server

Without `ASCEND_MF_TRANSFER_PROTOCOL`:

```bash
# PREFILL_IP: IP address of the first Prefill Server
# FREE_PORT: any available port
# all SGLang servers need to be configured with the same PREFILL_IP and FREE_PORT
export ASCEND_MF_STORE_URL="tcp://PREFILL_IP:FREE_PORT"

python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --disaggregation-mode decode \
    --disaggregation-transfer-backend ascend \
    --attention-backend ascend \
    --device npu \
    --base-gpu-id 1 \
    --tp-size 1 \
    --host 127.0.0.1 \
    --port 8001
```

With `ASCEND_MF_TRANSFER_PROTOCOL` set to `device_rdma`:

```bash
# PREFILL_IP: IP address of the first Prefill Server
# FREE_PORT: any available port
# all SGLang servers need to be configured with the same PREFILL_IP and FREE_PORT
export ASCEND_MF_STORE_URL="tcp://PREFILL_IP:FREE_PORT"
export ASCEND_MF_TRANSFER_PROTOCOL="device_rdma"

python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --disaggregation-mode decode \
    --disaggregation-transfer-backend ascend \
    --attention-backend ascend \
    --device npu \
    --base-gpu-id 1 \
    --tp-size 1 \
    --host 127.0.0.1 \
    --port 8001
```

3. Launch Router

```bash
python3 -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://127.0.0.1:8000 8995 \
    --decode http://127.0.0.1:8001 \
    --host 127.0.0.1 \
    --port 6688
```

### Running Service For Multimodal Language Models

#### PD Mixed Scene

```bash
python3 -m sglang.launch_server \
    --model-path Qwen3-VL-30B-A3B-Instruct \
    --host 127.0.0.1 \
    --port 8000 \
    --tp 4 \
    --device npu \
    --attention-backend ascend \
    --mm-attention-backend ascend_attn \
    --disable-radix-cache \
    --trust-remote-code \
    --enable-multimodal \
    --sampling-backend ascend
```

## Testing the Service

Once the server prints `The server is fired up and ready to roll!` in the logs, it is ready to accept requests.

### Which port to send requests to

The port you use depends on your deployment mode:

| Scenario | Where to send requests |
| ---------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| Non-PD (single server) | The server's `--port` (e.g., `8000` in the examples above) |
| Non-PD (multi-node) | The primary node's (`--node-rank 0`) `--port`; do **not** send requests to worker nodes |
| PD disaggregation | The router's `--port` (e.g., `6688` in the examples above); do **not** send requests directly to prefill or decode servers |

SGLang serves on port `30000` by default if `--port` is not specified. The examples in this guide use explicit ports for clarity.

If you are using PD disaggregation, replace `8000` with your router's port (e.g., `6688`) in the following examples.

### Health Check

```bash
curl http://127.0.0.1:8000/health
```

A successful response returns HTTP 200 with an empty body.
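Because the server can take a while to load a model, a script may want to wait for the health endpoint rather than call it once. The following is a minimal sketch of such a loop; the function name `wait_for_health` and the default retry count and delay are assumptions chosen for illustration, not values taken from SGLang.

```bash
# Poll a URL until it returns HTTP 200 or the attempts run out.
# Arguments: URL, number of attempts (default 60), delay in seconds (default 5).
wait_for_health() {
  local url="$1" attempts="${2:-60}" delay="${3:-5}"
  local i=1 code
  while [ "$i" -le "$attempts" ]; do
    # -w '%{http_code}' prints only the status code; the body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)
    if [ "$code" = "200" ]; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_health http://127.0.0.1:8000/health` blocks until the server answers with HTTP 200, or gives up with a non-zero exit status after the configured number of attempts.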
### Generate (Native Endpoint)

```bash
curl http://127.0.0.1:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "What is the capital of France?",
    "sampling_params": {"temperature": 0, "max_new_tokens": 128}
  }'
```

The expected output should contain "Paris".

### Chat Completions (OpenAI-Compatible)

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'
```

The expected output should contain "Paris".

### Multimodal Chat Completions

The image URL in the example below references an external resource (`raw.githubusercontent.com`). Make sure the server has internet access so the image can be downloaded at inference time. Alternatively, you can use a locally accessible URL or base64-encoded image data.

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-VL-30B-A3B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png"}},
        {"type": "text", "text": "Describe this image."}
      ]
    }]
  }'
```
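For the base64 alternative mentioned above, one possible approach is to wrap a local file in a `data:` URL and send it in the same `image_url` field. The sketch below assumes the server accepts the OpenAI-style `data:<mime>;base64,<payload>` form; the helper names `image_data_url` and `describe_local_image` are hypothetical, and the default MIME type of `image/png` should be changed for other formats.

```bash
# Build a data URL from a local image file.
# Arguments: file path, MIME type (default image/png).
image_data_url() {
  local path="$1" mime="${2:-image/png}"
  # tr removes the line wraps that some base64 implementations insert.
  printf 'data:%s;base64,%s' "$mime" "$(base64 < "$path" | tr -d '\n')"
}

# Send the local image to the chat completions endpoint.
# Arguments: file path, base URL (default http://127.0.0.1:8000).
describe_local_image() {
  local path="$1" base_url="${2:-http://127.0.0.1:8000}"
  local url
  url=$(image_data_url "$path")
  # The request body is read from stdin (-d @-) so that large images
  # do not run into command-line length limits.
  curl -s "$base_url/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
{
  "model": "Qwen3-VL-30B-A3B-Instruct",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "$url"}},
      {"type": "text", "text": "Describe this image."}
    ]
  }]
}
EOF
}
```

With this in place, `describe_local_image ./example_image.png` sends the file without requiring the server to have internet access.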