Ascend NPU Quickstart

This page covers only the simplest deployment flow using the official container image. For the complete installation guide across all scenarios (source install, Docker build, system settings, version mapping, etc.), see SGLang installation with NPUs support.

Prerequisites

Supported Devices

Atlas 800I A2 inference series (Atlas 800I A2)
Atlas 800I A3 inference series (Atlas 800I A3)

To identify your device, run npu-smi info -l: A3 reports Chip Count: 2 per NPU, while A2 reports Chip Count: 1 per NPU. For hardware details, see the Ascend NPU Reference.
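The check above can be scripted. The following is a minimal sketch, not part of the official tooling: classify_atlas is a hypothetical helper, and it assumes the chip count appears on a line of the form "Chip Count : <n>" in the npu-smi output.

```shell
# classify_atlas: read `npu-smi info -l` output on stdin and print A3, A2, or unknown.
# Assumption: the chip count appears on a line of the form "Chip Count : <n>".
classify_atlas() {
    # Extract the number from the first "Chip Count" line.
    chips=$(sed -n 's/.*Chip Count[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1)
    case "$chips" in
        2) echo "A3" ;;
        1) echo "A2" ;;
        *) echo "unknown" ;;
    esac
}

# Usage on the host (requires the Ascend driver tools):
#   npu-smi info -l | classify_atlas
```

If the helper prints unknown, inspect the raw npu-smi output by hand, since the layout may differ between driver versions.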

Docker

Ensure Docker is installed and the Docker daemon is running on your host machine. Verify with:

docker --version && docker info

If Docker is not installed, follow the official Docker installation guide for your operating system.

Set up the environment using a container

Check the available disk space with df -h before proceeding. The Docker image requires at least 30 GB of free space. If you also need to download model weights, look up the model size on ModelScope and reserve that much additional space.
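The disk space check can be automated along these lines. This is a sketch under stated assumptions: has_free_gb is a hypothetical helper, it counts in GiB (slightly stricter than 30 GB), and /var/lib/docker is only Docker's default data directory on Linux, which may differ on your host.

```shell
# has_free_gb PATH MIN_GB: succeed if the filesystem holding PATH has at least MIN_GB GiB available.
has_free_gb() {
    # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# Usage (adjust the path if your Docker data directory is elsewhere):
#   has_free_gb /var/lib/docker 30 || echo "Less than 30 GB free for the image" >&2
```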

We publish both stable releases and daily builds. Choose a stable release tag (e.g., v0.5.13.post1-cann9.0.0-a3) if you prefer a validated version, or a daily build tag (e.g., main-cann9.0.0-a3) if you need the latest development changes. Tags ending in -a3 are for Atlas 800I A3; the corresponding Atlas 800I A2 tags end in -910b.

If you have already downloaded model weights to a local path (e.g., /path/to/model), mount the path into the container by adding --volume /path/to/model:/path/to/model to the docker run command below.

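Use the command that matches your device. The first command is for Atlas 800I A3 and maps /dev/davinci0 through /dev/davinci15; the second is for Atlas 800I A2 and maps /dev/davinci0 through /dev/davinci7.

Atlas 800I A3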

Command

# Choose one (uncomment the line you want):
export IMAGE=quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-a3       # Stable release
# export IMAGE=quay.io/ascend/sglang:main-cann9.0.0-a3               # Daily build

docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
    --device=/dev/davinci8 --device=/dev/davinci9 --device=/dev/davinci10 --device=/dev/davinci11 \
    --device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --volume /usr/local/sbin:/usr/local/sbin \
    --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --volume /etc/ascend_install.info:/etc/ascend_install.info \
    --volume /var/queue_schedule:/var/queue_schedule \
    --volume ~/.cache/:/root/.cache/ \
    --entrypoint=bash \
    $IMAGE

Atlas 800I A2

Command

# Choose one (uncomment the line you want):
export IMAGE=quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-910b       # Stable release
# export IMAGE=quay.io/ascend/sglang:main-cann9.0.0-910b               # Daily build

docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --volume /usr/local/sbin:/usr/local/sbin \
    --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --volume /etc/ascend_install.info:/etc/ascend_install.info \
    --volume /var/queue_schedule:/var/queue_schedule \
    --volume ~/.cache/:/root/.cache/ \
    --entrypoint=bash \
    $IMAGE
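The two docker run commands differ only in the image tag and in how many /dev/davinciN devices they map. As an optional convenience, the device flags can be generated instead of typed out. This is a sketch; davinci_flags is a hypothetical helper, not something shipped with the image.

```shell
# davinci_flags COUNT: print --device flags for /dev/davinci0 .. /dev/davinci(COUNT-1) on one line.
davinci_flags() {
    i=0
    flags=""
    while [ "$i" -lt "$1" ]; do
        # Separate flags with a single space, with no leading space before the first one.
        flags="$flags${flags:+ }--device=/dev/davinci$i"
        i=$((i + 1))
    done
    printf '%s\n' "$flags"
}

# Usage: replace the explicit --device=/dev/davinciN lines in the commands above with
#   $(davinci_flags 16)   # Atlas 800I A3
#   $(davinci_flags 8)    # Atlas 800I A2
# Leave the substitution unquoted so that each flag becomes a separate argument.
```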

Usage

SGLang is preinstalled in the container. Run pip show sglang to check the installed version.

Start SGLang server

When --model-path is a Hugging Face model ID, SGLang downloads the model from Hugging Face automatically. If the model is already downloaded to a local path (and that path has been mounted into the container), pass the path instead, for example --model-path /path/to/model.

Command

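# Optional: set HF_ENDPOINT to a mirror if huggingface.co is not reachable from your network
export HF_ENDPOINT=https://hf-mirror.com

# Optional: set your own HF_TOKEN to download restricted models
# (quoted so the shell does not treat < and > as redirections; replace <secret> with your token)
export HF_TOKEN="<secret>"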

# Start SGLang server
# It may take several minutes to download the model on the first run
sglang serve --model-path Qwen/Qwen2.5-7B-Instruct --attention-backend ascend &

Server startup may take several minutes. Once you see output like the following, the server is running.

Output

INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
The server is fired up and ready to roll!
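Instead of watching the log, a script can poll the server until it responds. The sketch below is one way to do that; wait_for_http is a hypothetical helper, and the /health path in the usage line is an assumption about the server's HTTP routes, so substitute any endpoint you know returns a success status.

```shell
# wait_for_http URL TIMEOUT_SECONDS: poll URL until it returns a successful status or the timeout expires.
wait_for_http() {
    deadline=$(( $(date +%s) + $2 ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        # -f makes curl fail on HTTP error statuses; --max-time bounds each attempt.
        if curl -fsS -o /dev/null --max-time 2 "$1" 2>/dev/null; then
            return 0
        fi
        sleep 1
    done
    return 1
}

# Usage (assumes the default port shown above and a /health endpoint):
#   wait_for_http http://127.0.0.1:30000/health 600 && echo "server is ready"
```

A generous timeout is advisable on the first run, because the model download counts toward startup time.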

Send a test request

With the server running, send a generation request to its /generate endpoint:

Command

curl -X POST http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 16
        }
    }'

If the “text” field in the response contains “Paris”, the server is working as expected.
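The check described above can be done in a pipeline. This is a rough sketch that assumes the response is a JSON object with a top-level "text" string; response_mentions is a hypothetical helper, and it uses a simple pattern match rather than a real JSON parser.

```shell
# response_mentions WORD: read a /generate JSON response on stdin and
# succeed if its "text" value contains WORD.
# Simplification: the pattern stops at the first double quote, so text that
# contains an escaped quote before WORD is not matched.
response_mentions() {
    grep -q "\"text\"[[:space:]]*:[[:space:]]*\"[^\"]*$1"
}

# Usage:
#   curl -s -X POST http://localhost:30000/generate \
#       -H "Content-Type: application/json" \
#       -d '{"text": "The capital of France is", "sampling_params": {"temperature": 0, "max_new_tokens": 16}}' \
#       | response_mentions Paris && echo "server is working as expected"
```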

Stop server and exit container

The SGLang server is running as a background process. You can send a SIGINT signal to stop it.

Command

SGLANG_PID=$(pgrep -f "sglang serve")
kill -SIGINT $SGLANG_PID

Wait a moment for the server to shut down gracefully. The output should look like the following:

Output

INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [<SGLANG_PID>]

The server has now stopped. To verify, run pgrep -f "sglang serve" again; it should print nothing, because no matching process remains. (ps -ef | grep sglang also works, but its output includes the grep command itself.) Then exit the container by pressing Ctrl+D.
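In a script, "wait a moment" can be replaced by polling until the process is gone. The following is a minimal sketch; wait_for_exit is a hypothetical helper, and it assumes you are allowed to signal the process (true for root inside the container).

```shell
# wait_for_exit PID TIMEOUT_SECONDS: succeed once PID no longer exists,
# fail if it is still alive after the timeout.
wait_for_exit() {
    remaining=$2
    # kill -0 sends no signal; it only tests whether the process can be signalled.
    while kill -0 "$1" 2>/dev/null; do
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Usage after sending SIGINT (pgrep may have returned several PIDs):
#   for pid in $SGLANG_PID; do
#       wait_for_exit "$pid" 30 || echo "process $pid is still running" >&2
#   done
```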
