Prerequisites
Supported Devices
- Atlas 800I A2 inference series (Atlas 800I A2)
- Atlas 800I A3 inference series (Atlas 800I A3)
Set up the environment using a container
Ensure sufficient disk space before proceeding. The Docker image requires at least 30 GB of free space. If you need to download model weights, check the model size at ModelScope to reserve enough space.
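The disk-space requirement can be checked from the shell before pulling the image. The following is a minimal sketch, not part of the SGLang tooling: has_free_gb is a hypothetical helper name, and it treats 1 GB as 1024 * 1024 KiB blocks, which is slightly stricter than a decimal gigabyte.

```shell
# Hypothetical helper: succeed if the filesystem that holds TARGET_DIR
# has at least MIN_GB gigabytes available.
# Usage: has_free_gb TARGET_DIR MIN_GB
has_free_gb() {
    target_dir=$1
    min_gb=$2
    # df -Pk prints POSIX-format output in 1 KiB blocks; column 4 is "Available"
    avail_kb=$(df -Pk "$target_dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}
```

For example, assuming Docker stores images under its default data root /var/lib/docker, running has_free_gb /var/lib/docker 30 || echo "Not enough free space" flags a host that cannot hold the image. Point it at the model cache directory as well if you plan to download weights.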
We publish both stable releases and daily builds. Choose a stable release tag (e.g., v0.5.10-npu.rc1-a3) if you prefer a validated version, or a daily build tag (e.g., main-cann8.5.0-a3) if you need the latest development changes.
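If you script the setup, the tag choice can be expressed as a small function. This is only an illustrative sketch: sglang_image is a hypothetical helper, and it hard-codes the example tags quoted above, which will change with future releases.

```shell
# Hypothetical helper: print the image reference for a release channel
# (stable | daily) and a device suffix (a3 | 910b), using this guide's example tags.
sglang_image() {
    channel=$1
    device=$2
    case "$device" in
        a3|910b) ;;
        *) echo "unknown device suffix: $device" >&2; return 1 ;;
    esac
    case "$channel" in
        stable) tag="v0.5.10-npu.rc1-$device" ;;
        daily)  tag="main-cann8.5.0-$device" ;;
        *) echo "unknown channel: $channel" >&2; return 1 ;;
    esac
    printf 'quay.io/ascend/sglang:%s\n' "$tag"
}
```

With this in place, export IMAGE=$(sglang_image stable a3) selects the stable A3 image, and an unrecognized channel or suffix fails with a message on standard error.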
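The first command block below targets the Atlas 800I A3 (image tags ending in -a3, devices davinci0 through davinci15). The second targets the Atlas 800I A2 (image tags ending in -910b, devices davinci0 through davinci7).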
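# Atlas 800I A3: choose ONE of the following two images.
# Stable release
export IMAGE=quay.io/ascend/sglang:v0.5.10-npu.rc1-a3
# Daily build (overrides the stable tag if both lines are run)
export IMAGE=quay.io/ascend/sglang:main-cann8.5.0-a3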
docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
--device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
--device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
--device=/dev/davinci8 --device=/dev/davinci9 --device=/dev/davinci10 --device=/dev/davinci11 \
--device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--volume /usr/local/sbin:/usr/local/sbin \
--volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
--volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--volume /etc/ascend_install.info:/etc/ascend_install.info \
--volume /var/queue_schedule:/var/queue_schedule \
--volume ~/.cache/:/root/.cache/ \
--entrypoint=bash \
$IMAGE
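# Atlas 800I A2: choose ONE of the following two images.
# Stable release
export IMAGE=quay.io/ascend/sglang:v0.5.10-npu.rc1-910b
# Daily build (overrides the stable tag if both lines are run)
export IMAGE=quay.io/ascend/sglang:main-cann8.5.0-910b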
docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
--device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
--device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--volume /usr/local/sbin:/usr/local/sbin \
--volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
--volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--volume /etc/ascend_install.info:/etc/ascend_install.info \
--volume /var/queue_schedule:/var/queue_schedule \
--volume ~/.cache/:/root/.cache/ \
--entrypoint=bash \
$IMAGE
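The two docker run commands differ only in the image tag and in how many /dev/davinciN devices they pass through (16 above for the A3, 8 for the A2). As a sketch, the repeated --device flags could be generated rather than typed out. davinci_device_flags is a hypothetical helper, and it assumes the devices are numbered contiguously from 0, as in the commands above.

```shell
# Hypothetical helper: print one --device flag per NPU for indices 0..COUNT-1.
# Usage: davinci_device_flags COUNT
davinci_device_flags() {
    count=$1
    i=0
    flags=""
    while [ "$i" -lt "$count" ]; do
        flags="$flags --device=/dev/davinci$i"
        i=$((i + 1))
    done
    # Drop the leading space before printing
    printf '%s\n' "${flags# }"
}
```

In the docker run command, the eight or sixteen explicit flags could then be replaced by an unquoted $(davinci_device_flags 8) or $(davinci_device_flags 16), relying on the shell's word splitting to turn the output into separate arguments.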
Usage
The SGLang server is installed in the container by default. You can use pip show sglang to check the version.
Start SGLang server
SGLang will automatically download the model from Hugging Face.
# Set HF_ENDPOINT to a mirror site if huggingface.co is not reachable from your network
export HF_ENDPOINT=https://hf-mirror.com
# Set your own HF_TOKEN to download restricted models
export HF_TOKEN=<secret>
# Start SGLang server
# It may take several minutes to download the model on the first run
sglang serve --model-path Qwen/Qwen2.5-7B-Instruct --attention-backend ascend &
If you see output like the following, the server is running.
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
The server is fired up and ready to roll!
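Because the server starts in the background and the first run may spend minutes downloading the model, a script cannot assume it is ready right after launch. One way to wait is to poll an HTTP endpoint until it answers. The sketch below is an assumption-laden example: wait_for_url is a hypothetical helper, and the timeout is approximate because each curl attempt may itself take up to 5 seconds.

```shell
# Hypothetical helper: poll URL until it returns an HTTP success status,
# or give up after roughly TIMEOUT_SECONDS.
# Usage: wait_for_url URL [TIMEOUT_SECONDS]
wait_for_url() {
    url=$1
    timeout=${2:-600}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # -s silent, -f treat HTTP errors as failure, -o discard the body
        if curl -sf -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

A possible call is wait_for_url http://127.0.0.1:30000/health 600 && echo "server ready". This assumes the server exposes a /health route on the port shown in the log above; substitute any URL that returns a success status once startup has completed.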
Send a test request
Send a generation request to the server's /generate endpoint:
curl -X POST http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "The capital of France is",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 16
}
}'
If the “text” field in the response contains “Paris”, the server is working as expected.
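The same check can be automated. The sketch below wraps the request shown above in a function and looks for "Paris" in the raw response body. Both function names are hypothetical, and the check is a plain substring match on the response text rather than real JSON parsing, so treat it as a smoke test only.

```shell
# Hypothetical helper: succeed if the response body in $1 contains the substring in $2.
response_contains() {
    printf '%s' "$1" | grep -q -- "$2"
}

# Hypothetical helper: send the test prompt to BASE_URL/generate and
# succeed only if the raw response mentions "Paris".
smoke_test() {
    base_url=${1:-http://localhost:30000}
    body='{"text": "The capital of France is", "sampling_params": {"temperature": 0, "max_new_tokens": 16}}'
    response=$(curl -s --max-time 60 -X POST "$base_url/generate" \
        -H "Content-Type: application/json" -d "$body")
    response_contains "$response" "Paris"
}
```

Running smoke_test && echo "server OK" against a healthy server should print the confirmation; an unreachable server yields an empty response and a non-zero exit status.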
Stop server and exit container
The SGLang server is running as a background process. You can send a SIGINT signal to stop it.
SGLANG_PID=$(pgrep -f "sglang serve")
kill -SIGINT $SGLANG_PID
The output should be like the following:
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [25310]
The server has now stopped. You can verify it with ps -ef | grep sglang, then exit the container by pressing Ctrl+D.
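When the shutdown is scripted, it helps to wait until the process has actually exited instead of inspecting ps output by hand. The following is a minimal sketch; wait_for_exit is a hypothetical helper and relies only on kill -0, which tests whether a PID exists without sending a signal.

```shell
# Hypothetical helper: wait until process PID has exited,
# or give up after TIMEOUT_SECONDS.
# Usage: wait_for_exit PID [TIMEOUT_SECONDS]
wait_for_exit() {
    pid=$1
    timeout=${2:-60}
    elapsed=0
    # kill -0 delivers no signal; it only reports whether the PID still exists
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

After sending SIGINT, a call such as wait_for_exit "$SGLANG_PID" 60 || echo "server still running" covers the case where SGLANG_PID holds a single PID. If pgrep matched several processes, call the helper once per PID.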