Introduction
DeepSeek-R1 is a Mixture-of-Experts (MoE) large language model developed by DeepSeek, featuring 671B total parameters
with 37B active parameters. It employs Multi-head Latent Attention (MLA) and DeepSeekMoE architecture, with built-in
multi-token prediction (MTP) for speculative decoding. The model excels at reasoning, math, and code tasks through
reinforcement learning-based training.
This document demonstrates the deployment of DeepSeek-R1 on Ascend NPUs using SGLang, including single-node PD mixed
mode, multi-node PD disaggregation mode, feature configuration, and performance optimization.
This document was written and validated against SGLang v0.5.13, which fully supports DeepSeek-R1. To use the latest
features (e.g., PD disaggregation, speculative decoding), use v0.5.13 or a later version.
Supported features
| Feature | Example usage |
|---|---|
| Tensor Parallelism | --tp-size 16 |
| Data Parallelism | --dp-size 16 |
| Expert Parallelism | --ep-size 16 --moe-a2a-backend deepep --deepep-mode auto |
| PD Disaggregation | --disaggregation-mode prefill --disaggregation-transfer-backend ascend |
| Quantization | --quantization modelslim |
| NPU Graph | enabled by default; disable with --disable-cuda-graph; control range via --cuda-graph-bs or --cuda-graph-max-bs; e.g. --cuda-graph-bs 4 8 20 21 22 |
| Speculative Decoding | --speculative-algorithm NEXTN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 |
| Overlap Schedule | export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 |
| DP LM Head | --enable-dp-lm-head |
| MLAPO | export SGLANG_NPU_USE_MLAPO=1 |
| Multistream MoE | export SGLANG_NPU_USE_MULTI_STREAM=1 |
| NZ Weight Format | export SGLANG_USE_FIA_NZ=1 |
The values in the Example usage column are for illustration only. Adjust them according to your hardware, deployment
mode, and workload. For parameter details, see Feature descriptions; for recommended configurations for each
deployment scenario, see Best practices. For compatibility and conflicts between features, see Feature Compatibility.
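As a rough illustration of how the table entries fit together, the sketch below prints one environment export and a launch command assembled from a few of the listed flags. It is not a validated configuration: `build_launch_cmd` is a hypothetical helper, `/models/DeepSeek-R1` is a placeholder path, and the chosen combination of flags is an assumption that should be checked against Best practices and Feature Compatibility before use.

```shell
# build_launch_cmd MODEL_PATH PORT
# Prints (does not run) an environment export plus a launch command built from
# flags in the table above. The flag values are the table's illustrative ones.
build_launch_cmd() {
  echo "export SGLANG_NPU_USE_MLAPO=1"
  printf '%s ' python3 -m sglang.launch_server \
    --model-path "$1" \
    --port "$2" \
    --tp-size 16 \
    --quantization modelslim \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 2 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 3
  printf '\n'
}

# /models/DeepSeek-R1 is a placeholder path; 6689 is the example port used in this document.
build_launch_cmd /models/DeepSeek-R1 6689
```

Printing the command instead of running it makes it easy to review the flags before starting the server.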
Prerequisites
Model weights
Before downloading model weights, check the model size to reserve enough disk space.
Ensure the available device memory exceeds the model weight size before deployment. For optimal throughput and latency,
refer to the best practice configurations, which may require additional nodes or cards.
It is recommended to download the model weights to a shared directory across multiple nodes.
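The disk-space check can be scripted before starting a download. The sketch below is one possible approach, assuming a POSIX `df` and `awk`; `free_gb` and `has_room` are hypothetical helper names, and the required size must come from the model repository you download from.

```shell
# free_gb DIR: print the free space, in whole GiB, of the filesystem holding DIR.
free_gb() {
  # df -Pk prints one line per filesystem; column 4 is available space in KiB.
  df -Pk "$1" | awk 'NR==2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# has_room DIR REQUIRED_GB: succeed only if DIR has at least REQUIRED_GB GiB free.
has_room() {
  [ "$(free_gb "$1")" -ge "$2" ]
}

# Example: check the current directory against a size you looked up yourself.
# REQUIRED_GB is a placeholder; set it to the real weight size plus some margin.
REQUIRED_GB="${REQUIRED_GB:-0}"
if has_room . "$REQUIRED_GB"; then
  echo "enough space: $(free_gb .) GiB free"
else
  echo "not enough space: $(free_gb .) GiB free, need $REQUIRED_GB GiB" >&2
fi
```

Point the check at the shared directory that all nodes will mount.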
Installation
The Docker image requires at least 30 GB of free space. Ensure sufficient disk space before pulling images.
The dependencies required for the NPU runtime environment are bundled into a Docker image published on quay.io, which
you can pull directly.
Both stable releases and daily builds are available. The following commands use the stable release tag.
For details, see Docker image versions.
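Two variants of the commands follow. The first, for Atlas 800I A3, uses the -a3 image tag and maps davinci0 through
davinci15. The second, for Atlas 800I A2, uses the -910b image tag and maps davinci0 through davinci7.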
docker pull quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-a3
docker run -itd --shm-size=16g --name ${NAME} \
--privileged=true --net=host \
-v /var/queue_schedule:/var/queue_schedule \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--device=/dev/davinci0:/dev/davinci0 \
--device=/dev/davinci1:/dev/davinci1 \
--device=/dev/davinci2:/dev/davinci2 \
--device=/dev/davinci3:/dev/davinci3 \
--device=/dev/davinci4:/dev/davinci4 \
--device=/dev/davinci5:/dev/davinci5 \
--device=/dev/davinci6:/dev/davinci6 \
--device=/dev/davinci7:/dev/davinci7 \
--device=/dev/davinci8:/dev/davinci8 \
--device=/dev/davinci9:/dev/davinci9 \
--device=/dev/davinci10:/dev/davinci10 \
--device=/dev/davinci11:/dev/davinci11 \
--device=/dev/davinci12:/dev/davinci12 \
--device=/dev/davinci13:/dev/davinci13 \
--device=/dev/davinci14:/dev/davinci14 \
--device=/dev/davinci15:/dev/davinci15 \
--device=/dev/davinci_manager:/dev/davinci_manager \
--device=/dev/hisi_hdc:/dev/hisi_hdc \
--entrypoint=bash \
quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-a3
docker pull quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-910b
docker run -itd --shm-size=16g --name ${NAME} \
--privileged=true --net=host \
-v /var/queue_schedule:/var/queue_schedule \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--device=/dev/davinci0:/dev/davinci0 \
--device=/dev/davinci1:/dev/davinci1 \
--device=/dev/davinci2:/dev/davinci2 \
--device=/dev/davinci3:/dev/davinci3 \
--device=/dev/davinci4:/dev/davinci4 \
--device=/dev/davinci5:/dev/davinci5 \
--device=/dev/davinci6:/dev/davinci6 \
--device=/dev/davinci7:/dev/davinci7 \
--device=/dev/davinci_manager:/dev/davinci_manager \
--device=/dev/hisi_hdc:/dev/hisi_hdc \
--entrypoint=bash \
quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-910b
- If the model weights have already been downloaded to a shared directory, use -v to mount the model path into the
container, for example: -v /path/to/models:/models.
- Replace ${NAME} with your own container name, or remove --name to use the default name.
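The long run of --device lines in the commands above follows a regular pattern, so it can be generated instead of typed. The sketch below shows one way to do that; `device_flags` is a hypothetical helper, and the device count (16 or 8) is taken from the two commands above.

```shell
# device_flags COUNT: print one --device flag per NPU, davinci0 .. davinci(COUNT-1).
device_flags() {
  i=0
  while [ "$i" -lt "$1" ]; do
    printf '%s ' "--device=/dev/davinci$i:/dev/davinci$i"
    i=$((i + 1))
  done
}

# Show the flags for the 8-device case.
device_flags 8
echo
```

The generated flags contain no spaces, so they can be spliced into the command with an unquoted `$(device_flags 16)` in place of the individual --device=/dev/davinciN lines. The davinci_manager and hisi_hdc devices still need their own flags.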
Online service deployment
Single-node online deployment
Single-node deployment completes both prefill and decode within the same node (PD mixed mode), suitable for scenarios
with limited hardware resources. This scenario is already covered in the best practice. For the complete, optimized
deployment commands and benchmark data, see
DeepSeek-R1 Best Practice — W4A8 8P PD Mixed On A3.
Multi-node PD disaggregation deployment
PD disaggregation splits the prefill and decode stages onto separate nodes, reducing interference and improving
throughput for high-concurrency scenarios. This scenario is already covered in the best practice. For the complete, optimized
deployment commands and benchmark data, see
DeepSeek-R1 Best Practice — W4A8 16P PD Disaggregation On A3.
Functional verification
After the service is started, you can invoke the model by sending a prompt:
# ============================================================
# Before running, update the following variables:
# HOST: the server host address (e.g., localhost)
# PORT: the server port number (e.g., 6689)
# ============================================================
curl http://${HOST}:${PORT}/generate \
-H "Content-Type: application/json" \
-d '{
"text": "What is the capital of France?",
"sampling_params": {
"max_new_tokens": 64,
"temperature": 0
}
}'
Expected result: an HTTP 200 response with the generated text containing “Paris”.
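The expected result can also be checked by a script. The sketch below sends the same request and tests the status code and response body; `check_generate_response` and `verify_service` are hypothetical helper names, and the check only looks for the word "Paris" anywhere in the body, without parsing the JSON.

```shell
# check_generate_response STATUS BODY
# Succeeds only for HTTP 200 with a body that mentions "Paris".
check_generate_response() {
  if [ "$1" != "200" ]; then
    echo "unexpected HTTP status: $1" >&2
    return 1
  fi
  if ! printf '%s' "$2" | grep -q "Paris"; then
    echo "response does not mention Paris" >&2
    return 1
  fi
}

# verify_service HOST PORT
# Sends the request from this section and checks the result.
verify_service() {
  body_file=$(mktemp)
  # -o saves the body to a file; -w prints only the status code to stdout.
  status=$(curl -s -o "$body_file" -w '%{http_code}' "http://$1:$2/generate" \
    -H "Content-Type: application/json" \
    -d '{"text": "What is the capital of France?", "sampling_params": {"max_new_tokens": 64, "temperature": 0}}')
  body=$(cat "$body_file")
  rm -f "$body_file"
  check_generate_response "$status" "$body"
}
```

Run `verify_service "$HOST" "$PORT"` once the server is up. A non-zero exit status means the check failed.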
Once the server prints "The server is fired up and ready to roll!" in the logs, it is ready to accept requests. For
more testing examples (Health Check, Generate, Chat Completions, and port usage guidance), see Testing the Service.
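When the server output is redirected to a file, a script can wait for the readiness message. The sketch below polls a log file once per second; `wait_for_ready`, the log path, and the 600-second default timeout are assumptions for illustration, not part of SGLang.

```shell
# wait_for_ready LOG_FILE [TIMEOUT_SECONDS]
# Succeeds once the readiness message appears in LOG_FILE, fails on timeout.
wait_for_ready() {
  log_file=$1
  timeout_s=${2:-600}
  waited=0
  while [ "$waited" -lt "$timeout_s" ]; do
    # -F matches the message as a fixed string, not a regular expression.
    if grep -qF "The server is fired up and ready to roll!" "$log_file" 2>/dev/null; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}
```

For example, `wait_for_ready /path/to/server.log 900 && echo ready` blocks until the message appears or 15 minutes pass.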
Accuracy evaluation
For accuracy evaluation methods and datasets, see Accuracy Evaluation on Ascend NPU.
For performance data and benchmark commands, see Performance Testing on Ascend NPU.
Best practices
Best practice configuration reference
For complete optimal configurations with deployment scripts and benchmark commands, see the
DeepSeek-R1 Best Practice page.
For the full list of supported features, see Supported features. For detailed optimization
guidance, see Optimization on Ascend NPU.
FAQ
For common environment, installation, and general parameter issues, please refer to the Ascend NPU FAQ.
This section only covers model-specific issues.