Support Features on Ascend NPU

This section describes the basic functions and features supported by the Ascend NPU. If you encounter issues or have any questions, please open an issue.

For the meaning and usage of each parameter, see Server Arguments.

Model and tokenizer

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --model-path, --model | None | Type: str | A2, A3 |
| --tokenizer-path | None | Type: str | A2, A3 |
| --tokenizer-mode | auto | auto, slow | A2, A3 |
| --tokenizer-worker-num | 1 | Type: int | A2, A3 |
| --skip-tokenizer-init | False | bool flag (set to enable) | A2, A3 |
| --load-format | auto | auto, safetensors | A2, A3 |
| --model-loader-extra-config | {} | Type: str | A2, A3 |
| --trust-remote-code | False | bool flag (set to enable) | A2, A3 |
| --context-length | None | Type: int | A2, A3 |
| --is-embedding | False | bool flag (set to enable) | A2, A3 |
| --enable-multimodal | None | bool flag (set to enable) | A2, A3 |
| --revision | None | Type: str | A2, A3 |
| --model-impl | auto | auto, sglang, transformers | A2, A3 |

HTTP server

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --host | 127.0.0.1 | Type: str | A2, A3 |
| --port | 30000 | Type: int | A2, A3 |
| --skip-server-warmup | False | bool flag (set to enable) | A2, A3 |
| --warmups | None | Type: str | A2, A3 |
| --nccl-port | None | Type: int | A2, A3 |
| --fastapi-root-path | None | Type: str | A2, A3 |
| --grpc-mode | False | bool flag (set to enable) | A2, A3 |

Quantization and data type

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --dtype | auto | auto, float16, bfloat16 | A2, A3 |
| --quantization | None | modelslim | A2, A3 |
| --quantization-param-path | None | Type: str | Special for GPU |
| --kv-cache-dtype | auto | auto | A2, A3 |
| --enable-fp32-lm-head | False | bool flag (set to enable) | A2, A3 |
| --modelopt-quant | None | Type: str | Special for GPU |
| --modelopt-checkpoint-restore-path | None | Type: str | Special for GPU |
| --modelopt-checkpoint-save-path | None | Type: str | Special for GPU |
| --modelopt-export-path | None | Type: str | Special for GPU |
| --quantize-and-serve | False | bool flag (set to enable) | Special for GPU |
| --rl-quant-profile | None | Type: str | Special for GPU |

Memory and scheduling

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --mem-fraction-static | None | Type: float | A2, A3 |
| --max-running-requests | None | Type: int | A2, A3 |
| --prefill-max-requests | None | Type: int | A2, A3 |
| --max-queued-requests | None | Type: int | A2, A3 |
| --max-total-tokens | None | Type: int | A2, A3 |
| --chunked-prefill-size | None | Type: int | A2, A3 |
| --max-prefill-tokens | 16384 | Type: int | A2, A3 |
| --schedule-policy | fcfs | lpm, fcfs | A2, A3 |
| --enable-priority-scheduling | False | bool flag (set to enable) | A2, A3 |
| --schedule-low-priority-values-first | False | bool flag (set to enable) | A2, A3 |
| --priority-scheduling-preemption-threshold | 10 | Type: int | A2, A3 |
| --schedule-conservativeness | 1.0 | Type: float | A2, A3 |
| --page-size | 128 | Type: int | A2, A3 |
| --swa-full-tokens-ratio | 0.8 | Type: float | A2, A3 |
| --disable-hybrid-swa-memory | False | bool flag (set to enable) | A2, A3 |
| --abort-on-priority-when-disabled | False | bool flag (set to enable) | A2, A3 |
| --enable-dynamic-chunking | False | bool flag (set to enable) | A2, A3 |

Runtime options

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --device | None | Type: str | A2, A3 |
| --tensor-parallel-size, --tp-size | 1 | Type: int | A2, A3 |
| --pipeline-parallel-size, --pp-size | 1 | Type: int | A2, A3 |
| --pp-max-micro-batch-size | None | Type: int | A2, A3 |
| --pp-async-batch-depth | None | Type: int | A2, A3 |
| --stream-interval | 1 | Type: int | A2, A3 |
| --stream-output | False | bool flag (set to enable) | A2, A3 |
| --random-seed | None | Type: int | A2, A3 |
| --constrained-json-whitespace-pattern | None | Type: str | A2, A3 |
| --constrained-json-disable-any-whitespace | False | bool flag (set to enable) | A2, A3 |
| --watchdog-timeout | 300 | Type: float | A2, A3 |
| --soft-watchdog-timeout | 300 | Type: float | A2, A3 |
| --dist-timeout | None | Type: int | A2, A3 |
| --base-gpu-id | 0 | Type: int | A2, A3 |
| --gpu-id-step | 1 | Type: int | A2, A3 |
| --sleep-on-idle | False | bool flag (set to enable) | A2, A3 |
| --custom-sigquit-handler | None | Optional[Callable] | A2, A3 |
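
As a rough illustration of how flags from the tables above combine, the sketch below assembles a launch command as an argument list. It assumes the server is started with `python -m sglang.launch_server` and that `npu` is an accepted value for `--device`; neither is stated on this page, and the model path is a placeholder.

```python
import shlex


def build_launch_command(model_path, tp_size=1, host="127.0.0.1", port=30000):
    """Assemble an argv list from flags listed in the tables above."""
    return [
        "python", "-m", "sglang.launch_server",  # assumed entry point
        "--model-path", model_path,
        "--device", "npu",  # assumed device name for Ascend
        "--tp-size", str(tp_size),
        "--host", host,
        "--port", str(port),
    ]


if __name__ == "__main__":
    # "/path/to/model" is a placeholder, not a real checkpoint.
    print(shlex.join(build_launch_command("/path/to/model", tp_size=2)))
```

Keeping the command as a list rather than a single string avoids quoting problems if it is later passed to `subprocess.run`.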

Logging

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --log-level | info | Type: str | A2, A3 |
| --log-level-http | None | Type: str | A2, A3 |
| --log-requests | False | bool flag (set to enable) | A2, A3 |
| --log-requests-level | 2 | 0, 1, 2, 3 | A2, A3 |
| --log-requests-format | text | text, json | A2, A3 |
| --crash-dump-folder | None | Type: str | A2, A3 |
| --enable-metrics | False | bool flag (set to enable) | A2, A3 |
| --enable-metrics-for-all-schedulers | False | bool flag (set to enable) | A2, A3 |
| --tokenizer-metrics-custom-labels-header | x-custom-labels | Type: str | A2, A3 |
| --tokenizer-metrics-allowed-custom-labels | None | List[str] | A2, A3 |
| --bucket-time-to-first-token | None | List[float] | A2, A3 |
| --bucket-inter-token-latency | None | List[float] | A2, A3 |
| --bucket-e2e-request-latency | None | List[float] | A2, A3 |
| --collect-tokens-histogram | False | bool flag (set to enable) | A2, A3 |
| --prompt-tokens-buckets | None | List[str] | A2, A3 |
| --generation-tokens-buckets | None | List[str] | A2, A3 |
| --gc-warning-threshold-secs | 0.0 | Type: float | A2, A3 |
| --decode-log-interval | 40 | Type: int | A2, A3 |
| --enable-request-time-stats-logging | False | bool flag (set to enable) | A2, A3 |
| --kv-events-config | None | Type: str | Special for GPU |
| --enable-trace | False | bool flag (set to enable) | A2, A3 |
| --oltp-traces-endpoint | localhost:4317 | Type: str | A2, A3 |

RequestMetricsExporter configuration

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --export-metrics-to-file | False | bool flag (set to enable) | A2, A3 |
| --export-metrics-to-file-dir | None | Type: str | A2, A3 |

Data parallelism

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --data-parallel-size, --dp-size | 1 | Type: int | A2, A3 |
| --load-balance-method | round_robin | round_robin, total_requests, total_tokens | A2, A3 |
| --prefill-round-robin-balance | False | bool flag (set to enable) | A2, A3 |

Multi-node distributed serving

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --dist-init-addr, --nccl-init-addr | None | Type: str | A2, A3 |
| --nnodes | 1 | Type: int | A2, A3 |
| --node-rank | 0 | Type: int | A2, A3 |
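
The three flags above are typically set together. The sketch below generates the per-node flag lists, under the assumption that every node receives the same `--dist-init-addr` and `--nnodes` and differs only in `--node-rank`. The `host:port` address format and the address itself (a documentation-only IP) are assumptions, not values taken from this page.

```python
def multi_node_args(dist_init_addr, nnodes):
    """Return one list of extra launch flags per node, indexed by node rank."""
    if nnodes < 1:
        raise ValueError("nnodes must be at least 1")
    return [
        [
            "--dist-init-addr", dist_init_addr,
            "--nnodes", str(nnodes),
            "--node-rank", str(rank),  # the only flag that differs per node
        ]
        for rank in range(nnodes)
    ]


if __name__ == "__main__":
    # 192.0.2.10:5000 is a placeholder address for illustration only.
    for rank, flags in enumerate(multi_node_args("192.0.2.10:5000", 2)):
        print(f"node {rank}: {' '.join(flags)}")
```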

Model override args

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --json-model-override-args | {} | Type: str | A2, A3 |
| --preferred-sampling-params | None | Type: str | A2, A3 |

LoRA

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-lora | False | bool flag (set to enable) | A2, A3 |
| --max-lora-rank | None | Type: int | A2, A3 |
| --lora-target-modules | None | all | A2, A3 |
| --lora-paths | None | Type: List[str] / JSON objects | A2, A3 |
| --max-loras-per-batch | 8 | Type: int | A2, A3 |
| --max-loaded-loras | None | Type: int | A2, A3 |
| --lora-eviction-policy | lru | lru, fifo | A2, A3 |
| --lora-backend | triton | triton | A2, A3 |
| --max-lora-chunk-size | 16 | 16, 32, 64, 128 | Special for GPU |

Kernel Backends (Attention, Sampling, Grammar, GEMM)

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --attention-backend | None | ascend | A2, A3 |
| --prefill-attention-backend | None | ascend | A2, A3 |
| --decode-attention-backend | None | ascend | A2, A3 |
| --sampling-backend | None | pytorch, ascend | A2, A3 |
| --grammar-backend | None | xgrammar | A2, A3 |
| --mm-attention-backend | None | ascend_attn | A2, A3 |
| --nsa-prefill-backend | flashmla_sparse | flashmla_sparse, flashmla_decode, fa3, tilelang, aiter | Special for GPU |
| --nsa-decode-backend | fa3 | flashmla_prefill, flashmla_kv, fa3, tilelang, aiter | Special for GPU |
| --fp8-gemm-backend | auto | auto, deep_gemm, flashinfer_trtllm, cutlass, triton, aiter | Special for GPU |
| --disable-flashinfer-autotune | False | bool flag (set to enable) | Special for GPU |
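
Because the backend options listed for Ascend are narrower than a generic launch script may assume, a small pre-flight check can be useful. The following is a minimal sketch: the option sets are copied from the table above, while the function name and message wording are hypothetical and not part of any SGLang API.

```python
# Option sets copied from the rows marked "A2, A3" in the table above.
ASCEND_BACKEND_OPTIONS = {
    "--attention-backend": {"ascend"},
    "--prefill-attention-backend": {"ascend"},
    "--decode-attention-backend": {"ascend"},
    "--sampling-backend": {"pytorch", "ascend"},
    "--grammar-backend": {"xgrammar"},
    "--mm-attention-backend": {"ascend_attn"},
}

# Flags the table marks "Special for GPU".
GPU_ONLY_FLAGS = {
    "--nsa-prefill-backend",
    "--nsa-decode-backend",
    "--fp8-gemm-backend",
    "--disable-flashinfer-autotune",
}


def find_problems(args):
    """Return a list of messages for a dict mapping flag -> value."""
    problems = []
    for flag, value in args.items():
        if flag in GPU_ONLY_FLAGS:
            problems.append(f"{flag} is marked 'Special for GPU'")
        elif flag in ASCEND_BACKEND_OPTIONS:
            allowed = ASCEND_BACKEND_OPTIONS[flag]
            if value not in allowed:
                listed = ", ".join(sorted(allowed))
                problems.append(f"{flag}={value} is not listed (listed: {listed})")
    return problems


if __name__ == "__main__":
    for message in find_problems({"--sampling-backend": "flashinfer"}):
        print(message)
```

Flags that appear in neither set are passed through without comment, since this sketch only encodes the one table.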

Speculative decoding

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --speculative-algorithm | None | EAGLE3, NEXTN | A2, A3 |
| --speculative-draft-model-path, --speculative-draft-model | None | Type: str | A2, A3 |
| --speculative-draft-model-revision | None | Type: str | A2, A3 |
| --speculative-draft-load-format | None | auto | A2, A3 |
| --speculative-num-steps | None | Type: int | A2, A3 |
| --speculative-eagle-topk | None | Type: int | A2, A3 |
| --speculative-num-draft-tokens | None | Type: int | A2, A3 |
| --speculative-accept-threshold-single | 1.0 | Type: float | Special for GPU |
| --speculative-accept-threshold-acc | 1.0 | Type: float | Special for GPU |
| --speculative-token-map | None | Type: str | A2, A3 |
| --speculative-attention-mode | prefill | prefill, decode | A2, A3 |
| --speculative-moe-runner-backend | None | auto | A2, A3 |
| --speculative-moe-a2a-backend | None | ascend_fuseep | A2, A3 |
| --speculative-draft-attention-backend | None | ascend | A2, A3 |
| --speculative-draft-model-quantization | None | unquant | A2, A3 |
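
To show how the speculative decoding flags fit together, here is a minimal sketch that builds the flag list and rejects algorithms not listed in the table above. The helper name, the draft model path, and the numeric values in the usage example are illustrative assumptions, not recommended settings.

```python
# Algorithms listed in the Options column of the table above.
ASCEND_SPEC_ALGORITHMS = ("EAGLE3", "NEXTN")


def speculative_args(algorithm, draft_model_path=None, num_steps=None,
                     eagle_topk=None, num_draft_tokens=None):
    """Build speculative decoding flags, omitting options left as None."""
    if algorithm not in ASCEND_SPEC_ALGORITHMS:
        raise ValueError(f"{algorithm!r} is not listed for Ascend NPU")
    args = ["--speculative-algorithm", algorithm]
    optional = {
        "--speculative-draft-model-path": draft_model_path,
        "--speculative-num-steps": num_steps,
        "--speculative-eagle-topk": eagle_topk,
        "--speculative-num-draft-tokens": num_draft_tokens,
    }
    for flag, value in optional.items():
        if value is not None:
            args += [flag, str(value)]
    return args


if __name__ == "__main__":
    # The path and the numbers below are placeholders for illustration.
    print(" ".join(speculative_args(
        "EAGLE3",
        draft_model_path="/path/to/draft-model",
        num_steps=3,
        eagle_topk=1,
        num_draft_tokens=4,
    )))
```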

Ngram speculative decoding

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --speculative-ngram-min-match-window-size | 1 | Type: int | Experimental |
| --speculative-ngram-max-match-window-size | 12 | Type: int | Experimental |
| --speculative-ngram-min-bfs-breadth | 1 | Type: int | Experimental |
| --speculative-ngram-max-bfs-breadth | 10 | Type: int | Experimental |
| --speculative-ngram-match-type | BFS | BFS, PROB | Experimental |
| --speculative-ngram-branch-length | 18 | Type: int | Experimental |
| --speculative-ngram-capacity | 10000000 | Type: int | Experimental |

Expert parallelism

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --expert-parallel-size, --ep-size, --ep | 1 | Type: int | A2, A3 |
| --moe-a2a-backend | none | none, deepep, ascend_fuseep | A2, A3 |
| --moe-runner-backend | auto | auto, triton | A2, A3 |
| --flashinfer-mxfp4-moe-precision | default | default, bf16 | Special for GPU |
| --enable-flashinfer-allreduce-fusion | False | bool flag (set to enable) | Special for GPU |
| --deepep-mode | auto | normal, low_latency, auto | A2, A3 |
| --deepep-config | None | Type: str | Special for GPU |
| --ep-num-redundant-experts | 0 | Type: int | A2, A3 |
| --ep-dispatch-algorithm | None | Type: str | A2, A3 |
| --init-expert-location | trivial | Type: str | A2, A3 |
| --enable-eplb | False | bool flag (set to enable) | A2, A3 |
| --eplb-algorithm | auto | Type: str | A2, A3 |
| --eplb-rebalance-layers-per-chunk | None | Type: int | A2, A3 |
| --eplb-min-rebalancing-utilization-threshold | 1.0 | Type: float | A2, A3 |
| --expert-distribution-recorder-mode | None | Type: str | A2, A3 |
| --expert-distribution-recorder-buffer-size | None | Type: int | A2, A3 |
| --enable-expert-distribution-metrics | False | bool flag (set to enable) | A2, A3 |
| --moe-dense-tp-size | None | Type: int | A2, A3 |
| --elastic-ep-backend | None | none, mooncake | Special for GPU |
| --mooncake-ib-device | None | Type: str | Special for GPU |

Mamba Cache

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --max-mamba-cache-size | None | Type: int | A2, A3 |
| --mamba-ssm-dtype | float32 | float32, bfloat16 | A2, A3 |
| --mamba-full-memory-ratio | 0.2 | Type: float | A2, A3 |
| --mamba-scheduler-strategy | auto | auto, no_buffer, extra_buffer | A2, A3 |
| --mamba-track-interval | 256 | Type: int | A2, A3 |

Hierarchical cache

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-hierarchical-cache | False | bool flag (set to enable) | A2, A3 |
| --hicache-ratio | 2.0 | Type: float | A2, A3 |
| --hicache-size | 0 | Type: int | A2, A3 |
| --hicache-write-policy | write_through | write_back, write_through, write_through_selective | A2, A3 |
| --radix-eviction-policy | lru | lru, lfu | A2, A3 |
| --hicache-io-backend | kernel | kernel_ascend, direct | A2, A3 |
| --hicache-mem-layout | layer_first | page_first_direct, page_first_kv_split | A2, A3 |
| --hicache-storage-backend | None | file | A2, A3 |
| --hicache-storage-prefetch-policy | best_effort | best_effort, wait_complete, timeout | Special for GPU |
| --hicache-storage-backend-extra-config | None | Type: str | Special for GPU |
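
Note that the defaults listed for --hicache-io-backend (kernel) and --hicache-mem-layout (layer_first) do not appear among the options listed for Ascend in the table above. The sketch below therefore passes both explicitly and checks each value against the listed options. The helper name and the particular combination chosen are assumptions; whether a given combination works on a specific build is not covered on this page.

```python
# Option sets copied from the table above.
HICACHE_OPTIONS = {
    "--hicache-write-policy": {"write_back", "write_through", "write_through_selective"},
    "--hicache-io-backend": {"kernel_ascend", "direct"},
    "--hicache-mem-layout": {"page_first_direct", "page_first_kv_split"},
}


def hicache_args(ratio=2.0, write_policy="write_through",
                 io_backend="kernel_ascend", mem_layout="page_first_direct"):
    """Build hierarchical cache flags, validating the choice-type options."""
    chosen = {
        "--hicache-write-policy": write_policy,
        "--hicache-io-backend": io_backend,
        "--hicache-mem-layout": mem_layout,
    }
    args = ["--enable-hierarchical-cache", "--hicache-ratio", str(ratio)]
    for flag, value in chosen.items():
        if value not in HICACHE_OPTIONS[flag]:
            raise ValueError(f"{value!r} is not a listed option for {flag}")
        args += [flag, value]
    return args


if __name__ == "__main__":
    print(" ".join(hicache_args()))
```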

LMCache

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-lmcache | False | bool flag (set to enable) | Special for GPU |

Offloading

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --cpu-offload-gb | 0 | Type: int | A2, A3 |
| --offload-group-size | -1 | Type: int | A2, A3 |
| --offload-num-in-group | 1 | Type: int | A2, A3 |
| --offload-prefetch-step | 1 | Type: int | A2, A3 |
| --offload-mode | cpu | Type: str | A2, A3 |

Args for multi-item scoring

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --multi-item-scoring-delimiter | None | Type: int | A2, A3 |

Optimization/debug options

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --disable-radix-cache | False | bool flag (set to enable) | A2, A3 |
| --cuda-graph-max-bs | None | Type: int | A2, A3 |
| --cuda-graph-bs | None | List[int] | A2, A3 |
| --disable-cuda-graph | False | bool flag (set to enable) | A2, A3 |
| --disable-cuda-graph-padding | False | bool flag (set to enable) | A2, A3 |
| --enable-profile-cuda-graph | False | bool flag (set to enable) | A2, A3 |
| --enable-cudagraph-gc | False | bool flag (set to enable) | A2, A3 |
| --enable-nccl-nvls | False | bool flag (set to enable) | Special for GPU |
| --enable-symm-mem | False | bool flag (set to enable) | Special for GPU |
| --disable-flashinfer-cutlass-moe-fp4-allgather | False | bool flag (set to enable) | Special for GPU |
| --enable-tokenizer-batch-encode | False | bool flag (set to enable) | A2, A3 |
| --disable-tokenizer-batch-encode | False | bool flag (set to enable) | A2, A3 |
| --disable-outlines-disk-cache | False | bool flag (set to enable) | A2, A3 |
| --disable-custom-all-reduce | False | bool flag (set to enable) | A2, A3 |
| --enable-mscclpp | False | bool flag (set to enable) | Special for GPU |
| --enable-torch-symm-mem | False | bool flag (set to enable) | Special for GPU |
| --disable-overlap-schedule | False | bool flag (set to enable) | A2, A3 |
| --enable-mixed-chunk | False | bool flag (set to enable) | A2, A3 |
| --enable-dp-attention | False | bool flag (set to enable) | A2, A3 |
| --enable-dp-lm-head | False | bool flag (set to enable) | A2, A3 |
| --enable-two-batch-overlap | False | bool flag (set to enable) | Planned |
| --enable-single-batch-overlap | False | bool flag (set to enable) | A2, A3 |
| --tbo-token-distribution-threshold | 0.48 | Type: float | Planned |
| --enable-torch-compile | False | bool flag (set to enable) | A2, A3 |
| --enable-torch-compile-debug-mode | False | bool flag (set to enable) | A2, A3 |
| --enable-piecewise-cuda-graph | False | bool flag (set to enable) | A2, A3 |
| --piecewise-cuda-graph-tokens | None | Type: JSON list | A2, A3 |
| --piecewise-cuda-graph-compiler | eager | eager, inductor | A2, A3 |
| --torch-compile-max-bs | 32 | Type: int | A2, A3 |
| --piecewise-cuda-graph-max-tokens | 4096 | Type: int | A2, A3 |
| --torchao-config | "" (empty string) | Type: str | Special for GPU |
| --enable-nan-detection | False | bool flag (set to enable) | A2, A3 |
| --enable-p2p-check | False | bool flag (set to enable) | Special for GPU |
| --triton-attention-reduce-in-fp32 | False | bool flag (set to enable) | Special for GPU |
| --triton-attention-num-kv-splits | 8 | Type: int | Special for GPU |
| --triton-attention-split-tile-size | None | Type: int | Special for GPU |
| --delete-ckpt-after-loading | False | bool flag (set to enable) | A2, A3 |
| --enable-memory-saver | False | bool flag (set to enable) | A2, A3 |
| --enable-weights-cpu-backup | False | bool flag (set to enable) | A2, A3 |
| --enable-draft-weights-cpu-backup | False | bool flag (set to enable) | A2, A3 |
| --allow-auto-truncate | False | bool flag (set to enable) | A2, A3 |
| --enable-custom-logit-processor | False | bool flag (set to enable) | A2, A3 |
| --flashinfer-mla-disable-ragged | False | bool flag (set to enable) | Special for GPU |
| --disable-shared-experts-fusion | False | bool flag (set to enable) | A2, A3 |
| --disable-chunked-prefix-cache | False | bool flag (set to enable) | A2, A3 |
| --disable-fast-image-processor | False | bool flag (set to enable) | A2, A3 |
| --keep-mm-feature-on-device | False | bool flag (set to enable) | A2, A3 |
| --enable-return-hidden-states | False | bool flag (set to enable) | A2, A3 |
| --enable-return-routed-experts | False | bool flag (set to enable) | A2, A3 |
| --scheduler-recv-interval | 1 | Type: int | A2, A3 |
| --numa-node | None | List[int] | A2, A3 |
| --rl-on-policy-target | None | fsdp | Planned |
| --enable-layerwise-nvtx-marker | False | bool flag (set to enable) | Special for GPU |
| --enable-attn-tp-input-scattered | False | bool flag (set to enable) | Experimental |
| --enable-nsa-prefill-context-parallel | False | bool flag (set to enable) | A2, A3 |
| --enable-fused-qk-norm-rope | False | bool flag (set to enable) | Special for GPU |

Dynamic batch tokenizer

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-dynamic-batch-tokenizer | False | bool flag (set to enable) | A2, A3 |
| --dynamic-batch-tokenizer-batch-size | 32 | Type: int | A2, A3 |
| --dynamic-batch-tokenizer-batch-timeout | 0.002 | Type: float | A2, A3 |

Debug tensor dumps

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --debug-tensor-dump-output-folder | None | Type: str | A2, A3 |
| --debug-tensor-dump-layers | None | List[int] | A2, A3 |
| --debug-tensor-dump-input-file | None | Type: str | A2, A3 |

PD disaggregation

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --disaggregation-mode | null | null, prefill, decode | A2, A3 |
| --disaggregation-transfer-backend | mooncake | ascend | A2, A3 |
| --disaggregation-bootstrap-port | 8998 | Type: int | A2, A3 |
| --disaggregation-decode-tp | None | Type: int | A2, A3 |
| --disaggregation-decode-dp | None | Type: int | A2, A3 |
| --disaggregation-ib-device | None | Type: str | Special for GPU |
| --disaggregation-decode-enable-offload-kvcache | False | bool flag (set to enable) | A2, A3 |
| --disaggregation-decode-enable-fake-auto | False | bool flag (set to enable) | A2, A3 |
| --num-reserved-decode-tokens | 512 | Type: int | A2, A3 |
| --disaggregation-decode-polling-interval | 1 | Type: int | A2, A3 |
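
The table above lists mooncake as the default transfer backend but ascend as the only option for Ascend, so the backend most likely has to be set explicitly. The sketch below builds the flags for each role under that reading. The helper name is hypothetical, and the bootstrap port is passed only when supplied, since this page does not say which role requires it.

```python
def pd_disaggregation_args(mode, bootstrap_port=None):
    """Build PD disaggregation flags for a 'prefill' or 'decode' instance."""
    if mode not in ("prefill", "decode"):
        raise ValueError("mode must be 'prefill' or 'decode'")
    args = [
        "--disaggregation-mode", mode,
        # "ascend" is the only transfer backend listed for Ascend NPU.
        "--disaggregation-transfer-backend", "ascend",
    ]
    if bootstrap_port is not None:
        args += ["--disaggregation-bootstrap-port", str(bootstrap_port)]
    return args


if __name__ == "__main__":
    print("prefill:", " ".join(pd_disaggregation_args("prefill", bootstrap_port=8998)))
    print("decode: ", " ".join(pd_disaggregation_args("decode")))
```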

Encode prefill disaggregation

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --encoder-only | False | bool flag (set to enable) | A2, A3 |
| --language-only | False | bool flag (set to enable) | A2, A3 |
| --encoder-transfer-backend | zmq_to_scheduler | zmq_to_scheduler, zmq_to_tokenizer, mooncake | A2, A3 |
| --encoder-urls | [] | List[str] | A2, A3 |

Custom weight loader

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --custom-weight-loader | None | List[str] | A2, A3 |
| --weight-loader-disable-mmap | False | bool flag (set to enable) | A2, A3 |
| --remote-instance-weight-loader-seed-instance-ip | None | Type: str | A2, A3 |
| --remote-instance-weight-loader-seed-instance-service-port | None | Type: int | A2, A3 |
| --remote-instance-weight-loader-send-weights-group-ports | None | Type: JSON list | A2, A3 |
| --remote-instance-weight-loader-backend | nccl | transfer_engine, nccl | A2, A3 |
| --remote-instance-weight-loader-start-seed-via-transfer-engine | False | bool flag (set to enable) | Special for GPU |

For PD-Multiplexing

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-pdmux | False | bool flag (set to enable) | Special for GPU |
| --pdmux-config-path | None | Type: str | Special for GPU |
| --sm-group-num | 8 | Type: int | Special for GPU |

For Multi-Modal

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --mm-max-concurrent-calls | 32 | Type: int | A2, A3 |
| --mm-per-request-timeout | 10.0 | Type: float | A2, A3 |
| --enable-broadcast-mm-inputs-process | False | bool flag (set to enable) | A2, A3 |
| --mm-process-config | None | Type: JSON / Dict | A2, A3 |
| --mm-enable-dp-encoder | False | bool flag (set to enable) | A2, A3 |
| --limit-mm-data-per-request | None | Type: JSON / Dict | A2, A3 |

For checkpoint decryption

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --decrypted-config-file | None | Type: str | A2, A3 |
| --decrypted-draft-config-file | None | Type: str | A2, A3 |
| --enable-prefix-mm-cache | False | bool flag (set to enable) | A2, A3 |

For deterministic inference

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-deterministic-inference | False | bool flag (set to enable) | Planned |

For registering hooks

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --forward-hooks | None | Type: JSON list | A2, A3 |

Configuration file support

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --config | None | Type: str | A2, A3 |

Other Params

The following parameters are not supported because the third-party components they depend on, such as KTransformers and checkpoint-engine, are not compatible with the NPU.

| Argument | Defaults | Options |
| --- | --- | --- |
| --checkpoint-engine-wait-weights-before-ready | False | bool flag (set to enable) |
| --kt-weight-path | None | Type: str |
| --kt-method | AMXINT4 | Type: str |
| --kt-cpuinfer | None | Type: int |
| --kt-threadpool-count | 2 | Type: int |
| --kt-num-gpu-experts | None | Type: int |
| --kt-max-deferred-experts-per-token | None | Type: int |

The following parameters have known functional deficiencies in the community implementation:

| Argument | Defaults | Options |
| --- | --- | --- |
| --enable-double-sparsity | False | bool flag (set to enable) |
| --ds-channel-config-path | None | Type: str |
| --ds-heavy-channel-num | 32 | Type: int |
| --ds-heavy-token-num | 256 | Type: int |
| --ds-heavy-channel-type | qk | Type: str |
| --ds-sparse-decode-threshold | 4096 | Type: int |
| --tool-server | None | Type: str |