Supported Features on Ascend NPU

This section describes the basic functions and features supported on Ascend NPU. If you encounter issues or have questions, please open an issue.

For the meaning and usage of each parameter, see Server Arguments.

Model and tokenizer

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --model-path, --model | None | Type: str |  |  |
| --tokenizer-path | None | Type: str |  |  |
| --tokenizer-mode | auto | auto, slow |  |  |
| --tokenizer-worker-num | 1 | Type: int |  |  |
| --skip-tokenizer-init | False | bool flag (set to enable) |  |  |
| --load-format | auto | auto, safetensors |  |  |
| --model-loader-extra-config | {} | Type: str |  |  |
| --trust-remote-code | False | bool flag (set to enable) |  |  |
| --context-length | None | Type: int |  |  |
| --is-embedding | False | bool flag (set to enable) |  |  |
| --enable-multimodal | None | bool flag (set to enable) |  |  |
| --revision | None | Type: str | × | × |
| --model-impl | auto | auto, sglang, transformers |  |  |

HTTP server

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --host | 127.0.0.1 | Type: str |  |  |
| --port | 30000 | Type: int |  |  |
| --skip-server-warmup | False | bool flag (set to enable) |  |  |
| --warmups | None | Type: str |  |  |
| --nccl-port | None | Type: int |  |  |
| --fastapi-root-path | None | Type: str | × | × |
| --grpc-mode | False | bool flag (set to enable) | × | × |
| --checkpoint-engine-wait-weights-before-ready | False | bool flag (set to enable) | × | × |
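
As a rough illustration of how the model and HTTP server arguments combine, the sketch below assembles a launch command in Python. It assumes the server is started through the `python -m sglang.launch_server` entry point, which this page does not state, and the model path is a placeholder. `build_launch_command` is a hypothetical helper, not part of SGLang.

```python
import shlex


def build_launch_command(model_path, host="127.0.0.1", port=30000, trust_remote_code=False):
    """Assemble a launch command from arguments listed in the tables above."""
    argv = [
        "python", "-m", "sglang.launch_server",  # assumed entry point, not stated on this page
        "--model-path", model_path,
        "--host", host,          # default taken from the HTTP server table
        "--port", str(port),     # default taken from the HTTP server table
    ]
    if trust_remote_code:
        argv.append("--trust-remote-code")  # bool flag: its presence enables it
    return shlex.join(argv)


print(build_launch_command("/path/to/model", trust_remote_code=True))
```

Building the command as a list and quoting it with `shlex.join` keeps paths that contain spaces intact when the string is pasted into a shell.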

Quantization and data type

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --dtype | auto | auto, float16, bfloat16 |  |  |
| --quantization | None | modelslim |  |  |
| --quantization-param-path | None | Type: str | × | × |
| --kv-cache-dtype | auto | auto |  |  |
| --enable-fp32-lm-head | False | bool flag (set to enable) | × | × |
| --modelopt-quant | None | Type: str | × | × |
| --modelopt-checkpoint-restore-path | None | Type: str | × | × |
| --modelopt-checkpoint-save-path | None | Type: str | × | × |
| --modelopt-export-path | None | Type: str | × | × |
| --quantize-and-serve | False | bool flag (set to enable) | × | × |
| --rl-quant-profile | None | Type: str | × | × |
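
The Options column can be turned into a small pre-launch check. The sketch below assumes that the Options column lists the values usable on Ascend NPU, which is a reading of the table rather than something this page states. `ASCEND_CHOICES` and `check_choice` are hypothetical names used only for illustration.

```python
# Values copied from the Options column of the quantization and data type table.
ASCEND_CHOICES = {
    "--dtype": ("auto", "float16", "bfloat16"),
    "--quantization": ("modelslim",),
    "--kv-cache-dtype": ("auto",),
}


def check_choice(flag, value):
    """Return True if value is listed for flag; raise KeyError for flags not in the map."""
    return value in ASCEND_CHOICES[flag]


print(check_choice("--dtype", "bfloat16"))
print(check_choice("--quantization", "awq"))
```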

Memory and scheduling

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --mem-fraction-static | None | Type: float |  |  |
| --max-running-requests | None | Type: int |  |  |
| --prefill-max-requests | None | Type: int |  |  |
| --max-queued-requests | None | Type: int |  |  |
| --max-total-tokens | None | Type: int |  |  |
| --chunked-prefill-size | None | Type: int |  |  |
| --max-prefill-tokens | 16384 | Type: int |  |  |
| --schedule-policy | fcfs | lpm, fcfs |  |  |
| --enable-priority-scheduling | False | bool flag (set to enable) |  |  |
| --schedule-low-priority-values-first | False | bool flag (set to enable) |  |  |
| --priority-scheduling-preemption-threshold | 10 | Type: int |  |  |
| --schedule-conservativeness | 1.0 | Type: float |  |  |
| --page-size | 128 | Type: int |  |  |
| --hybrid-kvcache-ratio | None | Optional[float] | × | × |
| --swa-full-tokens-ratio | 0.8 | Type: float | × | × |
| --disable-hybrid-swa-memory | False | bool flag (set to enable) | × | × |
| --abort-on-priority-when-disabled | False | bool flag (set to enable) | × | × |
| --enable-dynamic-chunking | False | bool flag (set to enable) | × | × |

Runtime options

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --device | None | Type: str |  |  |
| --tensor-parallel-size, --tp-size | 1 | Type: int |  |  |
| --pipeline-parallel-size, --pp-size | 1 | Type: int | × | × |
| --pp-max-micro-batch-size | None | Type: int | × | × |
| --pp-async-batch-depth | None | Type: int | × | × |
| --stream-interval | 1 | Type: int |  |  |
| --stream-output | False | bool flag (set to enable) |  |  |
| --random-seed | None | Type: int |  |  |
| --constrained-json-whitespace-pattern | None | Type: str | × | × |
| --constrained-json-disable-any-whitespace | False | bool flag (set to enable) | × | × |
| --watchdog-timeout | 300 | Type: float |  |  |
| --soft-watchdog-timeout | 300 | Type: float |  |  |
| --dist-timeout | None | Type: int |  |  |
| --base-gpu-id | 0 | Type: int |  |  |
| --gpu-id-step | 1 | Type: int |  |  |
| --sleep-on-idle | False | bool flag (set to enable) |  |  |
| --custom-sigquit-handler | None | Optional[Callable] | × | × |

Logging

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --log-level | info | Type: str |  |  |
| --log-level-http | None | Type: str |  |  |
| --log-requests | False | bool flag (set to enable) |  |  |
| --log-requests-level | 2 | 0, 1, 2, 3 |  |  |
| --log-requests-format | text | text, json | × | × |
| --crash-dump-folder | None | Type: str | × | × |
| --crash-on-nan | False | Type: str | × | × |
| --enable-metrics | False | bool flag (set to enable) | × | × |
| --enable-metrics-for-all-schedulers | False | bool flag (set to enable) | × | × |
| --tokenizer-metrics-custom-labels-header | x-custom-labels | Type: str | × | × |
| --tokenizer-metrics-allowed-custom-labels | None | List[str] | × | × |
| --bucket-time-to-first-token | None | List[float] | × | × |
| --bucket-inter-token-latency | None | List[float] | × | × |
| --bucket-e2e-request-latency | None | List[float] | × | × |
| --collect-tokens-histogram | False | bool flag (set to enable) | × | × |
| --prompt-tokens-buckets | None | List[str] | × | × |
| --generation-tokens-buckets | None | List[str] | × | × |
| --gc-warning-threshold-secs | 0.0 | Type: float | × | × |
| --decode-log-interval | 40 | Type: int |  |  |
| --enable-request-time-stats-logging | False | bool flag (set to enable) |  |  |
| --kv-events-config | None | Type: str | × | × |
| --enable-trace | False | bool flag (set to enable) | × | × |
| --oltp-traces-endpoint | localhost:4317 | Type: str | × | × |
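
One way to use the A2 and A3 columns is as a pre-flight check on a command line that was written for another platform. The sketch below assumes that × marks an argument that is not supported on that hardware. The flag set is a hand-copied subset of the Logging table, and `find_unsupported` is a hypothetical helper.

```python
# Flags marked × in the Logging table (a hand-copied subset, for illustration only).
UNSUPPORTED_LOGGING_FLAGS = {
    "--log-requests-format",
    "--crash-dump-folder",
    "--enable-metrics",
    "--enable-trace",
    "--kv-events-config",
}


def find_unsupported(argv):
    """Return the flags in argv that appear in the unsupported set, in order of appearance."""
    found = []
    for token in argv:
        flag = token.split("=", 1)[0]  # also handles the --flag=value spelling
        if flag in UNSUPPORTED_LOGGING_FLAGS:
            found.append(flag)
    return found


argv = ["--log-level", "info", "--enable-metrics", "--log-requests-format=json"]
print(find_unsupported(argv))
```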

RequestMetricsExporter configuration

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --export-metrics-to-file | False | bool flag (set to enable) | × | × |
| --export-metrics-to-file-dir | None | Type: str | × | × |

Data parallelism

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --data-parallel-size, --dp-size | 1 | Type: int |  |  |
| --load-balance-method | round_robin | round_robin, shortest_queue, minimum_tokens |  |  |
| --prefill-round-robin-balance | False | bool flag (set to enable) |  |  |

Multi-node distributed serving

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --dist-init-addr, --nccl-init-addr | None | Type: str |  |  |
| --nnodes | 1 | Type: int |  |  |
| --node-rank | 0 | Type: int |  |  |
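
To show how the three multi-node arguments fit together, the sketch below generates one argument list per node. It assumes that every node receives the same `--dist-init-addr` and that ranks run from 0 to `nnodes - 1`; this page does not spell out either point. The address is a placeholder and `multi_node_args` is a hypothetical helper.

```python
def multi_node_args(dist_init_addr, nnodes):
    """Return one argument list per node, with ranks numbered from 0 (assumed)."""
    if nnodes < 1:
        raise ValueError("nnodes must be at least 1")
    return [
        ["--dist-init-addr", dist_init_addr, "--nnodes", str(nnodes), "--node-rank", str(rank)]
        for rank in range(nnodes)
    ]


for args in multi_node_args("10.0.0.1:5000", 2):  # placeholder address
    print(" ".join(args))
```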

Model override args

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --json-model-override-args | {} | Type: str |  |  |
| --preferred-sampling-params | None | Type: str |  |  |

LoRA

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --enable-lora | False | bool flag (set to enable) |  |  |
| --max-lora-rank | None | Type: int |  |  |
| --lora-target-modules | None | all |  |  |
| --lora-paths | None | Type: List[str] / JSON objects |  |  |
| --max-loras-per-batch | 8 | Type: int |  |  |
| --max-loaded-loras | None | Type: int |  |  |
| --lora-eviction-policy | lru | lru, fifo |  |  |
| --lora-backend | triton | triton |  |  |
| --max-lora-chunk-size | 16 | 16, 32, 64, 128 | × | × |

Kernel Backends (Attention, Sampling, Grammar, GEMM)

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --attention-backend | None | ascend |  |  |
| --prefill-attention-backend | None | ascend |  |  |
| --decode-attention-backend | None | ascend |  |  |
| --sampling-backend | None | pytorch, ascend |  |  |
| --grammar-backend | None | xgrammar |  |  |
| --mm-attention-backend | None | ascend_attn |  |  |
| --nsa-prefill-backend | flashmla_sparse | flashmla_sparse, flashmla_decode, fa3, tilelang, aiter | × | × |
| --nsa-decode-backend | fa3 | flashmla_prefill, flashmla_kv, fa3, tilelang, aiter | × | × |
| --fp8-gemm-backend | auto | auto, deep_gemm, flashinfer_trtllm, cutlass, triton, aiter | × | × |
| --disable-flashinfer-autotune | False | bool flag (set to enable) | × | × |

Speculative decoding

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --speculative-algorithm | None | EAGLE3, NEXTN |  |  |
| --speculative-draft-model-path, --speculative-draft-model | None | Type: str |  |  |
| --speculative-draft-model-revision | None | Type: str |  |  |
| --speculative-draft-load-format | None | auto, pt, safetensors, npcache, dummy, sharded_state, gguf, bitsandbytes, layered, flash_rl, remote, remote_instance, fastsafetensors, private | × | × |
| --speculative-num-steps | None | Type: int |  |  |
| --speculative-eagle-topk | None | Type: int |  |  |
| --speculative-num-draft-tokens | None | Type: int |  |  |
| --speculative-accept-threshold-single | 1.0 | Type: float |  |  |
| --speculative-accept-threshold-acc | 1.0 | Type: float |  |  |
| --speculative-token-map | None | Type: str | × | × |
| --speculative-attention-mode | prefill | prefill, decode |  |  |
| --speculative-moe-runner-backend | None | auto, deep_gemm, triton, triton_kernel, flashinfer_trtllm, flashinfer_cutlass, flashinfer_mxfp4, flashinfer_cutedsl, cutlass |  |  |
| --speculative-moe-a2a-backend | None | none, deepep, mooncake, ascend_fuseep |  |  |
| --speculative-draft-attention-backend | None | Type: str |  |  |
| --speculative-draft-model-quantization | None | awq, fp8, gptq, marlin, gptq_marlin, awq_marlin, bitsandbytes, gguf, modelopt, modelopt_fp8, modelopt_fp4, petit_nvfp4, w8a8_int8, w8a8_fp8, moe_wna16, qoq, w4afp8, mxfp4, auto-round, compressed-tensors, modelslim, unquant |  |  |
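
The speculative decoding arguments are usually passed as a group. The sketch below assembles such a group and rejects algorithms that the table does not list. The draft model path and the numeric values are placeholders chosen only to show the shape of the arguments, not recommended settings, and `speculative_args` is a hypothetical helper. Whether every algorithm needs a separate draft model is not stated on this page.

```python
SUPPORTED_ALGORITHMS = ("EAGLE3", "NEXTN")  # Options column of --speculative-algorithm


def speculative_args(algorithm, draft_model_path, num_steps, eagle_topk, num_draft_tokens):
    """Build the speculative decoding arguments, rejecting algorithms not listed in the table."""
    if algorithm not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"{algorithm!r} is not listed; expected one of {SUPPORTED_ALGORITHMS}")
    return [
        "--speculative-algorithm", algorithm,
        "--speculative-draft-model-path", draft_model_path,
        "--speculative-num-steps", str(num_steps),
        "--speculative-eagle-topk", str(eagle_topk),
        "--speculative-num-draft-tokens", str(num_draft_tokens),
    ]


# Placeholder path and values.
print(" ".join(speculative_args("EAGLE3", "/path/to/draft", 3, 1, 4)))
```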

Ngram speculative decoding

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --speculative-ngram-min-match-window-size | 1 | Type: int | × | × |
| --speculative-ngram-max-match-window-size | 12 | Type: int | × | × |
| --speculative-ngram-min-bfs-breadth | 1 | Type: int | × | × |
| --speculative-ngram-max-bfs-breadth | 10 | Type: int | × | × |
| --speculative-ngram-match-type | BFS | BFS, PROB | × | × |
| --speculative-ngram-branch-length | 18 | Type: int | × | × |
| --speculative-ngram-capacity | 10000000 | Type: int | × | × |

Expert parallelism

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --expert-parallel-size, --ep-size, --ep | 1 | Type: int |  |  |
| --moe-a2a-backend | none | none, deepep, ascend_fuseep |  |  |
| --moe-runner-backend | auto | auto, triton |  |  |
| --flashinfer-mxfp4-moe-precision | default | default, bf16 | × | × |
| --enable-flashinfer-allreduce-fusion | False | bool flag (set to enable) | × | × |
| --deepep-mode | auto | normal, low_latency, auto |  |  |
| --deepep-config | None | Type: str | × | × |
| --ep-num-redundant-experts | 0 | Type: int | × | × |
| --ep-dispatch-algorithm | None | Type: str | × | × |
| --init-expert-location | trivial | Type: str | × | × |
| --enable-eplb | False | bool flag (set to enable) | × | × |
| --eplb-algorithm | auto | Type: str | × | × |
| --eplb-rebalance-layers-per-chunk | None | Type: int | × | × |
| --eplb-min-rebalancing-utilization-threshold | 1.0 | Type: float | × | × |
| --expert-distribution-recorder-mode | None | Type: str | × | × |
| --expert-distribution-recorder-buffer-size | None | Type: int | × | × |
| --enable-expert-distribution-metrics | False | bool flag (set to enable) | × | × |
| --moe-dense-tp-size | None | Type: int |  |  |
| --elastic-ep-backend | None | none, mooncake | × | × |
| --mooncake-ib-device | None | Type: str | × | × |

Mamba Cache

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --max-mamba-cache-size | None | Type: int | × | × |
| --mamba-ssm-dtype | float32 | float32, bfloat16 | × | × |
| --mamba-full-memory-ratio | 0.2 | Type: float | × | × |
| --mamba-scheduler-strategy | auto | auto, no_buffer, extra_buffer | × | × |
| --mamba-track-interval | 256 | Type: int | × | × |

Hierarchical cache

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --enable-hierarchical-cache | False | bool flag (set to enable) |  |  |
| --hicache-ratio | 2.0 | Type: float |  |  |
| --hicache-size | 0 | Type: int |  |  |
| --hicache-write-policy | write_through | write_back, write_through, write_through_selective |  |  |
| --radix-eviction-policy | lru | lru, lfu |  |  |
| --hicache-io-backend | kernel | kernel_ascend, direct |  |  |
| --hicache-mem-layout | layer_first | page_first_direct, page_first_kv_split |  |  |
| --hicache-storage-backend | None | file |  |  |
| --hicache-storage-prefetch-policy | best_effort | best_effort, wait_complete, timeout | × | × |
| --hicache-storage-backend-extra-config | None | Type: str | × | × |

LMCache

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --enable-lmcache | False | bool flag (set to enable) | × | × |

Ktransformer server args

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --kt-weight-path | None | Type: str | × | × |
| --kt-method | AMXINT4 | Type: str | × | × |
| --kt-cpuinfer | None | Type: int | × | × |
| --kt-threadpool-count | 2 | Type: int | × | × |
| --kt-num-gpu-experts | None | Type: int | × | × |
| --kt-max-deferred-experts-per-token | None | Type: int | × | × |

Double Sparsity

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --enable-double-sparsity | False | bool flag (set to enable) | × | × |
| --ds-channel-config-path | None | Type: str | × | × |
| --ds-heavy-channel-num | 32 | Type: int | × | × |
| --ds-heavy-token-num | 256 | Type: int | × | × |
| --ds-heavy-channel-type | qk | Type: str | × | × |
| --ds-sparse-decode-threshold | 4096 | Type: int | × | × |

Offloading

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --cpu-offload-gb | 0 | Type: int |  |  |
| --offload-group-size | -1 | Type: int | × | × |
| --offload-num-in-group | 1 | Type: int | × | × |
| --offload-prefetch-step | 1 | Type: int | × | × |
| --offload-mode | cpu | Type: str | × | × |

Args for multi-item scoring

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --multi-item-scoring-delimiter | None | Type: int | × | × |

Optimization/debug options

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --disable-radix-cache | False | bool flag (set to enable) |  |  |
| --cuda-graph-max-bs | None | Type: int |  |  |
| --cuda-graph-bs | None | List[int] |  |  |
| --disable-cuda-graph | False | bool flag (set to enable) |  |  |
| --disable-cuda-graph-padding | False | bool flag (set to enable) |  |  |
| --enable-profile-cuda-graph | False | bool flag (set to enable) |  |  |
| --enable-cudagraph-gc | False | bool flag (set to enable) | × | × |
| --enable-nccl-nvls | False | bool flag (set to enable) | × | × |
| --enable-symm-mem | False | bool flag (set to enable) | × | × |
| --disable-flashinfer-cutlass-moe-fp4-allgather | False | bool flag (set to enable) | × | × |
| --enable-tokenizer-batch-encode | False | bool flag (set to enable) |  |  |
| --disable-tokenizer-batch-encode | False | bool flag (set to enable) | × | × |
| --disable-outlines-disk-cache | False | bool flag (set to enable) |  |  |
| --disable-custom-all-reduce | False | bool flag (set to enable) |  |  |
| --enable-mscclpp | False | bool flag (set to enable) | × | × |
| --enable-torch-symm-mem | False | bool flag (set to enable) | × | × |
| --disable-overlap-schedule | False | bool flag (set to enable) |  |  |
| --enable-mixed-chunk | False | bool flag (set to enable) |  |  |
| --enable-dp-attention | False | bool flag (set to enable) |  |  |
| --enable-dp-lm-head | False | bool flag (set to enable) |  |  |
| --enable-two-batch-overlap | False | bool flag (set to enable) | × | × |
| --enable-single-batch-overlap | False | bool flag (set to enable) | × | × |
| --tbo-token-distribution-threshold | 0.48 | Type: float | × | × |
| --enable-torch-compile | False | bool flag (set to enable) |  |  |
| --enable-torch-compile-debug-mode | False | bool flag (set to enable) | × | × |
| --enable-piecewise-cuda-graph | False | bool flag (set to enable) | × | × |
| --piecewise-cuda-graph-tokens | None | Type: JSON list | × | × |
| --piecewise-cuda-graph-compiler | eager | eager, inductor | × | × |
| --torch-compile-max-bs | 32 | Type: int | × | × |
| --piecewise-cuda-graph-max-tokens | 4096 | Type: int | × | × |
| --torchao-config | "" | Type: str | × | × |
| --enable-nan-detection | False | bool flag (set to enable) | × | × |
| --enable-p2p-check | False | bool flag (set to enable) | × | × |
| --triton-attention-reduce-in-fp32 | False | bool flag (set to enable) | × | × |
| --triton-attention-num-kv-splits | 8 | Type: int | × | × |
| --triton-attention-split-tile-size | None | Type: int | × | × |
| --num-continuous-decode-steps | 1 | Type: int | × | × |
| --delete-ckpt-after-loading | False | bool flag (set to enable) | × | × |
| --enable-memory-saver | False | bool flag (set to enable) | × | × |
| --enable-weights-cpu-backup | False | bool flag (set to enable) | × | × |
| --enable-draft-weights-cpu-backup | False | bool flag (set to enable) | × | × |
| --allow-auto-truncate | False | bool flag (set to enable) |  |  |
| --enable-custom-logit-processor | False | bool flag (set to enable) | × | × |
| --flashinfer-mla-disable-ragged | False | bool flag (set to enable) | × | × |
| --disable-shared-experts-fusion | False | bool flag (set to enable) | × | × |
| --disable-chunked-prefix-cache | False | bool flag (set to enable) | × | × |
| --disable-fast-image-processor | False | bool flag (set to enable) | × | × |
| --keep-mm-feature-on-device | False | bool flag (set to enable) | × | × |
| --enable-return-hidden-states | False | bool flag (set to enable) |  |  |
| --enable-return-routed-experts | False | bool flag (set to enable) | × | × |
| --scheduler-recv-interval | 1 | Type: int | × | × |
| --numa-node | None | List[int] | × | × |
| --rl-on-policy-target | None | fsdp | × | × |
| --enable-layerwise-nvtx-marker | False | bool flag (set to enable) | × | × |
| --enable-attn-tp-input-scattered | False | bool flag (set to enable) | × | × |
| --enable-nsa-prefill-context-parallel | False | bool flag (set to enable) | × | × |
| --enable-fused-qk-norm-rope | False | bool flag (set to enable) | × | × |

Dynamic batch tokenizer

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --enable-dynamic-batch-tokenizer | False | bool flag (set to enable) |  |  |
| --dynamic-batch-tokenizer-batch-size | 32 | Type: int |  |  |
| --dynamic-batch-tokenizer-batch-timeout | 0.002 | Type: float |  |  |

Debug tensor dumps

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --debug-tensor-dump-output-folder | None | Type: str | × | × |
| --debug-tensor-dump-layers | None | List[int] | × | × |
| --debug-tensor-dump-input-file | None | Type: str |  |  |
| --debug-tensor-dump-inject | False | Type: str | × | × |

PD disaggregation

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --disaggregation-mode | null | null, prefill, decode |  |  |
| --disaggregation-transfer-backend | mooncake | ascend |  |  |
| --disaggregation-bootstrap-port | 8998 | Type: int |  |  |
| --disaggregation-decode-tp | None | Type: int |  |  |
| --disaggregation-decode-dp | None | Type: int |  |  |
| --disaggregation-prefill-pp | 1 | Type: int | × | × |
| --disaggregation-ib-device | None | Type: str | × | × |
| --disaggregation-decode-enable-offload-kvcache | False | bool flag (set to enable) | × | × |
| --disaggregation-decode-enable-fake-auto | False | bool flag (set to enable) | × | × |
| --num-reserved-decode-tokens | 512 | Type: int |  |  |
| --disaggregation-decode-polling-interval | 1 | Type: int |  |  |
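
The table lists `mooncake` as the default transfer backend but only `ascend` in the Options column, so a deployment on Ascend NPU would presumably set the backend explicitly. The sketch below builds the role-specific arguments under that assumption. `disaggregation_args` is a hypothetical helper, and which role needs the bootstrap port is not stated on this page, so the port is left optional.

```python
def disaggregation_args(mode, bootstrap_port=None):
    """Build PD disaggregation arguments for one role ('prefill' or 'decode')."""
    if mode not in ("prefill", "decode"):
        raise ValueError("mode must be 'prefill' or 'decode'")
    args = [
        "--disaggregation-mode", mode,
        "--disaggregation-transfer-backend", "ascend",  # only value in the Options column
    ]
    if bootstrap_port is not None:
        # Included only when the caller asks for it; the table default is 8998.
        args += ["--disaggregation-bootstrap-port", str(bootstrap_port)]
    return args


print(" ".join(disaggregation_args("prefill", bootstrap_port=8998)))
print(" ".join(disaggregation_args("decode")))
```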

Encode prefill disaggregation

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --encoder-only | False | bool flag (set to enable) | × | × |
| --language-only | False | bool flag (set to enable) | × | × |
| --encoder-transfer-backend | zmq_to_scheduler | zmq_to_scheduler, zmq_to_tokenizer, mooncake | × | × |
| --encoder-urls | [] | List[str] | × | × |

Custom weight loader

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --custom-weight-loader | None | List[str] | × | × |
| --weight-loader-disable-mmap | False | bool flag (set to enable) |  |  |
| --remote-instance-weight-loader-seed-instance-ip | None | Type: str | × | × |
| --remote-instance-weight-loader-seed-instance-service-port | None | Type: int | × | × |
| --remote-instance-weight-loader-send-weights-group-ports | None | Type: JSON list | × | × |
| --remote-instance-weight-loader-backend | nccl | transfer_engine, nccl | × | × |
| --remote-instance-weight-loader-start-seed-via-transfer-engine | False | bool flag (set to enable) | × | × |

For PD-Multiplexing

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --enable-pdmux | False | bool flag (set to enable) | × | × |
| --pdmux-config-path | None | Type: str | × | × |
| --sm-group-num | 8 | Type: int | × | × |

For Multi-Modal

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --mm-max-concurrent-calls | 32 | Type: int | × | × |
| --mm-per-request-timeout | 10.0 | Type: float | × | × |
| --enable-broadcast-mm-inputs-process | False | bool flag (set to enable) | × | × |
| --mm-process-config | None | Type: JSON / Dict | × | × |
| --mm-enable-dp-encoder | False | bool flag (set to enable) | × | × |
| --limit-mm-data-per-request | None | Type: JSON / Dict |  |  |

For checkpoint decryption

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --decrypted-config-file | None | Type: str | × | × |
| --decrypted-draft-config-file | None | Type: str | × | × |
| --enable-prefix-mm-cache | False | bool flag (set to enable) | × | × |

For deterministic inference

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --enable-deterministic-inference | False | bool flag (set to enable) | × | × |

For registering hooks

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --forward-hooks | None | Type: JSON list | × | × |

Configuration file support

| Argument | Defaults | Options | A2 | A3 |
| --- | --- | --- | --- | --- |
| --config | None | Type: str | × | × |