This section describes the basic functions and features supported on the Ascend NPU. If you encounter issues or have any questions, please open an issue. For the meaning and usage of each parameter, see Server Arguments.

Model and tokenizer

Argument | Defaults | Options | Server supported
--model-path, --model | None | Type: str | A2, A3
--tokenizer-path | None | Type: str | A2, A3
--tokenizer-mode | auto | auto, slow | A2, A3
--tokenizer-worker-num | 1 | Type: int | A2, A3
--skip-tokenizer-init | False | bool flag (set to enable) | A2, A3
--load-format | auto | auto, safetensors, gguf | A2, A3
--model-loader-extra-config | {} | Type: str | A2, A3
--trust-remote-code | False | bool flag (set to enable) | A2, A3
--context-length | None | Type: int | A2, A3
--is-embedding | False | bool flag (set to enable) | A2, A3
--enable-multimodal | None | bool flag (set to enable) | A2, A3
--revision | None | Type: str | A2, A3
--model-impl | auto | auto, sglang, transformers | A2, A3

HTTP server

Argument | Defaults | Options | Server supported
--host | 127.0.0.1 | Type: str | A2, A3
--port | 30000 | Type: int | A2, A3
--skip-server-warmup | False | bool flag (set to enable) | A2, A3
--warmups | None | Type: str | A2, A3
--nccl-port | None | Type: int | A2, A3
--fastapi-root-path | None | Type: str | A2, A3
--grpc-mode | False | False | Planned
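The tables mark most switches as "bool flag (set to enable)" and give many other options a default of None. As a minimal sketch of what that means on the command line, the snippet below turns a dictionary of options into an argument list without starting a server. The helper name `build_launch_command` and the model path are hypothetical; `python -m sglang.launch_server` is assumed to be the entry point in use.

```python
import shlex
import sys


def build_launch_command(options):
    """Render a dict of server options as an argv list (illustrative helper)."""
    argv = [sys.executable, "-m", "sglang.launch_server"]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is None or value is False:
            # None keeps the server default; a False bool flag is simply omitted.
            continue
        if value is True:
            # Bool flags take no value: their presence enables the feature.
            argv.append(flag)
        else:
            argv.extend([flag, str(value)])
    return argv


command = build_launch_command(
    {
        "model_path": "/path/to/model",  # hypothetical local path
        "host": "127.0.0.1",
        "port": 30000,
        "trust_remote_code": True,
        "skip_server_warmup": False,
        "context_length": None,
    }
)
print(shlex.join(command))
```

Leaving an option out is the simplest way to keep its documented default, which is why the helper drops None and False values rather than passing them explicitly.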

SSL/TLS

Argument | Defaults | Options | Server supported
--ssl-keyfile | None | Type: str | A2, A3
--ssl-certfile | None | Type: str | A2, A3
--ssl-keyfile-password | None | Type: str | A2, A3
--enable-ssl-refresh | False | bool flag (set to enable) | A2, A3
--enable-http2 | False | bool flag (set to enable) | A2, A3

Quantization and data type

Argument | Defaults | Options | Server supported
--dtype | auto | auto, float16, bfloat16 | A2, A3
--quantization | None | modelslim | A2, A3
--quantization-param-path | None | Type: str | Special for GPU
--kv-cache-dtype | auto | auto | A2, A3
--enable-fp32-lm-head | False | bool flag (set to enable) | A2, A3
--modelopt-quant | None | Type: str | Special for GPU
--modelopt-checkpoint-restore-path | None | Type: str | Special for GPU
--modelopt-checkpoint-save-path | None | Type: str | Special for GPU
--modelopt-export-path | None | Type: str | Special for GPU
--quantize-and-serve | False | bool flag (set to enable) | Special for GPU
--rl-quant-profile | None | Type: str | Special for GPU

Memory and scheduling

Argument | Defaults | Options | Server supported
--mem-fraction-static | None | Type: float | A2, A3
--max-running-requests | None | Type: int | A2, A3
--prefill-max-requests | None | Type: int | A2, A3
--max-queued-requests | None | Type: int | A2, A3
--max-total-tokens | None | Type: int | A2, A3
--chunked-prefill-size | None | Type: int | A2, A3
--max-prefill-tokens | 16384 | Type: int | A2, A3
--schedule-policy | fcfs | lpm, fcfs | A2, A3
--enable-priority-scheduling | False | bool flag (set to enable) | A2, A3
--disable-priority-preemption | False | bool flag (set to enable) | A2, A3
--default-priority-value | None | Type: int | A2, A3
--schedule-low-priority-values-first | False | bool flag (set to enable) | A2, A3
--priority-scheduling-preemption-threshold | 10 | Type: int | A2, A3
--schedule-conservativeness | 1.0 | Type: float | A2, A3
--page-size | 128 | Type: int | A2, A3
--swa-full-tokens-ratio | 0.8 | Type: float | Planned
--disable-hybrid-swa-memory | False | bool flag (set to enable) | Planned
--radix-eviction-policy | lru | lru, lfu | A2, A3
--enable-prefill-delayer | False | bool flag (set to enable) | A2, A3
--prefill-delayer-max-delay-passes | 30 | Type: int | A2, A3
--prefill-delayer-token-usage-low-watermark | None | Type: float | A2, A3
--prefill-delayer-forward-passes-buckets | None | List[float] | A2, A3
--prefill-delayer-wait-seconds-buckets | None | List[float] | A2, A3
--abort-on-priority-when-disabled | False | bool flag (set to enable) | A2, A3
--enable-dynamic-chunking | False | bool flag (set to enable) | Experimental

Runtime options

Argument | Defaults | Options | Server supported
--device | None | Type: str | A2, A3
--tensor-parallel-size, --tp-size | 1 | Type: int | A2, A3
--pipeline-parallel-size, --pp-size | 1 | Type: int; Currently 2 not supported | Experimental
--attention-context-parallel-size, --attn-cp-size | 1 | Type: int; must be equal to --tp-size | A2, A3
--moe-data-parallel-size, --moe-dp-size | 1 | Type: int | Planned
--pp-max-micro-batch-size | None | Type: int | Experimental
--pp-async-batch-depth | None | Type: int | Experimental
--stream-interval | 1 | Type: int | A2, A3
--incremental-streaming-output | False | bool flag (set to enable) | A2, A3
--stream-response-default-include-usage | False | bool flag (set to enable) | A2, A3
--enable-streaming-session | False | bool flag (set to enable) | A2, A3
--random-seed | None | Type: int | A2, A3
--constrained-json-whitespace-pattern | None | Type: str | A2, A3
--constrained-json-disable-any-whitespace | False | bool flag (set to enable) | A2, A3
--watchdog-timeout | 300 | Type: float | A2, A3
--soft-watchdog-timeout | 300 | Type: float | A2, A3
--dist-timeout | None | Type: int | A2, A3
--download-dir | None | Type: str | A2, A3
--model-checksum | None | Type: str | Planned
--base-gpu-id | 0 | Type: int | A2, A3
--gpu-id-step | 1 | Type: int | A2, A3
--sleep-on-idle | False | bool flag (set to enable) | A2, A3
--use-ray | False | bool flag (set to enable) | A2, A3
--custom-sigquit-handler | None | Only for engine | A2, A3
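The table above states that --attn-cp-size must be equal to --tp-size. A small pre-flight check along those lines might look like the sketch below. The function name is hypothetical, and treating the default value 1 as "context parallelism not in use" is an assumption, not something the table confirms.

```python
def check_parallel_sizes(tp_size=1, attn_cp_size=1):
    """Return a list of problems found in the parallelism settings."""
    problems = []
    if tp_size < 1 or attn_cp_size < 1:
        problems.append("parallel sizes must be positive integers")
    # The table requires --attn-cp-size to equal --tp-size.
    # Assumption: leaving --attn-cp-size at its default of 1 is always allowed.
    if attn_cp_size != 1 and attn_cp_size != tp_size:
        problems.append(
            f"--attn-cp-size ({attn_cp_size}) must be equal to --tp-size ({tp_size})"
        )
    return problems


for tp, cp in [(8, 8), (8, 1), (8, 4)]:
    print(tp, cp, check_parallel_sizes(tp, cp) or "ok")
```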

Logging

Argument | Defaults | Options | Server supported
--log-level | info | Type: str | A2, A3
--log-level-http | None | Type: str | A2, A3
--log-requests | False | bool flag (set to enable) | A2, A3
--log-requests-level | 2 | 0, 1, 2, 3 | A2, A3
--log-requests-format | text | text, json | A2, A3
--crash-dump-folder | None | Type: str | A2, A3
--enable-metrics | False | bool flag (set to enable) | A2, A3
--enable-mfu-metrics | False | bool flag (set to enable) | A2, A3
--enable-metrics-for-all-schedulers | False | bool flag (set to enable) | A2, A3
--tokenizer-metrics-custom-labels-header | x-custom-labels | Type: str | A2, A3
--tokenizer-metrics-allowed-custom-labels | None | List[str] | A2, A3
--extra-metric-labels | None | Type: JSON/Dict | A2, A3
--bucket-time-to-first-token | None | List[float] | A2, A3
--bucket-inter-token-latency | None | List[float] | A2, A3
--bucket-e2e-request-latency | None | List[float] | A2, A3
--collect-tokens-histogram | False | bool flag (set to enable) | A2, A3
--prompt-tokens-buckets | None | List[str] | A2, A3
--generation-tokens-buckets | None | List[str] | A2, A3
--gc-warning-threshold-secs | 0.0 | Type: float | A2, A3
--decode-log-interval | 40 | Type: int | A2, A3
--enable-request-time-stats-logging | False | bool flag (set to enable) | A2, A3
--kv-events-config | None | Type: str | Special for GPU
--enable-trace | False | bool flag (set to enable) | A2, A3
--oltp-traces-endpoint | localhost:4317 | Type: str | A2, A3
--log-requests-target | None | Type: str | A2, A3
--uvicorn-access-log-exclude-prefixes | [] | List[str] | A2, A3

RequestMetricsExporter configuration

Argument | Defaults | Options | Server supported
--export-metrics-to-file | False | bool flag (set to enable) | A2, A3
--export-metrics-to-file-dir | None | Type: str | A2, A3

API related

Argument | Defaults | Options | Server supported
--api-key | None | Type: str | A2, A3
--admin-api-key | None | Type: str | A2, A3
--served-model-name | None | Type: str | A2, A3
--weight-version | default | Type: str | A2, A3
--chat-template | None | Type: str | A2, A3
--hf-chat-template-name | None | Type: str | A2, A3
--completion-template | None | Type: str | A2, A3
--file-storage-path | sglang_storage | Type: str | Unused reserved parameter
--enable-cache-report | False | bool flag (set to enable) | A2, A3
--reasoning-parser | None | deepseek-r1, deepseek-v3, glm45, gpt-oss, kimi, qwen3, qwen3-thinking, step3 | A2, A3
--tool-call-parser | None | llama3, pythonic, qwen, qwen3_coder | A2, A3
--sampling-defaults | model | openai, model | A2, A3

Data parallelism

Argument | Defaults | Options | Server supported
--data-parallel-size, --dp-size | 1 | Type: int | A2, A3
--load-balance-method | auto | auto, round_robin, follow_bootstrap_room, total_requests, total_tokens | A2, A3

Multi-node distributed serving

Argument | Defaults | Options | Server supported
--dist-init-addr, --nccl-init-addr | None | Type: str | A2, A3
--nnodes | 1 | Type: int | A2, A3
--node-rank | 0 | Type: int | A2, A3
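For multi-node serving, every node typically runs the same command with the same --dist-init-addr and --nnodes, and only --node-rank differs. The sketch below generates one argument list per node to make that pattern concrete. The function name, model path, and address are hypothetical placeholders (192.0.2.10 is a documentation-only address), and a real deployment will usually need further options.

```python
def node_commands(model_path, dist_init_addr, nnodes, tp_size):
    """Build one launch argv per node; only --node-rank varies (illustrative)."""
    commands = []
    for node_rank in range(nnodes):
        commands.append(
            [
                "python", "-m", "sglang.launch_server",
                "--model-path", model_path,
                "--tp-size", str(tp_size),
                "--dist-init-addr", dist_init_addr,
                "--nnodes", str(nnodes),
                "--node-rank", str(node_rank),
            ]
        )
    return commands


for cmd in node_commands("/path/to/model", "192.0.2.10:5000", nnodes=2, tp_size=16):
    print(" ".join(cmd))
```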

Model override args

Argument | Defaults | Options | Server supported
--json-model-override-args | {} | Type: str | A2, A3
--preferred-sampling-params | None | Type: str | A2, A3
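Because --json-model-override-args is typed as a string with a default of {}, its value is presumably a JSON object serialized into a single argument. One way to produce such a string without quoting mistakes is sketched below. The override key is purely illustrative and is not taken from this page.

```python
import json
import shlex

# Hypothetical override; substitute keys that exist in your model's config.
overrides = {"max_position_embeddings": 32768}

# Compact separators keep the JSON free of spaces, which simplifies quoting.
value = json.dumps(overrides, separators=(",", ":"))
argv = ["--json-model-override-args", value]

# shlex.quote makes the string safe to paste into a POSIX shell.
print(" ".join(shlex.quote(part) for part in argv))
```

Passing the list form (`argv`) to a process launcher avoids shell quoting altogether; the quoted form is only needed when the command is typed or pasted into a shell.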

LoRA

Argument | Defaults | Options | Server supported
--enable-lora | False | bool flag (set to enable) | A2, A3
--enable-lora-overlap-loading | False | bool flag (set to enable) | A2, A3
--max-lora-rank | None | Type: int | A2, A3
--lora-target-modules | None | all | A2, A3
--lora-paths | None | Type: List[str] / JSON objects | A2, A3
--max-loras-per-batch | 8 | Type: int | A2, A3
--max-loaded-loras | None | Type: int | A2, A3
--lora-eviction-policy | lru | lru, fifo | A2, A3
--lora-backend | csgmv | triton, csgmv, ascend, torch_native | A2, A3
--experts-shared-outer-loras | None | Type: bool | A2, A3
--lora-use-virtual-experts | False | bool flag (set to enable) | A2, A3
--lora-strict-loading | False | Type: bool | A2, A3
--max-lora-chunk-size | 16 | 16, 32, 64, 128 | Special for GPU

Kernel Backends (Attention, Sampling, Grammar, GEMM)

Argument | Defaults | Options | Server supported
--attention-backend | None | ascend | A2, A3
--prefill-attention-backend | None | ascend | A2, A3
--decode-attention-backend | None | ascend | A2, A3
--sampling-backend | None | pytorch, ascend | A2, A3
--grammar-backend | None | xgrammar | A2, A3
--mm-attention-backend | None | ascend_attn | A2, A3
--dsa-prefill-backend | flashmla_sparse | flashmla_sparse, flashmla_decode, fa3, tilelang, aiter | Special for GPU
--dsa-decode-backend | fa3 | flashmla_prefill, flashmla_kv, fa3, tilelang, aiter | Special for GPU
--fp8-gemm-backend | auto | auto, deep_gemm, flashinfer_trtllm, flashinfer_cutlass, flashinfer_deepgemm, cutlass, triton, aiter | Special for GPU
--disable-flashinfer-autotune | False | bool flag (set to enable) | Special for GPU

Speculative decoding

Argument | Defaults | Options | Server supported
--speculative-algorithm | None | EAGLE3, NEXTN | A2, A3
--speculative-draft-model-path, --speculative-draft-model | None | Type: str | A2, A3
--speculative-draft-model-revision | None | Type: str, branch name, tag name, commit id | A2, A3
--speculative-draft-load-format | auto | auto, dummy | A2, A3
--speculative-num-steps | None | Type: int | A2, A3
--speculative-eagle-topk | None | Type: int | A2, A3
--speculative-num-draft-tokens | None | Type: int | A2, A3
--speculative-accept-threshold-single | 1.0 | Type: float | Special for GPU
--speculative-accept-threshold-acc | 1.0 | Type: float | Special for GPU
--speculative-token-map | None | Type: str | A2, A3
--speculative-attention-mode | prefill | prefill, decode | A2, A3
--speculative-moe-runner-backend | None | auto | A2, A3
--speculative-moe-a2a-backend | None | ascend_fuseep (the only supported value on Ascend NPU) | A2, A3
--speculative-draft-attention-backend | None | ascend | A2, A3
--speculative-draft-model-quantization | None | unquant (the only supported value for speculative decoding on Ascend NPU) | A2, A3

Ngram speculative decoding

Argument | Defaults | Options | Server supported
--speculative-ngram-min-match-window-size | 1 | Type: int | Experimental
--speculative-ngram-max-match-window-size | 12 | Type: int | Experimental
--speculative-ngram-min-bfs-breadth | 1 | Type: int | Experimental
--speculative-ngram-max-bfs-breadth | 10 | Type: int | Experimental
--speculative-ngram-match-type | BFS | BFS, PROB | Experimental. BFS uses recency-based expansion; PROB uses frequency-based expansion.
--speculative-ngram-max-trie-depth | 18 | Type: int | Experimental
--speculative-ngram-capacity | 10000000 | Type: int | Experimental
--speculative-ngram-external-corpus-path | None | Type: str | Experimental
--speculative-ngram-external-sam-budget | 0 | Type: int | Experimental
--speculative-ngram-external-corpus-max-tokens | 10000000 | Type: int | Experimental

Expert parallelism

Argument | Defaults | Options | Server supported
--expert-parallel-size, --ep-size, --ep | 1 | Type: int | A2, A3
--moe-a2a-backend | none | none, deepep, ascend_fuseep (it is incompatible with eplb) | A2, A3
--moe-runner-backend | auto | auto, triton | A2, A3
--flashinfer-mxfp4-moe-precision | default | default, bf16 | Special for GPU
--enable-flashinfer-allreduce-fusion | False | bool flag (set to enable) | Special for GPU
--deepep-mode | auto | normal, low_latency, auto | A2, A3
--deepep-config | None | Type: str | Special for GPU
--ep-num-redundant-experts | 0 | Type: int | A2, A3
--ep-dispatch-algorithm | None | static, dynamic, fake | A2, A3
--init-expert-location | trivial | trivial, <path.pt>, <path.json>, <json_string> | A2, A3
--enable-eplb | False | bool flag (set to enable) | A2, A3
--eplb-algorithm | deepseek | auto, deepseek | A2, A3
--eplb-rebalance-num-iterations | 1000 | Type: int | A2, A3
--eplb-rebalance-layers-per-chunk | None | Type: int | A2, A3
--eplb-min-rebalancing-utilization-threshold | 1.0 | Type: float | A2, A3
--expert-distribution-recorder-mode | None | stat, stat_approx, per_pass, per_token | A2, A3
--expert-distribution-recorder-buffer-size | None | Type: int | A2, A3
--enable-expert-distribution-metrics | False | bool flag (set to enable) | A2, A3
--moe-dense-tp-size | None | 1 | A2, A3
--elastic-ep-backend | None | none, mooncake | Special for GPU
--mooncake-ib-device | None | Type: str | Special for GPU
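The table notes that the ascend_fuseep value of --moe-a2a-backend is incompatible with eplb. A launcher script could catch that combination before starting the server, as in the sketch below. The function name is hypothetical, and the set of allowed backends simply mirrors the options listed for Ascend in the table.

```python
# Values listed for --moe-a2a-backend in the table above.
ASCEND_MOE_A2A_BACKENDS = {"none", "deepep", "ascend_fuseep"}


def moe_option_problems(moe_a2a_backend="none", enable_eplb=False):
    """Return a list of problems with the chosen MoE options (illustrative)."""
    problems = []
    if moe_a2a_backend not in ASCEND_MOE_A2A_BACKENDS:
        problems.append(f"unsupported --moe-a2a-backend: {moe_a2a_backend}")
    # Per the table, ascend_fuseep cannot be combined with --enable-eplb.
    if moe_a2a_backend == "ascend_fuseep" and enable_eplb:
        problems.append("ascend_fuseep is incompatible with --enable-eplb")
    return problems


print(moe_option_problems("deepep", enable_eplb=True))
print(moe_option_problems("ascend_fuseep", enable_eplb=True))
```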

Mamba Cache

Argument | Defaults | Options | Server supported
--max-mamba-cache-size | None | Type: int | A2, A3
--mamba-ssm-dtype | float32 | float32, bfloat16, float16 | A2, A3
--mamba-full-memory-ratio | 0.9 | Type: float | A2, A3
--mamba-scheduler-strategy | auto | auto, no_buffer, extra_buffer | A2, A3
--mamba-track-interval | 256 | Type: int | A2, A3

Hierarchical cache

Argument | Defaults | Options | Server supported
--enable-hierarchical-cache | False | bool flag (set to enable). Currently, mamba cache is not supported. | A2, A3
--hicache-ratio | 2.0 | Type: float | A2, A3
--hicache-size | 0 | Type: int | A2, A3
--hicache-write-policy | write_through | Currently only write_back supported | A2, A3
--hicache-io-backend | kernel | kernel_ascend, direct | A2, A3
--hicache-mem-layout | layer_first | page_first_direct, page_first_kv_split | A2, A3
--hicache-storage-backend | None | file | A2, A3
--hicache-storage-prefetch-policy | timeout | best_effort, wait_complete, timeout | Special for GPU
--hicache-storage-backend-extra-config | None | Type: str | Special for GPU

LMCache

Argument | Defaults | Options | Server supported
--enable-lmcache | False | bool flag (set to enable) | Special for GPU
--lmcache-config-file | None | Type: str | Special for GPU

Diffusion LLM

Argument | Defaults | Options | Server supported
--dllm-algorithm | None | Type: str | A2, A3
--dllm-algorithm-config | None | Type: str | A2, A3

Offloading (must be used with --disable-cuda-graph)

Argument | Defaults | Options | Server supported
--cpu-offload-gb | 0 | Type: int | A2, A3
--offload-group-size | -1 | Type: int (DeepSeek only) | A2, A3
--offload-num-in-group | 1 | Type: int (DeepSeek only) | A2, A3
--offload-prefetch-step | 1 | Type: int (DeepSeek only) | A2, A3
--offload-mode | cpu | cpu (DeepSeek only), meta (DeepSeek only), sharded_gpu (DeepSeek only, only support tp=1 dp>1) | A2, A3
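The heading of this section says offloading must be used with --disable-cuda-graph. A launcher can enforce that pairing automatically, as in this minimal sketch. The helper name is hypothetical, and the sketch assumes that a --cpu-offload-gb value of 0 (the default) means offloading is off.

```python
def with_offload(argv, cpu_offload_gb):
    """Append CPU offloading options to an argv list (illustrative helper)."""
    result = list(argv)  # copy so the caller's list is not modified
    if cpu_offload_gb > 0:
        result += ["--cpu-offload-gb", str(cpu_offload_gb)]
        # Offloading must be combined with --disable-cuda-graph.
        if "--disable-cuda-graph" not in result:
            result.append("--disable-cuda-graph")
    return result


base = ["python", "-m", "sglang.launch_server", "--model-path", "/path/to/model"]
print(" ".join(with_offload(base, cpu_offload_gb=16)))
```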

Optimization/debug options

Argument | Defaults | Options | Server supported
--disable-radix-cache | False | bool flag (set to enable) | A2, A3
--cuda-graph-max-bs | None | Type: int | A2, A3
--cuda-graph-bs | None | List[int] | A2, A3
--disable-cuda-graph | False | bool flag (set to enable) | A2, A3
--disable-cuda-graph-padding | False | bool flag (set to enable) | A2, A3
--enable-profile-cuda-graph | False | bool flag (set to enable) | A2, A3
--enable-cudagraph-gc | False | bool flag (set to enable) | A2, A3
--enable-nccl-nvls | False | bool flag (set to enable) | Special for GPU
--enable-symm-mem | False | bool flag (set to enable) | Special for GPU
--disable-flashinfer-cutlass-moe-fp4-allgather | False | bool flag (set to enable) | Special for GPU
--enable-tokenizer-batch-encode | False | bool flag (set to enable) | A2, A3
--disable-tokenizer-batch-decode | False | bool flag (set to enable) | A2, A3
--disable-custom-all-reduce | False | bool flag (set to enable) | Special for GPU
--enable-mscclpp | False | bool flag (set to enable) | Special for GPU
--enable-torch-symm-mem | False | bool flag (set to enable) | Special for GPU
--disable-overlap-schedule | False | bool flag (set to enable) | A2, A3
--enable-mixed-chunk | False | bool flag (set to enable) | A2, A3
--enable-dp-attention | False | bool flag (set to enable) | A2, A3
--enable-dp-attention-local-control-broadcast | False | bool flag (set to enable) | A2, A3
--enable-dp-lm-head | False | bool flag (set to enable) | A2, A3
--enable-two-batch-overlap | False | bool flag (set to enable) | Planned
--enable-single-batch-overlap | False | bool flag (set to enable) | A2, A3
--tbo-token-distribution-threshold | 0.48 | Type: float | Planned
--enable-torch-compile | False | bool flag (set to enable) | A2, A3
--enable-torch-compile-debug-mode | False | bool flag (set to enable) | A2, A3
--enforce-piecewise-cuda-graph | False | bool flag (set to enable); Currently, Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct models are supported. | A2, A3
--piecewise-cuda-graph-tokens | None | Type: JSON list | A2, A3
--piecewise-cuda-graph-compiler | eager | eager | A2, A3
--torch-compile-max-bs | 32 | Type: int | A2, A3
--piecewise-cuda-graph-max-tokens | None | Type: int | A2, A3
--torchao-config |  | Type: str | Special for GPU
--enable-nan-detection | False | bool flag (set to enable) | A2, A3
--enable-p2p-check | False | bool flag (set to enable) | Special for GPU
--triton-attention-reduce-in-fp32 | False | bool flag (set to enable) | Special for GPU
--triton-attention-num-kv-splits | 8 | Type: int | Special for GPU
--triton-attention-split-tile-size | None | Type: int | Special for GPU
--delete-ckpt-after-loading | False | bool flag (set to enable) | A2, A3
--enable-memory-saver | False | bool flag (set to enable) | A2, A3
--enable-weights-cpu-backup | False | bool flag (set to enable) | A2, A3
--enable-draft-weights-cpu-backup | False | bool flag (set to enable) | A2, A3
--allow-auto-truncate | False | bool flag (set to enable) | A2, A3
--enable-custom-logit-processor | False | bool flag (set to enable) | A2, A3
--flashinfer-mla-disable-ragged | False | bool flag (set to enable) | Special for GPU
--disable-shared-experts-fusion | True | bool flag (set to enable) | A2, A3
--enforce-shared-experts-fusion | False | bool flag (set to enable) | A2, A3
--disable-chunked-prefix-cache | True | bool flag (set to enable) | A2, A3
--disable-fast-image-processor | False | bool flag (set to enable) | A2, A3
--keep-mm-feature-on-device | False | bool flag (set to enable) | A2, A3
--enable-return-hidden-states | False | bool flag (set to enable) | A2, A3
--enable-return-routed-experts | False | bool flag (set to enable) | A2, A3
--scheduler-recv-interval | 1 | Type: int | A2, A3
--numa-node | None | List[int] | A2, A3
--enable-deterministic-inference | False | bool flag (set to enable) | Planned
--rl-on-policy-target | None | fsdp | Planned
--enable-layerwise-nvtx-marker | False | bool flag (set to enable) | Special for GPU
--enable-attn-tp-input-scattered | False | bool flag (set to enable) | Experimental
--enable-dsa-prefill-context-parallel | False | bool flag (set to enable) | A2, A3
--enable-prefill-context-parallel | False | bool flag (set to enable) | A2, A3
--prefill-cp-mode | in-seq-split | Type: str | A2, A3
--enable-fused-qk-norm-rope | False | bool flag (set to enable) | Special for GPU
--enable-precise-embedding-interpolation | False | bool flag (set to enable) | A2, A3
--gc-threshold | None | List[int] | A2, A3

Dynamic batch tokenizer

Argument | Defaults | Options | Server supported
--enable-dynamic-batch-tokenizer | False | bool flag (set to enable) | A2, A3
--dynamic-batch-tokenizer-batch-size | 32 | Type: int | A2, A3
--dynamic-batch-tokenizer-batch-timeout | 0.002 | Type: float | A2, A3

Debug tensor dumps

Argument | Defaults | Options | Server supported
--debug-tensor-dump-output-folder | None | Type: str | A2, A3
--debug-tensor-dump-layers | None | List[int] | A2, A3
--debug-tensor-dump-input-file | None | Type: str | A2, A3

PD disaggregation

Argument | Defaults | Options | Server supported
--disaggregation-mode | null | null, prefill, decode | A2, A3
--disaggregation-transfer-backend | mooncake | ascend | A2, A3
--disaggregation-bootstrap-port | 8998 | Type: int | A2, A3
--disaggregation-ib-device | None | Type: str | Special for GPU
--disaggregation-decode-enable-offload-kvcache | False | False | A2, A3
--num-reserved-decode-tokens | 512 | Type: int | A2, A3
--disaggregation-decode-polling-interval | 1 | Type: int | A2, A3
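In PD disaggregation, prefill and decode run as separate server instances that differ in --disaggregation-mode, and on Ascend the only transfer backend listed above is ascend. The sketch below builds the two argument lists side by side. The function name and model path are hypothetical, passing --disaggregation-bootstrap-port only to the prefill instance is an assumption, and a working deployment will need additional options (ports, routing) that are not shown.

```python
def pd_commands(model_path, bootstrap_port=8998):
    """Return (prefill_argv, decode_argv) for a PD-disaggregated pair (sketch)."""
    base = ["python", "-m", "sglang.launch_server", "--model-path", model_path]
    transfer = ["--disaggregation-transfer-backend", "ascend"]
    prefill = base + ["--disaggregation-mode", "prefill"] + transfer + [
        # Assumption: the bootstrap port is configured on the prefill side.
        "--disaggregation-bootstrap-port", str(bootstrap_port),
    ]
    decode = base + ["--disaggregation-mode", "decode"] + transfer
    return prefill, decode


prefill_argv, decode_argv = pd_commands("/path/to/model")
print(" ".join(prefill_argv))
print(" ".join(decode_argv))
```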

Encode prefill disaggregation

Argument | Defaults | Options | Server supported
--enable-adaptive-dispatch-to-encoder | False | bool flag (set to enable adaptive dispatch) | A2, A3
--encoder-only | False | bool flag (set to launch an encoder-only server) | A2, A3
--language-only | False | bool flag (set to load weights for the language model only) | A2, A3
--encoder-transfer-backend | zmq_to_scheduler | zmq_to_scheduler, zmq_to_tokenizer, mooncake | A2, A3
--encoder-urls | [] | List[str] (list of encoder server URLs) | A2, A3

Custom weight loader

Argument | Defaults | Options | Server supported
--custom-weight-loader | None | List[str] | A2, A3
--weight-loader-disable-mmap | False | bool flag (set to enable) | A2, A3
--weight-loader-prefetch-checkpoints | False | bool flag (set to enable) | A2, A3
--weight-loader-prefetch-num-threads | 4 | Type: int | A2, A3
--remote-instance-weight-loader-seed-instance-ip | None | Type: str | Special for GPU
--remote-instance-weight-loader-seed-instance-service-port | None | Type: int | Special for GPU
--remote-instance-weight-loader-send-weights-group-ports | None | Type: JSON list | Special for GPU
--remote-instance-weight-loader-backend | nccl | transfer_engine, nccl | Special for GPU
--remote-instance-weight-loader-start-seed-via-transfer-engine | False | bool flag (set to enable) | Special for GPU

For PD-Multiplexing

Argument | Defaults | Options | Server supported
--enable-pdmux | False | bool flag (set to enable) | Special for GPU
--pdmux-config-path | None | Type: str | Special for GPU
--sm-group-num | 8 | Type: int | Special for GPU

For Multi-Modal

Argument | Defaults | Options | Server supported
--enable-broadcast-mm-inputs-process | False | bool flag (set to enable) | A2, A3
--mm-process-config | None | Type: JSON / Dict | A2, A3
--mm-enable-dp-encoder | False | bool flag (set to enable) | A2, A3
--limit-mm-data-per-request | None | Type: JSON / Dict | A2, A3

For checkpoint decryption

Argument | Defaults | Options | Server supported
--decrypted-config-file | None | Type: str | A2, A3
--decrypted-draft-config-file | None | Type: str | A2, A3
--enable-prefix-mm-cache | False | bool flag (set to enable) | A2, A3

Forward hooks

Argument | Defaults | Options | Server supported
--forward-hooks | None | Type: JSON list | A2, A3

Configuration file support

Argument | Defaults | Options | Server supported
--config | None | Type: str | A2, A3

Other Params

The following parameters are not supported because the third-party components they depend on, such as KTransformers and checkpoint-engine, are not compatible with the NPU.
Argument | Defaults | Options
--checkpoint-engine-wait-weights-before-ready | False | bool flag (set to enable)
--kt-weight-path | None | Type: str
--kt-method | AMXINT4 | Type: str
--kt-cpuinfer | None | Type: int
--kt-threadpool-count | 2 | Type: int
--kt-num-gpu-experts | None | Type: int
--kt-max-deferred-experts-per-token | None | Type: int
The following parameters currently have functional deficiencies in the community implementation.
Argument | Defaults | Options
--tool-server | None | Type: str