Support Features on Ascend NPU#

This section describes the basic functions and features supported on the Ascend NPU. If you encounter issues or have any questions, please open an issue.

For the meaning and usage of each parameter, see Server Arguments.

Model and tokenizer#

Argument	Defaults	Options	Server supported
`--model-path` `--model`	`None`	Type: str	A2, A3
`--tokenizer-path`	`None`	Type: str	A2, A3
`--tokenizer-mode`	`auto`	`auto`, `slow`	A2, A3
`--tokenizer-worker-num`	`1`	Type: int	A2, A3
`--skip-tokenizer-init`	`False`	bool flag (set to enable)	A2, A3
`--load-format`	`auto`	`auto`, `safetensors`	A2, A3
`--model-loader-` `extra-config`	`{}`	Type: str	A2, A3
`--trust-remote-code`	`False`	bool flag (set to enable)	A2, A3
`--context-length`	`None`	Type: int	A2, A3
`--is-embedding`	`False`	bool flag (set to enable)	A2, A3
`--enable-multimodal`	`None`	bool flag (set to enable)	A2, A3
`--revision`	`None`	Type: str	A2, A3
`--model-impl`	`auto`	`auto`, `sglang`, `transformers`	A2, A3
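
As a rough illustration of how the model and tokenizer arguments above fit together, the sketch below assembles a launch command from a few of them. The `python -m sglang.launch_server` entry point, the helper name `build_model_args`, and the model path are assumptions made for the example and do not come from this page.

```python
import shlex

# Assumed launcher entry point; adjust to match your installation.
LAUNCHER = ["python", "-m", "sglang.launch_server"]


def build_model_args(model_path, tokenizer_mode="auto",
                     context_length=None, trust_remote_code=False):
    """Build the model/tokenizer part of the command line (hypothetical helper)."""
    # The table above lists only `auto` and `slow` for --tokenizer-mode.
    if tokenizer_mode not in ("auto", "slow"):
        raise ValueError(f"unsupported tokenizer mode: {tokenizer_mode!r}")
    args = ["--model-path", model_path, "--tokenizer-mode", tokenizer_mode]
    if context_length is not None:
        args += ["--context-length", str(context_length)]
    if trust_remote_code:
        # Boolean flags are enabled simply by being present.
        args.append("--trust-remote-code")
    return args


# "/path/to/model" is a placeholder, not a real checkpoint location.
cmd = LAUNCHER + build_model_args("/path/to/model",
                                  context_length=8192,
                                  trust_remote_code=True)
print(shlex.join(cmd))
```

Building the command as a list and joining it with `shlex.join` keeps paths with spaces correctly quoted when the command is printed or logged.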

HTTP server#

Argument	Defaults	Options	Server supported
`--host`	`127.0.0.1`	Type: str	A2, A3
`--port`	`30000`	Type: int	A2, A3
`--skip-server-warmup`	`False`	bool flag (set to enable)	A2, A3
`--warmups`	`None`	Type: str	A2, A3
`--nccl-port`	`None`	Type: int	A2, A3
`--fastapi-root-path`	`None`	Type: str	A2, A3
`--grpc-mode`	`False`	bool flag (set to enable)	A2, A3

Quantization and data type#

Argument	Defaults	Options	Server supported
`--dtype`	`auto`	`auto`, `float16`, `bfloat16`	A2, A3
`--quantization`	`None`	`modelslim`	A2, A3
`--quantization-param-path`	`None`	Type: str	Special for GPU
`--kv-cache-dtype`	`auto`	`auto`	A2, A3
`--enable-fp32-lm-head`	`False`	bool flag (set to enable)	A2, A3
`--modelopt-quant`	`None`	Type: str	Special for GPU
`--modelopt-checkpoint-` `restore-path`	`None`	Type: str	Special for GPU
`--modelopt-checkpoint-` `save-path`	`None`	Type: str	Special for GPU
`--modelopt-export-path`	`None`	Type: str	Special for GPU
`--quantize-and-serve`	`False`	bool flag (set to enable)	Special for GPU
`--rl-quant-profile`	`None`	Type: str	Special for GPU
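
Because the Options column above is narrower on Ascend than the general server arguments, it can help to check values before launching. The following is a minimal sketch of such a check. The allowed values are copied from the table, while the helper name `check_choices` and the sample value `fp8` are illustrative assumptions.

```python
# Allowed values copied from the "Options" column of the table above.
ASCEND_CHOICES = {
    "--dtype": {"auto", "float16", "bfloat16"},
    "--quantization": {"modelslim"},
    "--kv-cache-dtype": {"auto"},
}


def check_choices(options):
    """Return one message per option whose value the table does not list."""
    problems = []
    for flag, value in options.items():
        allowed = ASCEND_CHOICES.get(flag)
        # Flags without a fixed choice list are not checked by this sketch.
        if allowed is not None and value not in allowed:
            problems.append(
                f"{flag}={value!r} is not listed for Ascend; "
                f"expected one of {sorted(allowed)}"
            )
    return problems


# `fp8` is only a sample value that the table does not list.
for message in check_choices({"--dtype": "bfloat16", "--quantization": "fp8"}):
    print(message)
```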

Memory and scheduling#

Argument	Defaults	Options	Server supported
`--mem-fraction-static`	`None`	Type: float	A2, A3
`--max-running-requests`	`None`	Type: int	A2, A3
`--prefill-max-requests`	`None`	Type: int	A2, A3
`--max-queued-requests`	`None`	Type: int	A2, A3
`--max-total-tokens`	`None`	Type: int	A2, A3
`--chunked-prefill-size`	`None`	Type: int	A2, A3
`--max-prefill-tokens`	`16384`	Type: int	A2, A3
`--schedule-policy`	`fcfs`	`lpm`, `fcfs`	A2, A3
`--enable-priority-` `scheduling`	`False`	bool flag (set to enable)	A2, A3
`--schedule-low-priority-` `values-first`	`False`	bool flag (set to enable)	A2, A3
`--priority-scheduling-` `preemption-threshold`	`10`	Type: int	A2, A3
`--schedule-conservativeness`	`1.0`	Type: float	A2, A3
`--page-size`	`128`	Type: int	A2, A3
`--swa-full-tokens-ratio`	`0.8`	Type: float	A2, A3
`--disable-hybrid-swa-memory`	`False`	bool flag (set to enable)	A2, A3
`--abort-on-priority-` `when-disabled`	`False`	bool flag (set to enable)	A2, A3
`--enable-dynamic-chunking`	`False`	bool flag (set to enable)	A2, A3

Runtime options#

Argument	Defaults	Options	Server supported
`--device`	`None`	Type: str	A2, A3
`--tensor-parallel-size` `--tp-size`	`1`	Type: int	A2, A3
`--pipeline-parallel-size` `--pp-size`	`1`	Type: int	A2, A3
`--pp-max-micro-batch-size`	`None`	Type: int	A2, A3
`--pp-async-batch-depth`	`None`	Type: int	A2, A3
`--stream-interval`	`1`	Type: int	A2, A3
`--stream-output`	`False`	bool flag (set to enable)	A2, A3
`--random-seed`	`None`	Type: int	A2, A3
`--constrained-json-` `whitespace-pattern`	`None`	Type: str	A2, A3
`--constrained-json-` `disable-any-whitespace`	`False`	bool flag (set to enable)	A2, A3
`--watchdog-timeout`	`300`	Type: float	A2, A3
`--soft-watchdog-timeout`	`300`	Type: float	A2, A3
`--dist-timeout`	`None`	Type: int	A2, A3
`--base-gpu-id`	`0`	Type: int	A2, A3
`--gpu-id-step`	`1`	Type: int	A2, A3
`--sleep-on-idle`	`False`	bool flag (set to enable)	A2, A3
`--custom-sigquit-handler`	`None`	Optional[Callable]	A2, A3

Logging#

Argument	Defaults	Options	Server supported
`--log-level`	`info`	Type: str	A2, A3
`--log-level-http`	`None`	Type: str	A2, A3
`--log-requests`	`False`	bool flag (set to enable)	A2, A3
`--log-requests-level`	`2`	`0`, `1`, `2`, `3`	A2, A3
`--log-requests-format`	`text`	`text`, `json`	A2, A3
`--crash-dump-folder`	`None`	Type: str	A2, A3
`--enable-metrics`	`False`	bool flag (set to enable)	A2, A3
`--enable-metrics-for-` `all-schedulers`	`False`	bool flag (set to enable)	A2, A3
`--tokenizer-metrics-` `custom-labels-header`	`x-custom-labels`	Type: str	A2, A3
`--tokenizer-metrics-` `allowed-custom-labels`	`None`	List[str]	A2, A3
`--bucket-time-to-` `first-token`	`None`	List[float]	A2, A3
`--bucket-inter-token-` `latency`	`None`	List[float]	A2, A3
`--bucket-e2e-request-` `latency`	`None`	List[float]	A2, A3
`--collect-tokens-` `histogram`	`False`	bool flag (set to enable)	A2, A3
`--prompt-tokens-buckets`	`None`	List[str]	A2, A3
`--generation-tokens-buckets`	`None`	List[str]	A2, A3
`--gc-warning-threshold-secs`	`0.0`	Type: float	A2, A3
`--decode-log-interval`	`40`	Type: int	A2, A3
`--enable-request-time-` `stats-logging`	`False`	bool flag (set to enable)	A2, A3
`--kv-events-config`	`None`	Type: str	Special for GPU
`--enable-trace`	`False`	bool flag (set to enable)	A2, A3
`--oltp-traces-endpoint`	`localhost:4317`	Type: str	A2, A3

RequestMetricsExporter configuration#

Argument	Defaults	Options	Server supported
`--export-metrics-to-` `file`	`False`	bool flag (set to enable)	A2, A3
`--export-metrics-to-` `file-dir`	`None`	Type: str	A2, A3

API related#

Argument	Defaults	Options	Server supported
`--api-key`	`None`	Type: str	A2, A3
`--admin-api-key`	`None`	Type: str	A2, A3
`--served-model-name`	`None`	Type: str	A2, A3
`--weight-version`	`default`	Type: str	A2, A3
`--chat-template`	`None`	Type: str	A2, A3
`--completion-template`	`None`	Type: str	A2, A3
`--enable-cache-report`	`False`	bool flag (set to enable)	A2, A3
`--reasoning-parser`	`None`	`deepseek-r1`	A2, A3
`--tool-call-parser`	`None`	`llama`, `pythonic`	A2, A3
`--sampling-defaults`	`model`	`openai`, `model`	A2, A3

Data parallelism#

Argument	Defaults	Options	Server supported
`--data-parallel-size` `--dp-size`	`1`	Type: int	A2, A3
`--load-balance-method`	`round_robin`	`round_robin`, `total_requests`, `total_tokens`	A2, A3
`--prefill-round-robin-balance`	`False`	bool flag (set to enable)	A2, A3

Multi-node distributed serving#

Argument	Defaults	Options	Server supported
`--dist-init-addr` `--nccl-init-addr`	`None`	Type: str	A2, A3
`--nnodes`	`1`	Type: int	A2, A3
`--node-rank`	`0`	Type: int	A2, A3

Model override args#

Argument	Defaults	Options	Server supported
`--json-model-override-` `args`	`{}`	Type: str	A2, A3
`--preferred-sampling-` `params`	`None`	Type: str	A2, A3

LoRA#

Argument	Defaults	Options	Server supported
`--enable-lora`	`False`	bool flag (set to enable)	A2, A3
`--max-lora-rank`	`None`	Type: int	A2, A3
`--lora-target-modules`	`None`	`all`	A2, A3
`--lora-paths`	`None`	Type: List[str] / JSON objects	A2, A3
`--max-loras-per-batch`	`8`	Type: int	A2, A3
`--max-loaded-loras`	`None`	Type: int	A2, A3
`--lora-eviction-policy`	`lru`	`lru`, `fifo`	A2, A3
`--lora-backend`	`triton`	`triton`	A2, A3
`--max-lora-chunk-size`	`16`	`16`, `32`, `64`, `128`	Special for GPU

Kernel Backends (Attention, Sampling, Grammar, GEMM)#

Argument	Defaults	Options	Server supported
`--attention-backend`	`None`	`ascend`	A2, A3
`--prefill-attention-backend`	`None`	`ascend`	A2, A3
`--decode-attention-backend`	`None`	`ascend`	A2, A3
`--sampling-backend`	`None`	`pytorch`, `ascend`	A2, A3
`--grammar-backend`	`None`	`xgrammar`	A2, A3
`--mm-attention-backend`	`None`	`ascend_attn`	A2, A3
`--nsa-prefill-backend`	`flashmla_sparse`	`flashmla_sparse`, `flashmla_decode`, `fa3`, `tilelang`, `aiter`	Special for GPU
`--nsa-decode-backend`	`fa3`	`flashmla_prefill`, `flashmla_kv`, `fa3`, `tilelang`, `aiter`	Special for GPU
`--fp8-gemm-backend`	`auto`	`auto`, `deep_gemm`, `flashinfer_trtllm`, `cutlass`, `triton`, `aiter`	Special for GPU
`--disable-flashinfer-` `autotune`	`False`	bool flag (set to enable)	Special for GPU
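
When porting a GPU launch script to Ascend, one practical step is to spot arguments that the table marks as "Special for GPU". The sketch below scans an argument list for the GPU-only flags of this section. The helper name `find_gpu_only_flags` and the sample argument list are assumptions for illustration, and the flag set covers this table only.

```python
# Flags marked "Special for GPU" in the kernel backends table above.
GPU_ONLY_FLAGS = {
    "--nsa-prefill-backend",
    "--nsa-decode-backend",
    "--fp8-gemm-backend",
    "--disable-flashinfer-autotune",
}


def find_gpu_only_flags(argv):
    """Return GPU-only flags found in argv, in order of first appearance."""
    found = []
    for token in argv:
        # Accept both "--flag value" and "--flag=value" spellings.
        flag = token.split("=", 1)[0]
        if flag in GPU_ONLY_FLAGS and flag not in found:
            found.append(flag)
    return found


# Sample arguments, as they might appear in a script written for GPU.
argv = [
    "--attention-backend", "ascend",
    "--fp8-gemm-backend=triton",
    "--sampling-backend", "ascend",
]
print(find_gpu_only_flags(argv))  # ['--fp8-gemm-backend']
```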

Speculative decoding#

Argument	Defaults	Options	Server supported
`--speculative-algorithm`	`None`	`EAGLE3`, `NEXTN`	A2, A3
`--speculative-draft-model-path` `--speculative-draft-model`	`None`	Type: str	A2, A3
`--speculative-draft-model-` `revision`	`None`	Type: str	A2, A3
`--speculative-draft-load-format`	`None`	`auto`	A2, A3
`--speculative-num-steps`	`None`	Type: int	A2, A3
`--speculative-eagle-topk`	`None`	Type: int	A2, A3
`--speculative-num-draft-tokens`	`None`	Type: int	A2, A3
`--speculative-accept-` `threshold-single`	`1.0`	Type: float	Special for GPU
`--speculative-accept-` `threshold-acc`	`1.0`	Type: float	Special for GPU
`--speculative-token-map`	`None`	Type: str	A2, A3
`--speculative-attention-` `mode`	`prefill`	`prefill`, `decode`	A2, A3
`--speculative-moe-runner-` `backend`	`None`	`auto`	A2, A3
`--speculative-moe-a2a-` `backend`	`None`	`ascend_fuseep`	A2, A3
`--speculative-draft-attention-backend`	`None`	`ascend`	A2, A3
`--speculative-draft-model-quantization`	`None`	`unquant`	A2, A3

Ngram speculative decoding#

Argument	Defaults	Options	Server supported
`--speculative-ngram-` `min-match-window-size`	`1`	Type: int	Experimental
`--speculative-ngram-` `max-match-window-size`	`12`	Type: int	Experimental
`--speculative-ngram-` `min-bfs-breadth`	`1`	Type: int	Experimental
`--speculative-ngram-` `max-bfs-breadth`	`10`	Type: int	Experimental
`--speculative-ngram-` `match-type`	`BFS`	`BFS`, `PROB`	Experimental
`--speculative-ngram-` `branch-length`	`18`	Type: int	Experimental
`--speculative-ngram-` `capacity`	`10000000`	Type: int	Experimental

Expert parallelism#

Argument	Defaults	Options	Server supported
`--expert-parallel-size` `--ep-size` `--ep`	`1`	Type: int	A2, A3
`--moe-a2a-backend`	`none`	`none`, `deepep`, `ascend_fuseep`	A2, A3
`--moe-runner-backend`	`auto`	`auto`, `triton`	A2, A3
`--flashinfer-mxfp4-` `moe-precision`	`default`	`default`, `bf16`	Special for GPU
`--enable-flashinfer-` `allreduce-fusion`	`False`	bool flag (set to enable)	Special for GPU
`--deepep-mode`	`auto`	`normal`, `low_latency`, `auto`	A2, A3
`--deepep-config`	`None`	Type: str	Special for GPU
`--ep-num-redundant-experts`	`0`	Type: int	A2, A3
`--ep-dispatch-algorithm`	`None`	Type: str	A2, A3
`--init-expert-location`	`trivial`	Type: str	A2, A3
`--enable-eplb`	`False`	bool flag (set to enable)	A2, A3
`--eplb-algorithm`	`auto`	Type: str	A2, A3
`--eplb-rebalance-layers-` `per-chunk`	`None`	Type: int	A2, A3
`--eplb-min-rebalancing-` `utilization-threshold`	`1.0`	Type: float	A2, A3
`--expert-distribution-` `recorder-mode`	`None`	Type: str	A2, A3
`--expert-distribution-` `recorder-buffer-size`	`None`	Type: int	A2, A3
`--enable-expert-distribution-` `metrics`	`False`	bool flag (set to enable)	A2, A3
`--moe-dense-tp-size`	`None`	Type: int	A2, A3
`--elastic-ep-backend`	`None`	`none`, `mooncake`	Special for GPU
`--mooncake-ib-device`	`None`	Type: str	Special for GPU

Mamba Cache#

Argument	Defaults	Options	Server supported
`--max-mamba-cache-size`	`None`	Type: int	A2, A3
`--mamba-ssm-dtype`	`float32`	`float32`, `bfloat16`	A2, A3
`--mamba-full-memory-ratio`	`0.2`	Type: float	A2, A3
`--mamba-scheduler-strategy`	`auto`	`auto`, `no_buffer`, `extra_buffer`	A2, A3
`--mamba-track-interval`	`256`	Type: int	A2, A3

Hierarchical cache#

Argument	Defaults	Options	Server supported
`--enable-hierarchical-` `cache`	`False`	bool flag (set to enable)	A2, A3
`--hicache-ratio`	`2.0`	Type: float	A2, A3
`--hicache-size`	`0`	Type: int	A2, A3
`--hicache-write-policy`	`write_through`	`write_back`, `write_through`, `write_through_selective`	A2, A3
`--radix-eviction-policy`	`lru`	`lru`, `lfu`	A2, A3
`--hicache-io-backend`	`kernel`	`kernel_ascend`, `direct`	A2, A3
`--hicache-mem-layout`	`layer_first`	`page_first_direct`, `page_first_kv_split`	A2, A3
`--hicache-storage-` `backend`	`None`	`file`	A2, A3
`--hicache-storage-` `prefetch-policy`	`best_effort`	`best_effort`, `wait_complete`, `timeout`	Special for GPU
`--hicache-storage-` `backend-extra-config`	`None`	Type: str	Special for GPU

LMCache#

Argument	Defaults	Options	Server supported
`--enable-lmcache`	`False`	bool flag (set to enable)	Special for GPU

Offloading#

Argument	Defaults	Options	Server supported
`--cpu-offload-gb`	`0`	Type: int	A2, A3
`--offload-group-size`	`-1`	Type: int	A2, A3
`--offload-num-in-group`	`1`	Type: int	A2, A3
`--offload-prefetch-step`	`1`	Type: int	A2, A3
`--offload-mode`	`cpu`	Type: str	A2, A3

Args for multi-item scoring#

Argument	Defaults	Options	Server supported
`--multi-item-scoring-delimiter`	`None`	Type: int	A2, A3

Optimization/debug options#

Argument	Defaults	Options	Server supported
`--disable-radix-cache`	`False`	bool flag (set to enable)	A2, A3
`--cuda-graph-max-bs`	`None`	Type: int	A2, A3
`--cuda-graph-bs`	`None`	List[int]	A2, A3
`--disable-cuda-graph`	`False`	bool flag (set to enable)	A2, A3
`--disable-cuda-graph-` `padding`	`False`	bool flag (set to enable)	A2, A3
`--enable-profile-` `cuda-graph`	`False`	bool flag (set to enable)	A2, A3
`--enable-cudagraph-gc`	`False`	bool flag (set to enable)	A2, A3
`--enable-nccl-nvls`	`False`	bool flag (set to enable)	Special for GPU
`--enable-symm-mem`	`False`	bool flag (set to enable)	Special for GPU
`--disable-flashinfer-` `cutlass-moe-fp4-allgather`	`False`	bool flag (set to enable)	Special for GPU
`--enable-tokenizer-` `batch-encode`	`False`	bool flag (set to enable)	A2, A3
`--disable-tokenizer-` `batch-encode`	`False`	bool flag (set to enable)	A2, A3
`--disable-outlines-` `disk-cache`	`False`	bool flag (set to enable)	A2, A3
`--disable-custom-` `all-reduce`	`False`	bool flag (set to enable)	A2, A3
`--enable-mscclpp`	`False`	bool flag (set to enable)	Special for GPU
`--enable-torch-` `symm-mem`	`False`	bool flag (set to enable)	Special for GPU
`--disable-overlap` `-schedule`	`False`	bool flag (set to enable)	A2, A3
`--enable-mixed-` `chunk`	`False`	bool flag (set to enable)	A2, A3
`--enable-dp-attention`	`False`	bool flag (set to enable)	A2, A3
`--enable-dp-lm-head`	`False`	bool flag (set to enable)	A2, A3
`--enable-two-` `batch-overlap`	`False`	bool flag (set to enable)	Planned
`--enable-single-` `batch-overlap`	`False`	bool flag (set to enable)	A2, A3
`--tbo-token-` `distribution-threshold`	`0.48`	Type: float	Planned
`--enable-torch-` `compile`	`False`	bool flag (set to enable)	A2, A3
`--enable-torch-` `compile-debug-mode`	`False`	bool flag (set to enable)	A2, A3
`--enable-piecewise-` `cuda-graph`	`False`	bool flag (set to enable)	A2, A3
`--piecewise-cuda-` `graph-tokens`	`None`	Type: JSON list	A2, A3
`--piecewise-cuda-` `graph-compiler`	`eager`	`eager`, `inductor`	A2, A3
`--torch-compile-max-bs`	`32`	Type: int	A2, A3
`--piecewise-cuda-` `graph-max-tokens`	`4096`	Type: int	A2, A3
`--torchao-config`	`""`	Type: str	Special for GPU
`--enable-nan-detection`	`False`	bool flag (set to enable)	A2, A3
`--enable-p2p-check`	`False`	bool flag (set to enable)	Special for GPU
`--triton-attention-` `reduce-in-fp32`	`False`	bool flag (set to enable)	Special for GPU
`--triton-attention-` `num-kv-splits`	`8`	Type: int	Special for GPU
`--triton-attention-` `split-tile-size`	`None`	Type: int	Special for GPU
`--delete-ckpt-` `after-loading`	`False`	bool flag (set to enable)	A2, A3
`--enable-memory-saver`	`False`	bool flag (set to enable)	A2, A3
`--enable-weights-` `cpu-backup`	`False`	bool flag (set to enable)	A2, A3
`--enable-draft-weights-` `cpu-backup`	`False`	bool flag (set to enable)	A2, A3
`--allow-auto-truncate`	`False`	bool flag (set to enable)	A2, A3
`--enable-custom-` `logit-processor`	`False`	bool flag (set to enable)	A2, A3
`--flashinfer-mla-` `disable-ragged`	`False`	bool flag (set to enable)	Special for GPU
`--disable-shared-` `experts-fusion`	`False`	bool flag (set to enable)	A2, A3
`--disable-chunked-` `prefix-cache`	`False`	bool flag (set to enable)	A2, A3
`--disable-fast-` `image-processor`	`False`	bool flag (set to enable)	A2, A3
`--keep-mm-feature-` `on-device`	`False`	bool flag (set to enable)	A2, A3
`--enable-return-` `hidden-states`	`False`	bool flag (set to enable)	A2, A3
`--enable-return-` `routed-experts`	`False`	bool flag (set to enable)	A2, A3
`--scheduler-recv-` `interval`	`1`	Type: int	A2, A3
`--numa-node`	`None`	List[int]	A2, A3
`--rl-on-policy-target`	`None`	`fsdp`	Planned
`--enable-layerwise-` `nvtx-marker`	`False`	bool flag (set to enable)	Special for GPU
`--enable-attn-tp-` `input-scattered`	`False`	bool flag (set to enable)	Experimental
`--enable-nsa-prefill-` `context-parallel`	`False`	bool flag (set to enable)	A2, A3
`--enable-fused-qk-` `norm-rope`	`False`	bool flag (set to enable)	Special for GPU

Dynamic batch tokenizer#

Argument	Defaults	Options	Server supported
`--enable-dynamic-` `batch-tokenizer`	`False`	bool flag (set to enable)	A2, A3
`--dynamic-batch-` `tokenizer-batch-size`	`32`	Type: int	A2, A3
`--dynamic-batch-` `tokenizer-batch-timeout`	`0.002`	Type: float	A2, A3

Debug tensor dumps#

Argument	Defaults	Options	Server supported
`--debug-tensor-dump-` `output-folder`	`None`	Type: str	A2, A3
`--debug-tensor-dump-` `layers`	`None`	List[int]	A2, A3
`--debug-tensor-dump-` `input-file`	`None`	Type: str	A2, A3

PD disaggregation#

Argument	Defaults	Options	Server supported
`--disaggregation-mode`	`null`	`null`, `prefill`, `decode`	A2, A3
`--disaggregation-transfer-backend`	`mooncake`	`ascend`	A2, A3
`--disaggregation-bootstrap-port`	`8998`	Type: int	A2, A3
`--disaggregation-decode-tp`	`None`	Type: int	A2, A3
`--disaggregation-decode-dp`	`None`	Type: int	A2, A3
`--disaggregation-ib-device`	`None`	Type: str	Special for GPU
`--disaggregation-decode-` `enable-offload-kvcache`	`False`	bool flag (set to enable)	A2, A3
`--disaggregation-decode-` `enable-fake-auto`	`False`	bool flag (set to enable)	A2, A3
`--num-reserved-decode-tokens`	`512`	Type: int	A2, A3
`--disaggregation-decode-` `polling-interval`	`1`	Type: int	A2, A3
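
The table above lists `mooncake` as the default transfer backend but `ascend` as the only option on Ascend, which suggests passing the backend explicitly. The sketch below builds the disaggregation arguments for a prefill and a decode instance under that assumption. The helper name `disaggregation_args` is hypothetical, and whether a given instance also needs the bootstrap port depends on your deployment.

```python
def disaggregation_args(mode, bootstrap_port=None):
    """Build PD disaggregation arguments for one instance (illustrative only)."""
    # Modes are taken from the Options column of the table above.
    if mode not in ("null", "prefill", "decode"):
        raise ValueError(f"unknown disaggregation mode: {mode!r}")
    args = ["--disaggregation-mode", mode]
    if mode != "null":
        # Assumption: set the backend explicitly, since the listed default
        # (`mooncake`) is not among the options shown for Ascend.
        args += ["--disaggregation-transfer-backend", "ascend"]
        if bootstrap_port is not None:
            args += ["--disaggregation-bootstrap-port", str(bootstrap_port)]
    return args


prefill_args = disaggregation_args("prefill", bootstrap_port=8998)
decode_args = disaggregation_args("decode")
print(prefill_args)
print(decode_args)
```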

Encode prefill disaggregation#

Argument	Defaults	Options	Server supported
`--encoder-only`	`False`	bool flag (set to enable)	A2, A3
`--language-only`	`False`	bool flag (set to enable)	A2, A3
`--encoder-transfer-backend`	`zmq_to_scheduler`	`zmq_to_scheduler`, `zmq_to_tokenizer`, `mooncake`	A2, A3
`--encoder-urls`	`[]`	List[str]	A2, A3

Custom weight loader#

Argument	Defaults	Options	Server supported
`--custom-weight-loader`	`None`	List[str]	A2, A3
`--weight-loader-disable-` `mmap`	`False`	bool flag (set to enable)	A2, A3
`--remote-instance-weight-` `loader-seed-instance-ip`	`None`	Type: str	A2, A3
`--remote-instance-weight-` `loader-seed-instance-service-port`	`None`	Type: int	A2, A3
`--remote-instance-weight-` `loader-send-weights-group-ports`	`None`	Type: JSON list	A2, A3
`--remote-instance-weight-` `loader-backend`	`nccl`	`transfer_engine`, `nccl`	A2, A3
`--remote-instance-weight-` `loader-start-seed-via-transfer-engine`	`False`	bool flag (set to enable)	Special for GPU

For PD-Multiplexing#

Argument	Defaults	Options	Server supported
`--enable-pdmux`	`False`	bool flag (set to enable)	Special for GPU
`--pdmux-config-path`	`None`	Type: str	Special for GPU
`--sm-group-num`	`8`	Type: int	Special for GPU

For Multi-Modal#

Argument	Defaults	Options	Server supported
`--mm-max-concurrent-calls`	`32`	Type: int	A2, A3
`--mm-per-request-timeout`	`10.0`	Type: float	A2, A3
`--enable-broadcast-mm-` `inputs-process`	`False`	bool flag (set to enable)	A2, A3
`--mm-process-config`	`None`	Type: JSON / Dict	A2, A3
`--mm-enable-dp-encoder`	`False`	bool flag (set to enable)	A2, A3
`--limit-mm-data-per-request`	`None`	Type: JSON / Dict	A2, A3

For checkpoint decryption#

Argument	Defaults	Options	Server supported
`--decrypted-config-file`	`None`	Type: str	A2, A3
`--decrypted-draft-config-file`	`None`	Type: str	A2, A3
`--enable-prefix-mm-cache`	`False`	bool flag (set to enable)	A2, A3

For deterministic inference#

Argument	Defaults	Options	Server supported
`--enable-deterministic-` `inference`	`False`	bool flag (set to enable)	Planned

For registering hooks#

Argument	Defaults	Options	Server supported
`--forward-hooks`	`None`	Type: JSON list	A2, A3

Configuration file support#

Argument	Defaults	Options	Server supported
`--config`	`None`	Type: str	A2, A3

Other Params#

The following parameters are not supported because the third-party components they depend on, such as KTransformers and checkpoint-engine, are not compatible with the NPU.

Argument	Defaults	Options
`--checkpoint-engine-` `wait-weights-` `before-ready`	`False`	bool flag (set to enable)
`--kt-weight-path`	`None`	Type: str
`--kt-method`	`AMXINT4`	Type: str
`--kt-cpuinfer`	`None`	Type: int
`--kt-threadpool-count`	`2`	Type: int
`--kt-num-gpu-experts`	`None`	Type: int
`--kt-max-deferred-` `experts-per-token`	`None`	Type: int

The following parameters have known functional deficiencies in the community implementation.

Argument	Defaults	Options
`--enable-double-sparsity`	`False`	bool flag (set to enable)
`--ds-channel-config-path`	`None`	Type: str
`--ds-heavy-channel-num`	`32`	Type: int
`--ds-heavy-token-num`	`256`	Type: int
`--ds-heavy-channel-type`	`qk`	Type: str
`--ds-sparse-decode-` `threshold`	`4096`	Type: int
`--tool-server`	`None`	Type: str
