Support Features on Ascend NPU - SGLang Documentation

This section describes the basic functions and features supported by the Ascend NPU. If you encounter issues or have questions, please open an issue. For the meaning and usage of each parameter, see Server Arguments.

Model and tokenizer

Argument	Defaults	Options	Server supported
`--model-path` `--model`	`None`	Type: str	A2, A3
`--tokenizer-path`	`None`	Type: str	A2, A3
`--tokenizer-mode`	`auto`	`auto`, `slow`	A2, A3
`--tokenizer-worker-num`	`1`	Type: int	A2, A3
`--skip-tokenizer-init`	`False`	bool flag (set to enable)	A2, A3
`--load-format`	`auto`	`auto`, `safetensors`, `gguf`	A2, A3
`--model-loader-extra-config`	`{}`	Type: str	A2, A3
`--trust-remote-code`	`False`	bool flag (set to enable)	A2, A3
`--context-length`	`None`	Type: int	A2, A3
`--is-embedding`	`False`	bool flag (set to enable)	A2, A3
`--enable-multimodal`	`None`	bool flag (set to enable)	A2, A3
`--revision`	`None`	Type: str	A2, A3
`--model-impl`	`auto`	`auto`, `sglang`, `transformers`	A2, A3
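
As a rough illustration of how the flags in this table combine, the sketch below assembles a launch command as an argument list. It assumes the standard `python -m sglang.launch_server` entry point; the helper name `build_launch_command` and the model path are hypothetical and not part of SGLang.

```python
import shlex


def build_launch_command(model_path, tokenizer_mode="auto",
                         context_length=None, trust_remote_code=False):
    """Return an argv list built from flags listed in the table above."""
    argv = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tokenizer-mode", tokenizer_mode,
    ]
    # Valued flags are only emitted when the caller overrides the default.
    if context_length is not None:
        argv += ["--context-length", str(context_length)]
    # Boolean flags take no value: their presence enables them.
    if trust_remote_code:
        argv.append("--trust-remote-code")
    return argv


# Hypothetical model path, used only for illustration.
cmd = build_launch_command("/models/example-model",
                           context_length=8192, trust_remote_code=True)
print(shlex.join(cmd))
```

Keeping the command as a list and joining it only for display avoids quoting mistakes when a path contains spaces.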

HTTP server

Argument	Defaults	Options	Server supported
`--host`	`127.0.0.1`	Type: str	A2, A3
`--port`	`30000`	Type: int	A2, A3
`--skip-server-warmup`	`False`	bool flag (set to enable)	A2, A3
`--warmups`	`None`	Type: str	A2, A3
`--nccl-port`	`None`	Type: int	A2, A3
`--fastapi-root-path`	`None`	Type: str	A2, A3
`--grpc-mode`	`False`	`False`	Planned

SSL/TLS

Argument	Defaults	Options	Server supported
`--ssl-keyfile`	`None`	Type: str	A2, A3
`--ssl-certfile`	`None`	Type: str	A2, A3
`--ssl-keyfile-password`	`None`	Type: str	A2, A3
`--enable-ssl-refresh`	`False`	bool flag (set to enable)	A2, A3
`--enable-http2`	`False`	bool flag (set to enable)	A2, A3

Quantization and data type

Argument	Defaults	Options	Server supported
`--dtype`	`auto`	`auto`, `float16`, `bfloat16`	A2, A3
`--quantization`	`None`	`modelslim`	A2, A3
`--quantization-param-path`	`None`	Type: str	Special for GPU
`--kv-cache-dtype`	`auto`	`auto`	A2, A3
`--enable-fp32-lm-head`	`False`	bool flag (set to enable)	A2, A3
`--modelopt-quant`	`None`	Type: str	Special for GPU
`--modelopt-checkpoint-restore-path`	`None`	Type: str	Special for GPU
`--modelopt-checkpoint-save-path`	`None`	Type: str	Special for GPU
`--modelopt-export-path`	`None`	Type: str	Special for GPU
`--quantize-and-serve`	`False`	bool flag (set to enable)	Special for GPU
`--rl-quant-profile`	`None`	Type: str	Special for GPU

Memory and scheduling

Argument	Defaults	Options	Server supported
`--mem-fraction-static`	`None`	Type: float	A2, A3
`--max-running-requests`	`None`	Type: int	A2, A3
`--prefill-max-requests`	`None`	Type: int	A2, A3
`--max-queued-requests`	`None`	Type: int	A2, A3
`--max-total-tokens`	`None`	Type: int	A2, A3
`--chunked-prefill-size`	`None`	Type: int	A2, A3
`--max-prefill-tokens`	`16384`	Type: int	A2, A3
`--schedule-policy`	`fcfs`	`lpm`, `fcfs`	A2, A3
`--enable-priority-scheduling`	`False`	bool flag (set to enable)	A2, A3
`--disable-priority-preemption`	`False`	bool flag (set to enable)	A2, A3
`--default-priority-value`	`None`	Type: int	A2, A3
`--schedule-low-priority-values-first`	`False`	bool flag (set to enable)	A2, A3
`--priority-scheduling-preemption-threshold`	`10`	Type: int	A2, A3
`--schedule-conservativeness`	`1.0`	Type: float	A2, A3
`--page-size`	`128`	Type: int	A2, A3
`--swa-full-tokens-ratio`	`0.8`	Type: float	Planned
`--disable-hybrid-swa-memory`	`False`	bool flag (set to enable)	Planned
`--radix-eviction-policy`	`lru`	`lru`, `lfu`	A2, A3
`--enable-prefill-delayer`	`False`	bool flag (set to enable)	A2, A3
`--prefill-delayer-max-delay-passes`	`30`	Type: int	A2, A3
`--prefill-delayer-token-usage-low-watermark`	`None`	Type: float	A2, A3
`--prefill-delayer-forward-passes-buckets`	`None`	List[float]	A2, A3
`--prefill-delayer-wait-seconds-buckets`	`None`	List[float]	A2, A3
`--abort-on-priority-when-disabled`	`False`	bool flag (set to enable)	A2, A3
`--enable-dynamic-chunking`	`False`	bool flag (set to enable)	Experimental

Runtime options

Argument	Defaults	Options	Server supported
`--device`	`None`	Type: str	A2, A3
`--tensor-parallel-size` `--tp-size`	`1`	Type: int	A2, A3
`--pipeline-parallel-size` `--pp-size`	`1`	Type: int; Currently `2` not supported	Experimental
`--attention-context-parallel-size` `--attn-cp-size`	`1`	Type: int; must be equal to `--tp-size`	A2, A3
`--moe-data-parallel-size` `--moe-dp-size`	`1`	Type: int	Planned
`--pp-max-micro-batch-size`	`None`	Type: int	Experimental
`--pp-async-batch-depth`	`None`	Type: int	Experimental
`--stream-interval`	`1`	Type: int	A2, A3
`--incremental-streaming-output`	`False`	bool flag (set to enable)	A2, A3
`--stream-response-default-include-usage`	`False`	bool flag (set to enable)	A2, A3
`--enable-streaming-session`	`False`	bool flag (set to enable)	A2, A3
`--random-seed`	`None`	Type: int	A2, A3
`--constrained-json-whitespace-pattern`	`None`	Type: str	A2, A3
`--constrained-json-disable-any-whitespace`	`False`	bool flag (set to enable)	A2, A3
`--watchdog-timeout`	`300`	Type: float	A2, A3
`--soft-watchdog-timeout`	`300`	Type: float	A2, A3
`--dist-timeout`	`None`	Type: int	A2, A3
`--download-dir`	`None`	Type: str	A2, A3
`--model-checksum`	`None`	Type: str	Planned
`--base-gpu-id`	`0`	Type: int	A2, A3
`--gpu-id-step`	`1`	Type: int	A2, A3
`--sleep-on-idle`	`False`	bool flag (set to enable)	A2, A3
`--use-ray`	`False`	bool flag (set to enable)	A2, A3
`--custom-sigquit-handler`	`None`	Only for engine	A2, A3

Logging

Argument	Defaults	Options	Server supported
`--log-level`	`info`	Type: str	A2, A3
`--log-level-http`	`None`	Type: str	A2, A3
`--log-requests`	`False`	bool flag (set to enable)	A2, A3
`--log-requests-level`	`2`	`0`, `1`, `2`, `3`	A2, A3
`--log-requests-format`	`text`	`text`, `json`	A2, A3
`--crash-dump-folder`	`None`	Type: str	A2, A3
`--enable-metrics`	`False`	bool flag (set to enable)	A2, A3
`--enable-mfu-metrics`	`False`	bool flag (set to enable)	A2, A3
`--enable-metrics-for-all-schedulers`	`False`	bool flag (set to enable)	A2, A3
`--tokenizer-metrics-custom-labels-header`	`x-custom-labels`	Type: str	A2, A3
`--tokenizer-metrics-allowed-custom-labels`	`None`	List[str]	A2, A3
`--extra-metric-labels`	`None`	Type: JSON/Dict	A2, A3
`--bucket-time-to-first-token`	`None`	List[float]	A2, A3
`--bucket-inter-token-latency`	`None`	List[float]	A2, A3
`--bucket-e2e-request-latency`	`None`	List[float]	A2, A3
`--collect-tokens-histogram`	`False`	bool flag (set to enable)	A2, A3
`--prompt-tokens-buckets`	`None`	List[str]	A2, A3
`--generation-tokens-buckets`	`None`	List[str]	A2, A3
`--gc-warning-threshold-secs`	`0.0`	Type: float	A2, A3
`--decode-log-interval`	`40`	Type: int	A2, A3
`--enable-request-time-stats-logging`	`False`	bool flag (set to enable)	A2, A3
`--kv-events-config`	`None`	Type: str	Special for GPU
`--enable-trace`	`False`	bool flag (set to enable)	A2, A3
`--oltp-traces-endpoint`	`localhost:4317`	Type: str	A2, A3
`--log-requests-target`	`None`	Type: str	A2, A3
`--uvicorn-access-log-exclude-prefixes`	`[]`	List[str]	A2, A3

RequestMetricsExporter configuration

Argument	Defaults	Options	Server supported
`--export-metrics-to-file`	`False`	bool flag (set to enable)	A2, A3
`--export-metrics-to-file-dir`	`None`	Type: str	A2, A3

API related

Argument	Defaults	Options	Server supported
`--api-key`	`None`	Type: str	A2, A3
`--admin-api-key`	`None`	Type: str	A2, A3
`--served-model-name`	`None`	Type: str	A2, A3
`--weight-version`	`default`	Type: str	A2, A3
`--chat-template`	`None`	Type: str	A2, A3
`--hf-chat-template-name`	`None`	Type: str	A2, A3
`--completion-template`	`None`	Type: str	A2, A3
`--file-storage-path`	`sglang_storage`	Type: str	Unused reserved parameter
`--enable-cache-report`	`False`	bool flag (set to enable)	A2, A3
`--reasoning-parser`	`None`	`deepseek-r1`, `deepseek-v3`, `glm45`, `gpt-oss`, `kimi`, `qwen3`, `qwen3-thinking`, `step3`	A2, A3
`--tool-call-parser`	`None`	`llama3`, `pythonic`, `qwen`, `qwen3_coder`	A2, A3
`--sampling-defaults`	`model`	`openai`, `model`	A2, A3

Data parallelism

Argument	Defaults	Options	Server supported
`--data-parallel-size` `--dp-size`	`1`	Type: int	A2, A3
`--load-balance-method`	`auto`	`auto`, `round_robin`, `follow_bootstrap_room`, `total_requests`, `total_tokens`	A2, A3

Multi-node distributed serving

Argument	Defaults	Options	Server supported
`--dist-init-addr` `--nccl-init-addr`	`None`	Type: str	A2, A3
`--nnodes`	`1`	Type: int	A2, A3
`--node-rank`	`0`	Type: int	A2, A3
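
To make the relationship between these three flags concrete, the sketch below generates one launch command per node: every node shares the same `--dist-init-addr` and `--nnodes`, and only `--node-rank` differs. The helper name `node_commands`, the address, and the model path are hypothetical; the `python -m sglang.launch_server` entry point is assumed.

```python
def node_commands(model_path, init_addr, nnodes, tp_size):
    """Return one argv list per node, differing only in --node-rank."""
    commands = []
    for rank in range(nnodes):
        commands.append([
            "python", "-m", "sglang.launch_server",
            "--model-path", model_path,
            "--tp-size", str(tp_size),
            "--dist-init-addr", init_addr,   # same address on every node
            "--nnodes", str(nnodes),         # total number of nodes
            "--node-rank", str(rank),        # unique per node, starting at 0
        ])
    return commands


# Hypothetical two-node setup, for illustration only.
cmds = node_commands("/models/example-model", "10.0.0.1:5000",
                     nnodes=2, tp_size=16)
for argv in cmds:
    print(" ".join(argv))
```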

Model override args

Argument	Defaults	Options	Server supported
`--json-model-override-args`	`{}`	Type: str	A2, A3
`--preferred-sampling-params`	`None`	Type: str	A2, A3
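
Because `--json-model-override-args` takes a JSON document as a single string, quoting is the usual stumbling block. The sketch below serializes a dictionary and shows the shell-quoted form. The override key is only an illustrative example of a model config field, and the model path is hypothetical.

```python
import json
import shlex

# Illustrative override; which keys are meaningful depends on the model.
overrides = {"max_position_embeddings": 32768}

# The flag's value is the JSON text itself, passed as one argument.
override_arg = json.dumps(overrides)

argv = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "/models/example-model",  # hypothetical path
    "--json-model-override-args", override_arg,
]

# shlex.join adds the quoting a shell needs to keep the JSON intact.
print(shlex.join(argv))
```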

LoRA

Argument	Defaults	Options	Server supported
`--enable-lora`	`False`	bool flag (set to enable)	A2, A3
`--enable-lora-overlap-loading`	`False`	bool flag (set to enable)	A2, A3
`--max-lora-rank`	`None`	Type: int	A2, A3
`--lora-target-modules`	`None`	`all`	A2, A3
`--lora-paths`	`None`	Type: List[str] / JSON objects	A2, A3
`--max-loras-per-batch`	`8`	Type: int	A2, A3
`--max-loaded-loras`	`None`	Type: int	A2, A3
`--lora-eviction-policy`	`lru`	`lru`, `fifo`	A2, A3
`--lora-backend`	`csgmv`	`triton`, `csgmv`, `ascend`, `torch_native`	A2, A3
`--experts-shared-outer-loras`	`None`	Type: bool	A2, A3
`--lora-use-virtual-experts`	`False`	bool flag (set to enable)	A2, A3
`--lora-strict-loading`	`False`	Type: bool	A2, A3
`--max-lora-chunk-size`	`16`	`16`, `32`, `64`, `128`	Special for GPU

Kernel Backends (Attention, Sampling, Grammar, GEMM)

Argument	Defaults	Options	Server supported
`--attention-backend`	`None`	`ascend`	A2, A3
`--prefill-attention-backend`	`None`	`ascend`	A2, A3
`--decode-attention-backend`	`None`	`ascend`	A2, A3
`--sampling-backend`	`None`	`pytorch`, `ascend`	A2, A3
`--grammar-backend`	`None`	`xgrammar`	A2, A3
`--mm-attention-backend`	`None`	`ascend_attn`	A2, A3
`--dsa-prefill-backend`	`flashmla_sparse`	`flashmla_sparse`, `flashmla_decode`, `fa3`, `tilelang`, `aiter`	Special for GPU
`--dsa-decode-backend`	`fa3`	`flashmla_prefill`, `flashmla_kv`, `fa3`, `tilelang`, `aiter`	Special for GPU
`--fp8-gemm-backend`	`auto`	`auto`, `deep_gemm`, `flashinfer_trtllm`, `flashinfer_cutlass`, `flashinfer_deepgemm`, `cutlass`, `triton`, `aiter`	Special for GPU
`--disable-flashinfer-autotune`	`False`	bool flag (set to enable)	Special for GPU

Speculative decoding

Argument	Defaults	Options	Server supported
`--speculative-algorithm`	`None`	`EAGLE3`, `NEXTN`	A2, A3
`--speculative-draft-model-path` `--speculative-draft-model`	`None`	Type: str	A2, A3
`--speculative-draft-model-revision`	`None`	Type: str, `branch name`, `tag name`, `commit id`	A2, A3
`--speculative-draft-load-format`	`auto`	`auto`, `dummy`	A2, A3
`--speculative-num-steps`	`None`	Type: int	A2, A3
`--speculative-eagle-topk`	`None`	Type: int	A2, A3
`--speculative-num-draft-tokens`	`None`	Type: int	A2, A3
`--speculative-accept-threshold-single`	`1.0`	Type: float	Special for GPU
`--speculative-accept-threshold-acc`	`1.0`	Type: float	Special for GPU
`--speculative-token-map`	`None`	Type: str	A2, A3
`--speculative-attention-mode`	`prefill`	`prefill`, `decode`	A2, A3
`--speculative-moe-runner-backend`	`None`	`auto`	A2, A3
`--speculative-moe-a2a-backend`	`None`	`ascend_fuseep` (the only supported value on Ascend NPU)	A2, A3
`--speculative-draft-attention-backend`	`None`	`ascend`	A2, A3
`--speculative-draft-model-quantization`	`None`	`unquant` (the only supported value for speculative decoding on Ascend NPU)	A2, A3
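
The core speculative decoding flags are typically set together. As a minimal sketch, the helper below collects them into an argument fragment and rejects algorithms other than the two listed for Ascend NPU. The function name and the numeric values in the usage line are hypothetical and are not tuning recommendations.

```python
SUPPORTED_ALGORITHMS = ("EAGLE3", "NEXTN")  # values listed in the table above


def speculative_args(draft_model_path, num_steps, eagle_topk,
                     num_draft_tokens, algorithm="EAGLE3"):
    """Return the speculative decoding portion of a launch command."""
    if algorithm not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"unsupported algorithm for Ascend NPU: {algorithm}")
    return [
        "--speculative-algorithm", algorithm,
        "--speculative-draft-model-path", draft_model_path,
        "--speculative-num-steps", str(num_steps),
        "--speculative-eagle-topk", str(eagle_topk),
        "--speculative-num-draft-tokens", str(num_draft_tokens),
    ]


# Hypothetical draft model path and illustrative values.
spec = speculative_args("/models/example-draft", num_steps=3,
                        eagle_topk=1, num_draft_tokens=4)
print(" ".join(spec))
```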

Ngram speculative decoding

Argument	Defaults	Options	Server supported
`--speculative-ngram-min-match-window-size`	`1`	Type: int	Experimental
`--speculative-ngram-max-match-window-size`	`12`	Type: int	Experimental
`--speculative-ngram-min-bfs-breadth`	`1`	Type: int	Experimental
`--speculative-ngram-max-bfs-breadth`	`10`	Type: int	Experimental
`--speculative-ngram-match-type`	`BFS`	`BFS`, `PROB`	Experimental. `BFS` uses recency-based expansion; `PROB` uses frequency-based expansion.
`--speculative-ngram-max-trie-depth`	`18`	Type: int	Experimental
`--speculative-ngram-capacity`	`10000000`	Type: int	Experimental
`--speculative-ngram-external-corpus-path`	`None`	Type: str	Experimental
`--speculative-ngram-external-sam-budget`	`0`	Type: int	Experimental
`--speculative-ngram-external-corpus-max-tokens`	`10000000`	Type: int	Experimental

Expert parallelism

Argument	Defaults	Options	Server supported
`--expert-parallel-size` `--ep-size` `--ep`	`1`	Type: int	A2, A3
`--moe-a2a-backend`	`none`	`none`, `deepep`, `ascend_fuseep` (incompatible with EPLB)	A2, A3
`--moe-runner-backend`	`auto`	`auto`, `triton`	A2, A3
`--flashinfer-mxfp4-moe-precision`	`default`	`default`, `bf16`	Special for GPU
`--enable-flashinfer-allreduce-fusion`	`False`	bool flag (set to enable)	Special for GPU
`--deepep-mode`	`auto`	`normal`, `low_latency`, `auto`	A2, A3
`--deepep-config`	`None`	Type: str	Special for GPU
`--ep-num-redundant-experts`	`0`	Type: int	A2, A3
`--ep-dispatch-algorithm`	`None`	`static`, `dynamic`, `fake`	A2, A3
`--init-expert-location`	`trivial`	`trivial`, `<path.pt>`, `<path.json>`, `<json_string>`	A2, A3
`--enable-eplb`	`False`	bool flag (set to enable)	A2, A3
`--eplb-algorithm`	`deepseek`	`auto`, `deepseek`	A2, A3
`--eplb-rebalance-num-iterations`	`1000`	Type: int	A2, A3
`--eplb-rebalance-layers-per-chunk`	`None`	Type: int	A2, A3
`--eplb-min-rebalancing-utilization-threshold`	`1.0`	Type: float	A2, A3
`--expert-distribution-recorder-mode`	`None`	`stat`, `stat_approx`, `per_pass`, `per_token`	A2, A3
`--expert-distribution-recorder-buffer-size`	`None`	Type: int	A2, A3
`--enable-expert-distribution-metrics`	`False`	bool flag (set to enable)	A2, A3
`--moe-dense-tp-size`	`None`	`1`	A2, A3
`--elastic-ep-backend`	`None`	`none`, `mooncake`	Special for GPU
`--mooncake-ib-device`	`None`	Type: str	Special for GPU
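
The table notes that the `ascend_fuseep` all-to-all backend is incompatible with EPLB. A pre-launch check along the following lines can catch that combination early. This is a sketch only: the function name is hypothetical, and SGLang may perform its own validation.

```python
def check_moe_flags(moe_a2a_backend="none", enable_eplb=False):
    """Raise ValueError for the one conflict documented in the table."""
    if moe_a2a_backend == "ascend_fuseep" and enable_eplb:
        raise ValueError(
            "--moe-a2a-backend ascend_fuseep cannot be combined "
            "with --enable-eplb"
        )
    return True


# deepep with EPLB passes this check; ascend_fuseep with EPLB does not.
ok = check_moe_flags("deepep", enable_eplb=True)
try:
    check_moe_flags("ascend_fuseep", enable_eplb=True)
    conflict_detected = False
except ValueError:
    conflict_detected = True
print(ok, conflict_detected)
```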

Mamba Cache

Argument	Defaults	Options	Server supported
`--max-mamba-cache-size`	`None`	Type: int	A2, A3
`--mamba-ssm-dtype`	`float32`	`float32`, `bfloat16`, `float16`	A2, A3
`--mamba-full-memory-ratio`	`0.9`	Type: float	A2, A3
`--mamba-scheduler-strategy`	`auto`	`auto`, `no_buffer`, `extra_buffer`	A2, A3
`--mamba-track-interval`	`256`	Type: int	A2, A3

Hierarchical cache

Argument	Defaults	Options	Server supported
`--enable-hierarchical-cache`	`False`	bool flag (set to enable). Currently, mamba cache is not supported.	A2, A3
`--hicache-ratio`	`2.0`	Type: float	A2, A3
`--hicache-size`	`0`	Type: int	A2, A3
`--hicache-write-policy`	`write_through`	Currently only `write_back` is supported	A2, A3
`--hicache-io-backend`	`kernel`	`kernel_ascend`, `direct`	A2, A3
`--hicache-mem-layout`	`layer_first`	`page_first_direct`, `page_first_kv_split`	A2, A3
`--hicache-storage-backend`	`None`	`file`	A2, A3
`--hicache-storage-prefetch-policy`	`timeout`	`best_effort`, `wait_complete`, `timeout`	Special for GPU
`--hicache-storage-backend-extra-config`	`None`	Type: str	Special for GPU

LMCache

Argument	Defaults	Options	Server supported
`--enable-lmcache`	`False`	bool flag (set to enable)	Special for GPU
`--lmcache-config-file`	`None`	Type: str	Special for GPU

Diffusion LLM

Argument	Defaults	Options	Server supported
`--dllm-algorithm`	`None`	Type: str	A2, A3
`--dllm-algorithm-config`	`None`	Type: str	A2, A3

Offloading (must be used with `--disable-cuda-graph`)

Argument	Defaults	Options	Server supported
`--cpu-offload-gb`	`0`	Type: int	A2, A3
`--offload-group-size`	`-1`	Type: int (DeepSeek only)	A2, A3
`--offload-num-in-group`	`1`	Type: int (DeepSeek only)	A2, A3
`--offload-prefetch-step`	`1`	Type: int (DeepSeek only)	A2, A3
`--offload-mode`	`cpu`	`cpu` (DeepSeek only), `meta` (DeepSeek only), `sharded_gpu` (DeepSeek only; supports only tp=1 with dp>1)	A2, A3
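
Since offloading must be used with `--disable-cuda-graph`, a small check over the argument list can flag a command that sets an offload option but omits that flag. The sketch below treats the presence of any flag from this table as "offloading in use", which is an assumption; the function name is hypothetical.

```python
# Flags from the offloading table above.
OFFLOAD_FLAGS = (
    "--cpu-offload-gb",
    "--offload-group-size",
    "--offload-num-in-group",
    "--offload-prefetch-step",
    "--offload-mode",
)


def missing_disable_cuda_graph(argv):
    """Return True if an offload flag is set without --disable-cuda-graph."""
    # Handle both "--flag value" and "--flag=value" spellings.
    names = [arg.split("=", 1)[0] for arg in argv]
    uses_offload = any(name in OFFLOAD_FLAGS for name in names)
    return uses_offload and "--disable-cuda-graph" not in names


bad = ["--model-path", "/models/example-model", "--cpu-offload-gb", "8"]
good = bad + ["--disable-cuda-graph"]
print(missing_disable_cuda_graph(bad), missing_disable_cuda_graph(good))
```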

Optimization/debug options

Argument	Defaults	Options	Server supported
`--disable-radix-cache`	`False`	bool flag (set to enable)	A2, A3
`--cuda-graph-max-bs`	`None`	Type: int	A2, A3
`--cuda-graph-bs`	`None`	List[int]	A2, A3
`--disable-cuda-graph`	`False`	bool flag (set to enable)	A2, A3
`--disable-cuda-graph-padding`	`False`	bool flag (set to enable)	A2, A3
`--enable-profile-cuda-graph`	`False`	bool flag (set to enable)	A2, A3
`--enable-cudagraph-gc`	`False`	bool flag (set to enable)	A2, A3
`--enable-nccl-nvls`	`False`	bool flag (set to enable)	Special for GPU
`--enable-symm-mem`	`False`	bool flag (set to enable)	Special for GPU
`--disable-flashinfer-cutlass-moe-fp4-allgather`	`False`	bool flag (set to enable)	Special for GPU
`--enable-tokenizer-batch-encode`	`False`	bool flag (set to enable)	A2, A3
`--disable-tokenizer-batch-decode`	`False`	bool flag (set to enable)	A2, A3
`--disable-custom-all-reduce`	`False`	bool flag (set to enable)	Special for GPU
`--enable-mscclpp`	`False`	bool flag (set to enable)	Special for GPU
`--enable-torch-symm-mem`	`False`	bool flag (set to enable)	Special for GPU
`--disable-overlap-schedule`	`False`	bool flag (set to enable)	A2, A3
`--enable-mixed-chunk`	`False`	bool flag (set to enable)	A2, A3
`--enable-dp-attention`	`False`	bool flag (set to enable)	A2, A3
`--enable-dp-attention-local-control-broadcast`	`False`	bool flag (set to enable)	A2, A3
`--enable-dp-lm-head`	`False`	bool flag (set to enable)	A2, A3
`--enable-two-batch-overlap`	`False`	bool flag (set to enable)	Planned
`--enable-single-batch-overlap`	`False`	bool flag (set to enable)	A2, A3
`--tbo-token-distribution-threshold`	`0.48`	Type: float	Planned
`--enable-torch-compile`	`False`	bool flag (set to enable)	A2, A3
`--enable-torch-compile-debug-mode`	`False`	bool flag (set to enable)	A2, A3
`--enforce-piecewise-cuda-graph`	`False`	bool flag (set to enable); Currently, Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct models are supported.	A2, A3
`--piecewise-cuda-graph-tokens`	`None`	Type: JSON list	A2, A3
`--piecewise-cuda-graph-compiler`	`eager`	`eager`	A2, A3
`--torch-compile-max-bs`	`32`	Type: int	A2, A3
`--piecewise-cuda-graph-max-tokens`	`None`	Type: int	A2, A3
`--torchao-config`	`""`	Type: str	Special for GPU
`--enable-nan-detection`	`False`	bool flag (set to enable)	A2, A3
`--enable-p2p-check`	`False`	bool flag (set to enable)	Special for GPU
`--triton-attention-reduce-in-fp32`	`False`	bool flag (set to enable)	Special for GPU
`--triton-attention-num-kv-splits`	`8`	Type: int	Special for GPU
`--triton-attention-split-tile-size`	`None`	Type: int	Special for GPU
`--delete-ckpt-after-loading`	`False`	bool flag (set to enable)	A2, A3
`--enable-memory-saver`	`False`	bool flag (set to enable)	A2, A3
`--enable-weights-cpu-backup`	`False`	bool flag (set to enable)	A2, A3
`--enable-draft-weights-cpu-backup`	`False`	bool flag (set to enable)	A2, A3
`--allow-auto-truncate`	`False`	bool flag (set to enable)	A2, A3
`--enable-custom-logit-processor`	`False`	bool flag (set to enable)	A2, A3
`--flashinfer-mla-disable-ragged`	`False`	bool flag (set to enable)	Special for GPU
`--disable-shared-experts-fusion`	`True`	bool flag (set to enable)	A2, A3
`--enforce-shared-experts-fusion`	`False`	bool flag (set to enable)	A2, A3
`--disable-chunked-prefix-cache`	`True`	bool flag (set to enable)	A2, A3
`--disable-fast-image-processor`	`False`	bool flag (set to enable)	A2, A3
`--keep-mm-feature-on-device`	`False`	bool flag (set to enable)	A2, A3
`--enable-return-hidden-states`	`False`	bool flag (set to enable)	A2, A3
`--enable-return-routed-experts`	`False`	bool flag (set to enable)	A2, A3
`--scheduler-recv-interval`	`1`	Type: int	A2, A3
`--numa-node`	`None`	List[int]	A2, A3
`--enable-deterministic-inference`	`False`	bool flag (set to enable)	Planned
`--rl-on-policy-target`	`None`	`fsdp`	Planned
`--enable-layerwise-nvtx-marker`	`False`	bool flag (set to enable)	Special for GPU
`--enable-attn-tp-input-scattered`	`False`	bool flag (set to enable)	Experimental
`--enable-dsa-prefill-context-parallel`	`False`	bool flag (set to enable)	A2, A3
`--enable-prefill-context-parallel`	`False`	bool flag (set to enable)	A2, A3
`--prefill-cp-mode`	`in-seq-split`	Type: str	A2, A3
`--enable-fused-qk-norm-rope`	`False`	bool flag (set to enable)	Special for GPU
`--enable-precise-embedding-interpolation`	`False`	bool flag (set to enable)	A2, A3
`--gc-threshold`	`None`	List[int]	A2, A3

Dynamic batch tokenizer

Argument	Defaults	Options	Server supported
`--enable-dynamic-batch-tokenizer`	`False`	bool flag (set to enable)	A2, A3
`--dynamic-batch-tokenizer-batch-size`	`32`	Type: int	A2, A3
`--dynamic-batch-tokenizer-batch-timeout`	`0.002`	Type: float	A2, A3

Debug tensor dumps

Argument	Defaults	Options	Server supported
`--debug-tensor-dump-output-folder`	`None`	Type: str	A2, A3
`--debug-tensor-dump-layers`	`None`	List[int]	A2, A3
`--debug-tensor-dump-input-file`	`None`	Type: str	A2, A3

PD disaggregation

Argument	Defaults	Options	Server supported
`--disaggregation-mode`	`null`	`null`, `prefill`, `decode`	A2, A3
`--disaggregation-transfer-backend`	`mooncake`	`ascend`	A2, A3
`--disaggregation-bootstrap-port`	`8998`	Type: int	A2, A3
`--disaggregation-ib-device`	`None`	Type: str	Special for GPU
`--disaggregation-decode-enable-offload-kvcache`	`False`	`False`	A2, A3
`--num-reserved-decode-tokens`	`512`	Type: int	A2, A3
`--disaggregation-decode-polling-interval`	`1`	Type: int	A2, A3
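
In a PD-disaggregated deployment, the prefill and decode servers are launched separately and differ mainly in `--disaggregation-mode`. The sketch below builds both commands with the `ascend` transfer backend listed above. The helper name and model path are hypothetical, and passing the bootstrap port only to the prefill server is an assumption of this sketch, not a documented requirement.

```python
def pd_command(mode, model_path, bootstrap_port=8998):
    """Return an argv list for a prefill or decode server."""
    if mode not in ("prefill", "decode"):
        raise ValueError(f"mode must be 'prefill' or 'decode', got {mode!r}")
    argv = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--disaggregation-mode", mode,
        "--disaggregation-transfer-backend", "ascend",
    ]
    # Assumption: only the prefill side is given the bootstrap port here.
    if mode == "prefill":
        argv += ["--disaggregation-bootstrap-port", str(bootstrap_port)]
    return argv


prefill_cmd = pd_command("prefill", "/models/example-model")
decode_cmd = pd_command("decode", "/models/example-model")
print(" ".join(prefill_cmd))
print(" ".join(decode_cmd))
```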

Encode prefill disaggregation

Argument	Defaults	Options	Server supported
`--enable-adaptive-dispatch-to-encoder`	`False`	bool flag (set to enable adaptive dispatch)	A2, A3
`--encoder-only`	`False`	bool flag (set to launch an encoder-only server)	A2, A3
`--language-only`	`False`	bool flag (set to load weights for the language model only)	A2, A3
`--encoder-transfer-backend`	`zmq_to_scheduler`	`zmq_to_scheduler`, `zmq_to_tokenizer`, `mooncake`	A2, A3
`--encoder-urls`	`[]`	List[str] (List of encoder server urls)	A2, A3

Custom weight loader

Argument	Defaults	Options	Server supported
`--custom-weight-loader`	`None`	List[str]	A2, A3
`--weight-loader-disable-mmap`	`False`	bool flag (set to enable)	A2, A3
`--weight-loader-prefetch-checkpoints`	`False`	bool flag (set to enable)	A2, A3
`--weight-loader-prefetch-num-threads`	`4`	Type: int	A2, A3
`--remote-instance-weight-loader-seed-instance-ip`	`None`	Type: str	Special for GPU
`--remote-instance-weight-loader-seed-instance-service-port`	`None`	Type: int	Special for GPU
`--remote-instance-weight-loader-send-weights-group-ports`	`None`	Type: JSON list	Special for GPU
`--remote-instance-weight-loader-backend`	`nccl`	`transfer_engine`, `nccl`	Special for GPU
`--remote-instance-weight-loader-start-seed-via-transfer-engine`	`False`	bool flag (set to enable)	Special for GPU

For PD-Multiplexing

Argument	Defaults	Options	Server supported
`--enable-pdmux`	`False`	bool flag (set to enable)	Special for GPU
`--pdmux-config-path`	`None`	Type: str	Special for GPU
`--sm-group-num`	`8`	Type: int	Special for GPU

For Multi-Modal

Argument	Defaults	Options	Server supported
`--enable-broadcast-mm-inputs-process`	`False`	bool flag (set to enable)	A2, A3
`--mm-process-config`	`None`	Type: JSON / Dict	A2, A3
`--mm-enable-dp-encoder`	`False`	bool flag (set to enable)	A2, A3
`--limit-mm-data-per-request`	`None`	Type: JSON / Dict	A2, A3

For checkpoint decryption

Argument	Defaults	Options	Server supported
`--decrypted-config-file`	`None`	Type: str	A2, A3
`--decrypted-draft-config-file`	`None`	Type: str	A2, A3
`--enable-prefix-mm-cache`	`False`	bool flag (set to enable)	A2, A3

Forward hooks

Argument	Defaults	Options	Server supported
`--forward-hooks`	`None`	Type: JSON list	A2, A3

Configuration file support

Argument	Defaults	Options	Server supported
`--config`	`None`	Type: str	A2, A3

Other Params

The following parameters are not supported because the third-party components they depend on, such as KTransformers and checkpoint-engine, are not compatible with the NPU.

Argument	Defaults	Options
`--checkpoint-engine-wait-weights-before-ready`	`False`	bool flag (set to enable)
`--kt-weight-path`	`None`	Type: str
`--kt-method`	`AMXINT4`	Type: str
`--kt-cpuinfer`	`None`	Type: int
`--kt-threadpool-count`	`2`	Type: int
`--kt-num-gpu-experts`	`None`	Type: int
`--kt-max-deferred-experts-per-token`	`None`	Type: int

The following parameters have some functional deficiencies in the community version.

Argument	Defaults	Options
`--tool-server`	`None`	Type: str
