Model and tokenizer
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--model-path, --model | None | Type: str | A2, A3
--tokenizer-path | None | Type: str | A2, A3 |
--tokenizer-mode | auto | auto, slow | A2, A3 |
--tokenizer-worker-num | 1 | Type: int | A2, A3 |
--skip-tokenizer-init | False | bool flag (set to enable) | A2, A3 |
--load-format | auto | auto, safetensors | A2, A3 |
--model-loader-extra-config | {} | Type: str | A2, A3
--trust-remote-code | False | bool flag (set to enable) | A2, A3 |
--context-length | None | Type: int | A2, A3 |
--is-embedding | False | bool flag (set to enable) | A2, A3 |
--enable-multimodal | None | bool flag (set to enable) | A2, A3 |
--revision | None | Type: str | A2, A3 |
--model-impl | auto | auto, sglang,<br/> transformers | A2, A3 |
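Taken together, the model and tokenizer flags form the core of a launch command. A minimal sketch (the model path and context length are illustrative, not defaults):

```shell
# Launch with an extended context window; --trust-remote-code is required
# for checkpoints whose config ships custom modeling code.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --trust-remote-code \
  --context-length 8192
```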
HTTP server
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--host | 127.0.0.1 | Type: str | A2, A3 |
--port | 30000 | Type: int | A2, A3 |
--skip-server-warmup | False | bool flag (set to enable) | A2, A3 |
--warmups | None | Type: str | A2, A3 |
--nccl-port | None | Type: int | A2, A3 |
--fastapi-root-path | None | Type: str | A2, A3 |
--grpc-mode | False | bool flag (set to enable) | Planned
Quantization and data type
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--dtype | auto | auto,<br/> float16,<br/> bfloat16 | A2, A3 |
--quantization | None | modelslim | A2, A3 |
--quantization-param-path | None | Type: str | Special For GPU |
--kv-cache-dtype | auto | auto | A2, A3 |
--enable-fp32-lm-head | False | bool flag (set to enable) | A2, A3 |
--modelopt-quant | None | Type: str | Special For GPU |
--modelopt-checkpoint-restore-path | None | Type: str | Special For GPU |
--modelopt-checkpoint-save-path | None | Type: str | Special For GPU |
--modelopt-export-path | None | Type: str | Special For GPU |
--quantize-and-serve | False | bool flag (set to enable) | Special For GPU |
--rl-quant-profile | None | Type: str | Special For GPU |
Memory and scheduling
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--mem-fraction-static | None | Type: float | A2, A3 |
--max-running-requests | None | Type: int | A2, A3 |
--prefill-max-requests | None | Type: int | A2, A3 |
--max-queued-requests | None | Type: int | A2, A3 |
--max-total-tokens | None | Type: int | A2, A3 |
--chunked-prefill-size | None | Type: int | A2, A3 |
--max-prefill-tokens | 16384 | Type: int | A2, A3 |
--schedule-policy | fcfs | lpm, fcfs | A2, A3 |
--enable-priority-scheduling | False | bool flag (set to enable) | A2, A3 |
--schedule-low-priority-values-first | False | bool flag (set to enable) | A2, A3 |
--priority-scheduling-preemption-threshold | 10 | Type: int | A2, A3 |
--schedule-conservativeness | 1.0 | Type: float | A2, A3 |
--page-size | 128 | Type: int | A2, A3 |
--swa-full-tokens-ratio | 0.8 | Type: float | Planned |
--disable-hybrid-swa-memory | False | bool flag (set to enable) | Planned |
--radix-eviction-policy | lru | lru,<br/>lfu | A2, A3
--enable-prefill-delayer | False | bool flag (set to enable) | A2, A3
--prefill-delayer-max-delay-passes | 30 | Type: int | A2, A3
--prefill-delayer-token-usage-low-watermark | None | Type: float | A2, A3
--prefill-delayer-forward-passes-buckets | None | List[float] | A2, A3
--prefill-delayer-wait-seconds-buckets | None | List[float] | A2, A3
--abort-on-priority-when-disabled | False | bool flag (set to enable) | A2, A3
--enable-dynamic-chunking | False | bool flag (set to enable) | Experimental |
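A hedged example combining the most common memory and scheduling knobs (the values are illustrative starting points, not recommendations):

```shell
# Reserve 85% of device memory for static allocations, cap concurrency,
# and chunk long prefills so decode requests are not starved.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --mem-fraction-static 0.85 \
  --max-running-requests 256 \
  --chunked-prefill-size 8192 \
  --schedule-policy lpm
```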
Runtime options
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--device | None | Type: str | A2, A3 |
--tensor-parallel-size, --tp-size | 1 | Type: int | A2, A3
--pipeline-parallel-size, --pp-size | 1 | Type: int; a value of 2 is not currently supported | Experimental
--attention-context-parallel-size, --attn-cp-size | 1 | Type: int; must equal --tp-size | A2, A3
--moe-data-parallel-size, --moe-dp-size | 1 | Type: int | Planned
--pp-max-micro-batch-size | None | Type: int | Experimental
--pp-async-batch-depth | None | Type: int | Experimental
--stream-interval | 1 | Type: int | A2, A3
--incremental-streaming-output | False | bool flag (set to enable) | A2, A3
--random-seed | None | Type: int | A2, A3
--constrained-json-whitespace-pattern | None | Type: str | A2, A3
--constrained-json-disable-any-whitespace | False | bool flag (set to enable) | A2, A3
--watchdog-timeout | 300 | Type: float | A2, A3
--soft-watchdog-timeout | 300 | Type: float | A2, A3
--dist-timeout | None | Type: int | A2, A3
--download-dir | None | Type: str | A2, A3
--model-checksum | None | Type: str | Planned
--base-gpu-id | 0 | Type: int | A2, A3
--gpu-id-step | 1 | Type: int | A2, A3
--sleep-on-idle | False | bool flag (set to enable) | A2, A3
Logging
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--log-level | info | Type: str | A2, A3 |
--log-level-http | None | Type: str | A2, A3 |
--log-requests | False | bool flag (set to enable) | A2, A3 |
--log-requests-level | 2 | 0, 1, 2, 3 | A2, A3 |
--log-requests-format | text | text, json | A2, A3 |
--crash-dump-folder | None | Type: str | A2, A3 |
--enable-metrics | False | bool flag (set to enable) | A2, A3 |
--enable-metrics-for-all-schedulers | False | bool flag (set to enable) | A2, A3 |
--tokenizer-metrics-custom-labels-header | x-custom-labels | Type: str | A2, A3 |
--tokenizer-metrics-allowed-custom-labels | None | List[str] | A2, A3 |
--bucket-time-to-first-token | None | List[float] | A2, A3 |
--bucket-inter-token-latency | None | List[float] | A2, A3 |
--bucket-e2e-request-latency | None | List[float] | A2, A3 |
--collect-tokens-histogram | False | bool flag (set to enable) | A2, A3 |
--prompt-tokens-buckets | None | List[str] | A2, A3 |
--generation-tokens-buckets | None | List[str] | A2, A3 |
--gc-warning-threshold-secs | 0.0 | Type: float | A2, A3 |
--decode-log-interval | 40 | Type: int | A2, A3 |
--enable-request-time-stats-logging | False | bool flag (set to enable) | A2, A3 |
--kv-events-config | None | Type: str | Special for GPU |
--enable-trace | False | bool flag (set to enable) | A2, A3 |
--oltp-traces-endpoint | localhost:4317 | Type: str | A2, A3 |
--log-requests-target | None | Type: str | A2, A3
--uvicorn-access-log-exclude-prefixes | [] | List[str] | A2, A3
RequestMetricsExporter configuration
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--export-metrics-to-file | False | bool flag (set to enable) | A2, A3 |
--export-metrics-to-file-dir | None | Type: str | A2, A3 |
API related
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--api-key | None | Type: str | A2, A3 |
--admin-api-key | None | Type: str | A2, A3 |
--served-model-name | None | Type: str | A2, A3 |
--weight-version | default | Type: str | A2, A3 |
--chat-template | None | Type: str | A2, A3 |
--hf-chat-template-name | None | Type: str | A2, A3
--completion-template | None | Type: str | A2, A3
--enable-cache-report | False | bool flag (set to enable) | A2, A3
--reasoning-parser | None | deepseek-r1,<br/> deepseek-v3,<br/> glm45,<br/> gpt-oss,<br/> kimi,<br/> qwen3,<br/> qwen3-thinking,<br/> step3 | A2, A3
--tool-call-parser | None | llama3,<br/> pythonic,<br/> qwen,<br/> qwen3_coder | A2, A3
--sampling-defaults | model | openai, model | A2, A3 |
Data parallelism
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--data-parallel-size, --dp-size | 1 | Type: int | A2, A3
--load-balance-method | auto | auto,<br/> round_robin,<br/> follow_bootstrap_room,<br/> total_requests,<br/> total_tokens | A2, A3 |
Multi-node distributed serving
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--dist-init-addr, --nccl-init-addr | None | Type: str | A2, A3
--nnodes | 1 | Type: int | A2, A3 |
--node-rank | 0 | Type: int | A2, A3 |
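For example, serving one model across two nodes with tensor parallelism spanning both (the address, model path, and sizes are illustrative):

```shell
# Node 0 (also hosts the distributed init endpoint)
python -m sglang.launch_server --model-path /models/large-model \
  --tp-size 16 --nnodes 2 --node-rank 0 --dist-init-addr 10.0.0.1:5000

# Node 1
python -m sglang.launch_server --model-path /models/large-model \
  --tp-size 16 --nnodes 2 --node-rank 1 --dist-init-addr 10.0.0.1:5000
```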
Model override args
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--json-model-override-args | {} | Type: str | A2, A3 |
--preferred-sampling-params | None | Type: str | A2, A3 |
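`--json-model-override-args` expects a single JSON string, and building it with `json.dumps` avoids shell-quoting mistakes. The override keys below are illustrative; they must match fields in the target model's config:

```python
import json

# Fields to override in the model's config (illustrative keys).
override = {"rope_scaling": {"rope_type": "yarn", "factor": 4.0}}

# Serialize once so the CLI receives one well-formed JSON argument, e.g.
#   --json-model-override-args '{"rope_scaling": {"rope_type": "yarn", "factor": 4.0}}'
arg = json.dumps(override)
assert json.loads(arg) == override  # round-trips cleanly
```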
LoRA
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--enable-lora | False | bool flag (set to enable) | A2, A3
--enable-lora-overlap-loading | False | bool flag (set to enable) | A2, A3
--max-lora-rank | None | Type: int | A2, A3
--lora-target-modules | None | all | A2, A3
--lora-paths | None | List[str] /<br/> JSON objects | A2, A3
--max-loras-per-batch | 8 | Type: int | A2, A3
--max-loaded-loras | None | Type: int | A2, A3
--lora-eviction-policy | lru | lru,<br/> fifo | A2, A3
--lora-backend | csgmv | triton,<br/> csgmv,<br/> ascend,<br/> torch_native | A2, A3
--max-lora-chunk-size | 16 | 16, 32,<br/> 64, 128 | Special for GPU |
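A sketch of serving with LoRA adapters attached at launch (adapter names and paths are illustrative):

```shell
# Serve a base model with two named adapters; requests select an adapter
# by its registered name.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --enable-lora \
  --lora-paths adapter_a=/loras/adapter_a adapter_b=/loras/adapter_b \
  --max-loras-per-batch 4
```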
Kernel Backends (Attention, Sampling, Grammar, GEMM)
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--attention-backend | None | ascend | A2, A3 |
--prefill-attention-backend | None | ascend | A2, A3 |
--decode-attention-backend | None | ascend | A2, A3 |
--sampling-backend | None | pytorch,<br/>ascend | A2, A3 |
--grammar-backend | None | xgrammar | A2, A3 |
--mm-attention-backend | None | ascend_attn | A2, A3 |
--nsa-prefill-backend | flashmla_sparse | flashmla_sparse,<br/> flashmla_decode,<br/>fa3,<br/> tilelang,<br/> aiter | Special for GPU |
--nsa-decode-backend | fa3 | flashmla_prefill,<br/> flashmla_kv,<br/> fa3,<br/>tilelang,<br/> aiter | Special for GPU |
--fp8-gemm-backend | auto | auto,<br/> deep_gemm,<br/> flashinfer_trtllm,<br/>flashinfer_cutlass,<br/>flashinfer_deepgemm,<br/>cutlass,<br/> triton,<br/> aiter | Special for GPU |
--disable-flashinfer-autotune | False | bool flag (set to enable) | Special for GPU |
Speculative decoding
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--speculative-algorithm | None | EAGLE3,<br/> NEXTN | A2, A3 |
--speculative-draft-model-path, --speculative-draft-model | None | Type: str | A2, A3
--speculative-draft-model-revision | None | Type: str,<br/> branch name,<br/> tag name,<br/> commit id | A2, A3 |
--speculative-draft-load-format | auto | auto,<br/> dummy | A2, A3 |
--speculative-num-steps | None | Type: int | A2, A3 |
--speculative-eagle-topk | None | Type: int | A2, A3 |
--speculative-num-draft-tokens | None | Type: int | A2, A3 |
--speculative-accept-threshold-single | 1.0 | Type: float | Special for GPU |
--speculative-accept-threshold-acc | 1.0 | Type: float | Special for GPU |
--speculative-token-map | None | Type: str | A2, A3 |
--speculative-attention-mode | prefill | prefill,<br/> decode | A2, A3 |
--speculative-moe-runner-backend | None | auto | A2, A3 |
--speculative-moe-a2a-backend | None | ascend_fuseep | A2, A3 |
--speculative-draft-attention-backend | None | ascend | A2, A3 |
--speculative-draft-model-quantization | None | unquant | A2, A3 |
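For example, EAGLE3 speculative decoding with a separate draft model (paths and step counts are illustrative and should be tuned per model):

```shell
# The draft model proposes tokens for --speculative-num-steps steps;
# the target model verifies up to --speculative-num-draft-tokens of them.
python -m sglang.launch_server \
  --model-path /models/target-model \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path /models/eagle3-draft \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4
```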
Ngram speculative decoding
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--speculative-ngram-min-match-window-size | 1 | Type: int | Experimental |
--speculative-ngram-max-match-window-size | 12 | Type: int | Experimental |
--speculative-ngram-min-bfs-breadth | 1 | Type: int | Experimental |
--speculative-ngram-max-bfs-breadth | 10 | Type: int | Experimental |
--speculative-ngram-match-type | BFS | BFS (recency-based expansion),<br/> PROB (frequency-based expansion) | Experimental
--speculative-ngram-max-trie-depth | 18 | Type: int | Experimental
--speculative-ngram-capacity | 10000000 | Type: int | Experimental |
Expert parallelism
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--expert-parallel-size, --ep-size, --ep | 1 | Type: int | A2, A3
--moe-a2a-backend | none | none,<br/> deepep,<br/> ascend_fuseep (incompatible with EPLB) | A2, A3
--moe-runner-backend | auto | auto, triton | A2, A3 |
--flashinfer-mxfp4-moe-precision | default | default,<br/> bf16 | Special for GPU |
--enable-flashinfer-allreduce-fusion | False | bool flag (set to enable) | Special for GPU |
--deepep-mode | auto | normal, <br/>low_latency,<br/> auto | A2, A3 |
--deepep-config | None | Type: str | Special for GPU |
--ep-num-redundant-experts | 0 | Type: int | A2, A3 |
--ep-dispatch-algorithm | None | static,<br/> dynamic,<br/> fake | A2, A3 |
--init-expert-location | trivial | trivial,<br/> <path.pt>,<br/> <path.json>,<br/> <json_string> | A2, A3 |
--enable-eplb | False | bool flag (set to enable) | A2, A3 |
--eplb-algorithm | deepseek | auto,<br/> deepseek | A2, A3 |
--eplb-rebalance-num-iterations | 1000 | Type: int | A2, A3
--eplb-rebalance-layers-per-chunk | None | Type: int | A2, A3
--eplb-min-rebalancing-utilization-threshold | 1.0 | Type: float | A2, A3
--expert-distribution-recorder-mode | None | stat,<br/> stat_approx,<br/> per_pass,<br/> per_token | A2, A3
--expert-distribution-recorder-buffer-size | None | Type: int | A2, A3
--enable-expert-distribution-metrics | False | bool flag (set to enable) | A2, A3
--moe-dense-tp-size | None | 1 | A2, A3
--elastic-ep-backend | None | none, mooncake | Special for GPU
--mooncake-ib-device | None | Type: str | Special for GPU |
Mamba Cache
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--max-mamba-cache-size | None | Type: int | A2, A3 |
--mamba-ssm-dtype | float32 | float32,<br/>bfloat16,<br/>float16 | A2, A3 |
--mamba-full-memory-ratio | 0.9 | Type: float | A2, A3 |
--mamba-scheduler-strategy | auto | auto,<br/>no_buffer,<br/>extra_buffer | A2, A3 |
--mamba-track-interval | 256 | Type: int | A2, A3 |
Hierarchical cache
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--enable-hierarchical-cache | False | bool flag<br/> (set to enable).<br/> Currently, mamba cache is not supported. | A2, A3 |
--hicache-ratio | 2.0 | Type: float | A2, A3 |
--hicache-size | 0 | Type: int | A2, A3 |
--hicache-write-policy | write_through | Currently only write_back supported | A2, A3 |
--hicache-io-backend | kernel | kernel_ascend,<br/> direct | A2, A3
--hicache-mem-layout | layer_first | page_first_direct,<br/> page_first_kv_split | A2, A3
--hicache-storage-backend | None | file | A2, A3
--hicache-storage-prefetch-policy | best_effort | best_effort,<br/> wait_complete,<br/> timeout | Special for GPU
--hicache-storage-backend-extra-config | None | Type: str | Special for GPU
LMCache
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--enable-lmcache | False | bool flag (set to enable) | Special for GPU |
Offloading (must be used with --disable-cuda-graph)
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--cpu-offload-gb | 0 | Type: int | A2, A3 |
--offload-group-size | -1 | Type: int (DeepSeek only) | A2, A3 |
--offload-num-in-group | 1 | Type: int (DeepSeek only) | A2, A3 |
--offload-prefetch-step | 1 | Type: int (DeepSeek only) | A2, A3 |
--offload-mode | cpu | cpu (DeepSeek only) <br/>meta (DeepSeek only) <br/>sharded_gpu (DeepSeek only) | A2, A3 |
Args for multi-item scoring
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--multi-item-scoring-delimiter | None | Type: int | A2, A3 |
Optimization/debug options
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--disable-radix-cache | False | bool flag (set to enable) | A2, A3 |
--cuda-graph-max-bs | None | Type: int | A2, A3 |
--cuda-graph-bs | None | List[int] | A2, A3 |
--disable-cuda-graph | False | bool flag (set to enable) | A2, A3 |
--disable-cuda-graph-padding | False | bool flag (set to enable) | A2, A3 |
--enable-profile-cuda-graph | False | bool flag (set to enable) | A2, A3 |
--enable-cudagraph-gc | False | bool flag (set to enable) | A2, A3 |
--enable-nccl-nvls | False | bool flag (set to enable) | Special for GPU |
--enable-symm-mem | False | bool flag (set to enable) | Special for GPU |
--disable-flashinfer-cutlass-moe-fp4-allgather | False | bool flag (set to enable) | Special for GPU |
--enable-tokenizer-batch-encode | False | bool flag (set to enable) | A2, A3 |
--disable-tokenizer-batch-decode | False | bool flag (set to enable) | A2, A3
--disable-custom-all-reduce | False | bool flag (set to enable) | Special for GPU
--enable-mscclpp | False | bool flag (set to enable) | Special for GPU
--enable-torch-symm-mem | False | bool flag (set to enable) | Special for GPU
--disable-overlap-schedule | False | bool flag (set to enable) | A2, A3
--enable-mixed-chunk | False | bool flag (set to enable) | A2, A3
--enable-dp-attention | False | bool flag (set to enable) | A2, A3
--enable-dp-lm-head | False | bool flag (set to enable) | A2, A3
--enable-two-batch-overlap | False | bool flag (set to enable) | Planned
--enable-single-batch-overlap | False | bool flag (set to enable) | A2, A3
--tbo-token-distribution-threshold | 0.48 | Type: float | Planned
--enable-torch-compile | False | bool flag (set to enable) | A2, A3
--enable-torch-compile-debug-mode | False | bool flag (set to enable) | A2, A3
--enforce-piecewise-cuda-graph | False | bool flag (set to enable);<br/> currently only Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct are supported | A2, A3
--piecewise-cuda-graph-tokens | None | Type: JSON list | A2, A3
--piecewise-cuda-graph-compiler | eager | eager | A2, A3
--torch-compile-max-bs | 32 | Type: int | A2, A3
--piecewise-cuda-graph-max-tokens | None | Type: int | A2, A3
--torchao-config | "" | Type: str | Special for GPU
--enable-nan-detection | False | bool flag (set to enable) | A2, A3
--enable-p2p-check | False | bool flag (set to enable) | Special for GPU
--triton-attention-reduce-in-fp32 | False | bool flag (set to enable) | Special for GPU
--triton-attention-num-kv-splits | 8 | Type: int | Special for GPU
--triton-attention-split-tile-size | None | Type: int | Special for GPU
--delete-ckpt-after-loading | False | bool flag (set to enable) | A2, A3
--enable-memory-saver | False | bool flag (set to enable) | A2, A3
--enable-weights-cpu-backup | False | bool flag (set to enable) | A2, A3
--enable-draft-weights-cpu-backup | False | bool flag (set to enable) | A2, A3
--allow-auto-truncate | False | bool flag (set to enable) | A2, A3
--enable-custom-logit-processor | False | bool flag (set to enable) | A2, A3
--flashinfer-mla-disable-ragged | False | bool flag (set to enable) | Special for GPU
--disable-shared-experts-fusion | True | bool flag (set to enable) | A2, A3
--disable-chunked-prefix-cache | True | bool flag (set to enable) | A2, A3
--disable-fast-image-processor | False | bool flag (set to enable) | A2, A3
--keep-mm-feature-on-device | False | bool flag (set to enable) | A2, A3
--enable-return-hidden-states | False | bool flag (set to enable) | A2, A3
--enable-return-routed-experts | False | bool flag (set to enable) | A2, A3
--scheduler-recv-interval | 1 | Type: int | A2, A3
--numa-node | None | List[int] | A2, A3
--enable-deterministic-inference | False | bool flag (set to enable) | Planned
--rl-on-policy-target | None | fsdp | Planned |
--enable-layerwise-nvtx-marker | False | bool flag (set to enable) | Special for GPU |
--enable-attn-tp-input-scattered | False | bool flag (set to enable) | Experimental |
--enable-nsa-prefill-context-parallel | False | bool flag (set to enable) | A2, A3 |
--enable-fused-qk-norm-rope | False | bool flag (set to enable) | Special for GPU |
Dynamic batch tokenizer
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--enable-dynamic-batch-tokenizer | False | bool flag (set to enable) | A2, A3 |
--dynamic-batch-tokenizer-batch-size | 32 | Type: int | A2, A3 |
--dynamic-batch-tokenizer-batch-timeout | 0.002 | Type: float | A2, A3 |
Debug tensor dumps
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--debug-tensor-dump-output-folder | None | Type: str | A2, A3 |
--debug-tensor-dump-layers | None | List[int] | A2, A3 |
--debug-tensor-dump-input-file | None | Type: str | A2, A3 |
PD disaggregation
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--disaggregation-mode | null | null,<br/> prefill,<br/> decode | A2, A3 |
--disaggregation-transfer-backend | mooncake | ascend | A2, A3 |
--disaggregation-bootstrap-port | 8998 | Type: int | A2, A3 |
--disaggregation-ib-device | None | Type: str | Special for GPU
--disaggregation-decode-enable-offload-kvcache | False | bool flag (set to enable) | A2, A3
--num-reserved-decode-tokens | 512 | Type: int | A2, A3
--disaggregation-decode-polling-interval | 1 | Type: int | A2, A3
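A minimal PD-disaggregation pair, one prefill instance and one decode instance (hosts, ports, and the model path are illustrative):

```shell
# Prefill instance: computes the prompt KV cache and ships it to decoders.
python -m sglang.launch_server --model-path /models/m \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend ascend \
  --disaggregation-bootstrap-port 8998

# Decode instance: receives the KV cache and generates tokens.
python -m sglang.launch_server --model-path /models/m \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend ascend
```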
Encode prefill disaggregation
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--enable-adaptive-dispatch-to-encoder | False | bool flag<br/> (set to enable adaptive dispatch) | A2, A3
--encoder-only | False | bool flag<br/> (set to launch an encoder-only server) | A2, A3
--language-only | False | bool flag<br/> (set to load weights for the language model only) | A2, A3
--encoder-transfer-backend | zmq_to_scheduler | zmq_to_scheduler,<br/> zmq_to_tokenizer,<br/> mooncake | A2, A3
--encoder-urls | [] | List[str]<br/> (List of encoder server urls) | A2, A3 |
Custom weight loader
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--custom-weight-loader | None | List[str] | A2, A3 |
--weight-loader-disable-mmap | False | bool flag (set to enable) | A2, A3 |
--remote-instance-weight-loader-seed-instance-ip | None | Type: str | A2, A3 |
--remote-instance-weight-loader-seed-instance-service-port | None | Type: int | A2, A3 |
--remote-instance-weight-loader-send-weights-group-ports | None | Type: JSON list | A2, A3 |
--remote-instance-weight-loader-backend | nccl | transfer_engine, <br/> nccl | A2, A3 |
--remote-instance-weight-loader-start-seed-via-transfer-engine | False | bool flag (set to enable) | Special for GPU |
For PD-Multiplexing
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--enable-pdmux | False | bool flag (set to enable) | Special for GPU |
--pdmux-config-path | None | Type: str | Special for GPU |
--sm-group-num | 8 | Type: int | Special for GPU |
For Multi-Modal
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--enable-broadcast-mm-inputs-process | False | bool flag<br/> (set to enable) | A2, A3
--mm-process-config | None | Type: JSON / Dict | A2, A3
--mm-enable-dp-encoder | False | bool flag (set to enable) | A2, A3
--limit-mm-data-per-request | None | Type: JSON / Dict | A2, A3
For checkpoint decryption
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--decrypted-config-file | None | Type: str | A2, A3 |
--decrypted-draft-config-file | None | Type: str | A2, A3 |
--enable-prefix-mm-cache | False | bool flag (set to enable) | A2, A3 |
Forward hooks
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--forward-hooks | None | Type: JSON list | A2, A3
Configuration file support
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--config | None | Type: str | A2, A3
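`--config` lets the flags above live in a file instead of the command line. A sketch, assuming the common layout where config keys mirror the flag names (verify the exact file format and precedence rules against your version):

```shell
# Write server options to a file and point --config at it.
# Key names mirror the CLI flags (assumption -- check your version).
cat > server.yaml <<'EOF'
model-path: /models/Qwen2.5-7B-Instruct
tp-size: 4
mem-fraction-static: 0.85
EOF
python -m sglang.launch_server --config server.yaml
```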
Other Params
The following parameters are not supported because the third-party components they depend on (such as KTransformers and checkpoint-engine) are not compatible with the NPU.

| Argument | Defaults | Options |
|---|---|---|
--checkpoint-engine-wait-weights-before-ready | False | bool flag (set to enable)
--kt-weight-path | None | Type: str |
--kt-method | AMXINT4 | Type: str |
--kt-cpuinfer | None | Type: int |
--kt-threadpool-count | 2 | Type: int |
--kt-num-gpu-experts | None | Type: int |
--kt-max-deferred-experts-per-token | None | Type: int |
| Argument | Defaults | Options |
|---|---|---|
--tool-server | None | Type: str
