Model and tokenizer
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--model-path, --model | None | Type: str | A2, A3 |
--tokenizer-path | None | Type: str | A2, A3 |
--tokenizer-mode | auto | auto, slow | A2, A3 |
--tokenizer-worker-num | 1 | Type: int | A2, A3 |
--skip-tokenizer-init | False | bool flag (set to enable) | A2, A3 |
--load-format | auto | auto, safetensors, gguf | A2, A3 |
--model-loader-extra-config | {} | Type: str | A2, A3 |
--trust-remote-code | False | bool flag (set to enable) | A2, A3 |
--context-length | None | Type: int | A2, A3 |
--is-embedding | False | bool flag (set to enable) | A2, A3 |
--enable-multimodal | None | bool flag (set to enable) | A2, A3 |
--revision | None | Type: str | A2, A3 |
--model-impl | auto | auto, sglang, transformers | A2, A3 |
HTTP server
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--host | 127.0.0.1 | Type: str | A2, A3 |
--port | 30000 | Type: int | A2, A3 |
--skip-server-warmup | False | bool flag (set to enable) | A2, A3 |
--warmups | None | Type: str | A2, A3 |
--nccl-port | None | Type: int | A2, A3 |
--fastapi-root-path | None | Type: str | A2, A3 |
--grpc-mode | False | False | Planned |
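
As a rough illustration of how the arguments in the two tables above combine on a command line, the following sketch assembles an argument list from a dictionary. The `python -m sglang.launch_server` entry point, the `build_launch_args` helper, and the model path are assumptions made for illustration and are not taken from this page. Boolean flags are emitted without a value, matching the "set to enable" convention used in the tables.

```python
import shlex


def build_launch_args(options):
    """Turn {"flag-name": value} into a flat argv list.

    Hypothetical helper: True emits the bare flag, False/None omit it,
    anything else is emitted as "--flag value".
    """
    # Assumed entry point; adjust to however the server is launched locally.
    argv = ["python", "-m", "sglang.launch_server"]
    for name, value in options.items():
        if value is None or value is False:
            continue  # leave the flag at its default
        argv.append(f"--{name}")
        if value is not True:
            argv.append(str(value))
    return argv


args = build_launch_args({
    "model-path": "/path/to/model",  # placeholder path
    "tokenizer-mode": "auto",
    "trust-remote-code": True,       # bool flag: present means enabled
    "host": "127.0.0.1",
    "port": 30000,
    "skip-server-warmup": False,     # omitted, stays at its default
})
print(shlex.join(args))
```

Keeping the options in a dictionary makes it easy to see which values differ from the defaults listed in the tables before anything is launched.
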
SSL/TLS
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--ssl-keyfile | None | Type: str | A2, A3 |
--ssl-certfile | None | Type: str | A2, A3 |
--ssl-keyfile-password | None | Type: str | A2, A3 |
--enable-ssl-refresh | False | bool flag (set to enable) | A2, A3 |
--enable-http2 | False | bool flag (set to enable) | A2, A3 |
Quantization and data type
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--dtype | auto | auto,float16,bfloat16 | A2, A3 |
--quantization | None | modelslim | A2, A3 |
--quantization-param-path | None | Type: str | Special for GPU |
--kv-cache-dtype | auto | auto | A2, A3 |
--enable-fp32-lm-head | False | bool flag (set to enable) | A2, A3 |
--modelopt-quant | None | Type: str | Special for GPU |
--modelopt-checkpoint-restore-path | None | Type: str | Special for GPU |
--modelopt-checkpoint-save-path | None | Type: str | Special for GPU |
--modelopt-export-path | None | Type: str | Special for GPU |
--quantize-and-serve | False | bool flag (set to enable) | Special for GPU |
--rl-quant-profile | None | Type: str | Special for GPU |
Memory and scheduling
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--mem-fraction-static | None | Type: float | A2, A3 |
--max-running-requests | None | Type: int | A2, A3 |
--prefill-max-requests | None | Type: int | A2, A3 |
--max-queued-requests | None | Type: int | A2, A3 |
--max-total-tokens | None | Type: int | A2, A3 |
--chunked-prefill-size | None | Type: int | A2, A3 |
--max-prefill-tokens | 16384 | Type: int | A2, A3 |
--schedule-policy | fcfs | lpm, fcfs | A2, A3 |
--enable-priority-scheduling | False | bool flag (set to enable) | A2, A3 |
--disable-priority-preemption | False | bool flag (set to enable) | A2, A3 |
--default-priority-value | None | Type: int | A2, A3 |
--schedule-low-priority-values-first | False | bool flag (set to enable) | A2, A3 |
--priority-scheduling-preemption-threshold | 10 | Type: int | A2, A3 |
--schedule-conservativeness | 1.0 | Type: float | A2, A3 |
--page-size | 128 | Type: int | A2, A3 |
--swa-full-tokens-ratio | 0.8 | Type: float | Planned |
--disable-hybrid-swa-memory | False | bool flag (set to enable) | Planned |
--radix-eviction-policy | lru | lru,lfu | A2, A3 |
--enable-prefill-delayer | False | bool flag (set to enable) | A2, A3 |
--prefill-delayer-max-delay-passes | 30 | Type: int | A2, A3 |
--prefill-delayer-token-usage-low-watermark | None | Type: float | A2, A3 |
--prefill-delayer-forward-passes-buckets | None | List[float] | A2, A3 |
--prefill-delayer-wait-seconds-buckets | None | List[float] | A2, A3 |
--abort-on-priority-when-disabled | False | bool flag (set to enable) | A2, A3 |
--enable-dynamic-chunking | False | bool flag (set to enable) | Experimental |
Runtime options
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--device | None | Type: str | A2, A3 |
--tensor-parallel-size, --tp-size | 1 | Type: int | A2, A3 |
--pipeline-parallel-size, --pp-size | 1 | Type: int; Currently 2 not supported | Experimental |
--attention-context-parallel-size, --attn-cp-size | 1 | Type: int; must be equal to --tp-size | A2, A3 |
--moe-data-parallel-size, --moe-dp-size | 1 | Type: int | Planned |
--pp-max-micro-batch-size | None | Type: int | Experimental |
--pp-async-batch-depth | None | Type: int | Experimental |
--stream-interval | 1 | Type: int | A2, A3 |
--incremental-streaming-output | False | bool flag (set to enable) | A2, A3 |
--stream-response-default-include-usage | False | bool flag (set to enable) | A2, A3 |
--enable-streaming-session | False | bool flag (set to enable) | A2, A3 |
--random-seed | None | Type: int | A2, A3 |
--constrained-json-whitespace-pattern | None | Type: str | A2, A3 |
--constrained-json-disable-any-whitespace | False | bool flag (set to enable) | A2, A3 |
--watchdog-timeout | 300 | Type: float | A2, A3 |
--soft-watchdog-timeout | 300 | Type: float | A2, A3 |
--dist-timeout | None | Type: int | A2, A3 |
--download-dir | None | Type: str | A2, A3 |
--model-checksum | None | Type: str | Planned |
--base-gpu-id | 0 | Type: int | A2, A3 |
--gpu-id-step | 1 | Type: int | A2, A3 |
--sleep-on-idle | False | bool flag (set to enable) | A2, A3 |
--use-ray | False | bool flag (set to enable) | A2, A3 |
--custom-sigquit-handler | None | Only for engine | A2, A3 |
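
The runtime table lists a long name and a short alias for the parallelism flags and notes that the attention context-parallel size must be equal to the tensor-parallel size. The sketch below shows one way such aliases and the equality rule could be modelled with `argparse`. It is not the server's own parser: `parse_parallel_args` is a hypothetical helper, and treating the default of 1 as "feature off" is an assumption.

```python
import argparse


def parse_parallel_args(argv):
    """Parse two of the runtime flags and enforce the documented constraint."""
    parser = argparse.ArgumentParser()
    # The long name and the short alias store into the same destination.
    parser.add_argument("--tensor-parallel-size", "--tp-size",
                        dest="tp_size", type=int, default=1)
    parser.add_argument("--attention-context-parallel-size", "--attn-cp-size",
                        dest="attn_cp_size", type=int, default=1)
    ns = parser.parse_args(argv)
    # Assumption: the default of 1 means the feature is off, so the
    # equality rule only applies once the value is raised above 1.
    if ns.attn_cp_size != 1 and ns.attn_cp_size != ns.tp_size:
        parser.error("--attn-cp-size must be equal to --tp-size")
    return ns


ns = parse_parallel_args(["--tp-size", "4", "--attn-cp-size", "4"])
print(ns.tp_size, ns.attn_cp_size)
```

Calling `parse_parallel_args(["--tp-size", "4", "--attn-cp-size", "2"])` would exit with an error message instead of returning.
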
Logging
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--log-level | info | Type: str | A2, A3 |
--log-level-http | None | Type: str | A2, A3 |
--log-requests | False | bool flag (set to enable) | A2, A3 |
--log-requests-level | 2 | 0, 1, 2, 3 | A2, A3 |
--log-requests-format | text | text, json | A2, A3 |
--crash-dump-folder | None | Type: str | A2, A3 |
--enable-metrics | False | bool flag (set to enable) | A2, A3 |
--enable-mfu-metrics | False | bool flag (set to enable) | A2, A3 |
--enable-metrics-for-all-schedulers | False | bool flag (set to enable) | A2, A3 |
--tokenizer-metrics-custom-labels-header | x-custom-labels | Type: str | A2, A3 |
--tokenizer-metrics-allowed-custom-labels | None | List[str] | A2, A3 |
--extra-metric-labels | None | Type: JSON/Dict | A2, A3 |
--bucket-time-to-first-token | None | List[float] | A2, A3 |
--bucket-inter-token-latency | None | List[float] | A2, A3 |
--bucket-e2e-request-latency | None | List[float] | A2, A3 |
--collect-tokens-histogram | False | bool flag (set to enable) | A2, A3 |
--prompt-tokens-buckets | None | List[str] | A2, A3 |
--generation-tokens-buckets | None | List[str] | A2, A3 |
--gc-warning-threshold-secs | 0.0 | Type: float | A2, A3 |
--decode-log-interval | 40 | Type: int | A2, A3 |
--enable-request-time-stats-logging | False | bool flag (set to enable) | A2, A3 |
--kv-events-config | None | Type: str | Special for GPU |
--enable-trace | False | bool flag (set to enable) | A2, A3 |
--oltp-traces-endpoint | localhost:4317 | Type: str | A2, A3 |
--log-requests-target | None | Type: str | A2, A3 |
--uvicorn-access-log-exclude-prefixes | [] | List[str] | A2, A3 |
RequestMetricsExporter configuration
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--export-metrics-to-file | False | bool flag (set to enable) | A2, A3 |
--export-metrics-to-file-dir | None | Type: str | A2, A3 |
API related
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--api-key | None | Type: str | A2, A3 |
--admin-api-key | None | Type: str | A2, A3 |
--served-model-name | None | Type: str | A2, A3 |
--weight-version | default | Type: str | A2, A3 |
--chat-template | None | Type: str | A2, A3 |
--hf-chat-template-name | None | Type: str | A2, A3 |
--completion-template | None | Type: str | A2, A3 |
--file-storage-path | sglang_storage | Type: str | Unused reserved parameter |
--enable-cache-report | False | bool flag (set to enable) | A2, A3 |
--reasoning-parser | None | deepseek-r1, deepseek-v3, glm45, gpt-oss, kimi, qwen3, qwen3-thinking, step3 | A2, A3 |
--tool-call-parser | None | llama3, pythonic, qwen, qwen3_coder | A2, A3 |
--sampling-defaults | model | openai, model | A2, A3 |
Data parallelism
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--data-parallel-size, --dp-size | 1 | Type: int | A2, A3 |
--load-balance-method | auto | auto,round_robin,follow_bootstrap_room,total_requests,total_tokens | A2, A3 |
Multi-node distributed serving
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--dist-init-addr, --nccl-init-addr | None | Type: str | A2, A3 |
--nnodes | 1 | Type: int | A2, A3 |
--node-rank | 0 | Type: int | A2, A3 |
Model override args
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--json-model-override-args | {} | Type: str | A2, A3 |
--preferred-sampling-params | None | Type: str | A2, A3 |
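
`--json-model-override-args` is typed as a string but defaults to `{}`, which suggests the value is a JSON object passed as a single argument. The sketch below shows one way to serialize such an object and quote it for a POSIX shell. The override key is a placeholder, not a documented setting.

```python
import json
import shlex

# Hypothetical override; the key is a placeholder, not a documented setting.
overrides = {"example_config_key": 123}

# Serialize to a compact JSON string, then quote it for a POSIX shell so
# the braces and double quotes survive as one argument.
json_arg = json.dumps(overrides, separators=(",", ":"))
fragment = f"--json-model-override-args {shlex.quote(json_arg)}"
print(fragment)
```

Generating the string programmatically avoids hand-escaping nested quotes, which is a common source of malformed JSON on the command line.
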
LoRA
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--enable-lora | False | bool flag (set to enable) | A2, A3 |
--enable-lora-overlap-loading | False | bool flag (set to enable) | A2, A3 |
--max-lora-rank | None | Type: int | A2, A3 |
--lora-target-modules | None | all | A2, A3 |
--lora-paths | None | Type: List[str] / JSON objects | A2, A3 |
--max-loras-per-batch | 8 | Type: int | A2, A3 |
--max-loaded-loras | None | Type: int | A2, A3 |
--lora-eviction-policy | lru | lru,fifo | A2, A3 |
--lora-backend | csgmv | triton,csgmv,ascend,torch_native | A2, A3 |
--experts-shared-outer-loras | None | Type: bool | A2, A3 |
--lora-use-virtual-experts | False | bool flag (set to enable) | A2, A3 |
--lora-strict-loading | False | Type: bool | A2, A3 |
--max-lora-chunk-size | 16 | 16, 32, 64, 128 | Special for GPU |
Kernel Backends (Attention, Sampling, Grammar, GEMM)
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--attention-backend | None | ascend | A2, A3 |
--prefill-attention-backend | None | ascend | A2, A3 |
--decode-attention-backend | None | ascend | A2, A3 |
--sampling-backend | None | pytorch,ascend | A2, A3 |
--grammar-backend | None | xgrammar | A2, A3 |
--mm-attention-backend | None | ascend_attn | A2, A3 |
--dsa-prefill-backend | flashmla_sparse | flashmla_sparse,flashmla_decode,fa3,tilelang,aiter | Special for GPU |
--dsa-decode-backend | fa3 | flashmla_prefill,flashmla_kv,fa3,tilelang,aiter | Special for GPU |
--fp8-gemm-backend | auto | auto,deep_gemm,flashinfer_trtllm,flashinfer_cutlass,flashinfer_deepgemm,cutlass,triton,aiter | Special for GPU |
--disable-flashinfer-autotune | False | bool flag (set to enable) | Special for GPU |
Speculative decoding
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--speculative-algorithm | None | EAGLE3,NEXTN | A2, A3 |
--speculative-draft-model-path, --speculative-draft-model | None | Type: str | A2, A3 |
--speculative-draft-model-revision | None | Type: str,branch name,tag name,commit id | A2, A3 |
--speculative-draft-load-format | auto | auto,dummy | A2, A3 |
--speculative-num-steps | None | Type: int | A2, A3 |
--speculative-eagle-topk | None | Type: int | A2, A3 |
--speculative-num-draft-tokens | None | Type: int | A2, A3 |
--speculative-accept-threshold-single | 1.0 | Type: float | Special for GPU |
--speculative-accept-threshold-acc | 1.0 | Type: float | Special for GPU |
--speculative-token-map | None | Type: str | A2, A3 |
--speculative-attention-mode | prefill | prefill,decode | A2, A3 |
--speculative-moe-runner-backend | None | auto | A2, A3 |
--speculative-moe-a2a-backend | None | ascend_fuseep (the only supported value on Ascend NPU) | A2, A3 |
--speculative-draft-attention-backend | None | ascend | A2, A3 |
--speculative-draft-model-quantization | None | unquant (the only supported value for speculative decoding on Ascend NPU) | A2, A3 |
Ngram speculative decoding
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--speculative-ngram-min-match-window-size | 1 | Type: int | Experimental |
--speculative-ngram-max-match-window-size | 12 | Type: int | Experimental |
--speculative-ngram-min-bfs-breadth | 1 | Type: int | Experimental |
--speculative-ngram-max-bfs-breadth | 10 | Type: int | Experimental |
--speculative-ngram-match-type | BFS | BFS,PROB | Experimental. BFS uses recency-based expansion; PROB uses frequency-based expansion. |
--speculative-ngram-max-trie-depth | 18 | Type: int | Experimental |
--speculative-ngram-capacity | 10000000 | Type: int | Experimental |
--speculative-ngram-external-corpus-path | None | Type: str | Experimental |
--speculative-ngram-external-sam-budget | 0 | Type: int | Experimental |
--speculative-ngram-external-corpus-max-tokens | 10000000 | Type: int | Experimental |
Expert parallelism
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--expert-parallel-size, --ep-size, --ep | 1 | Type: int | A2, A3 |
--moe-a2a-backend | none | none,deepep,ascend_fuseep (incompatible with eplb) | A2, A3 |
--moe-runner-backend | auto | auto, triton | A2, A3 |
--flashinfer-mxfp4-moe-precision | default | default,bf16 | Special for GPU |
--enable-flashinfer-allreduce-fusion | False | bool flag (set to enable) | Special for GPU |
--deepep-mode | auto | normal, low_latency, auto | A2, A3 |
--deepep-config | None | Type: str | Special for GPU |
--ep-num-redundant-experts | 0 | Type: int | A2, A3 |
--ep-dispatch-algorithm | None | static,dynamic,fake | A2, A3 |
--init-expert-location | trivial | trivial,<path.pt>,<path.json>,<json_string> | A2, A3 |
--enable-eplb | False | bool flag (set to enable) | A2, A3 |
--eplb-algorithm | deepseek | auto,deepseek | A2, A3 |
--eplb-rebalance-num-iterations | 1000 | Type: int | A2, A3 |
--eplb-rebalance-layers-per-chunk | None | Type: int | A2, A3 |
--eplb-min-rebalancing-utilization-threshold | 1.0 | Type: float | A2, A3 |
--expert-distribution-recorder-mode | None | stat,stat_approx,per_pass,per_token | A2, A3 |
--expert-distribution-recorder-buffer-size | None | Type: int | A2, A3 |
--enable-expert-distribution-metrics | False | bool flag (set to enable) | A2, A3 |
--moe-dense-tp-size | None | 1 | A2, A3 |
--elastic-ep-backend | None | none, mooncake | Special for GPU |
--mooncake-ib-device | None | Type: str | Special for GPU |
Mamba Cache
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--max-mamba-cache-size | None | Type: int | A2, A3 |
--mamba-ssm-dtype | float32 | float32,bfloat16,float16 | A2, A3 |
--mamba-full-memory-ratio | 0.9 | Type: float | A2, A3 |
--mamba-scheduler-strategy | auto | auto,no_buffer,extra_buffer | A2, A3 |
--mamba-track-interval | 256 | Type: int | A2, A3 |
Hierarchical cache
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--enable-hierarchical-cache | False | bool flag (set to enable). Currently, mamba cache is not supported. | A2, A3 |
--hicache-ratio | 2.0 | Type: float | A2, A3 |
--hicache-size | 0 | Type: int | A2, A3 |
--hicache-write-policy | write_through | Currently only write_back supported | A2, A3 |
--hicache-io-backend | kernel | kernel_ascend,direct | A2, A3 |
--hicache-mem-layout | layer_first | page_first_direct,page_first_kv_split | A2, A3 |
--hicache-storage-backend | None | file | A2, A3 |
--hicache-storage-prefetch-policy | timeout | best_effort,wait_complete,timeout | Special for GPU |
--hicache-storage-backend-extra-config | None | Type: str | Special for GPU |
LMCache
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--enable-lmcache | False | bool flag (set to enable) | Special for GPU |
--lmcache-config-file | None | Type: str | Special for GPU |
Diffusion LLM
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--dllm-algorithm | None | Type: str | A2, A3 |
--dllm-algorithm-config | None | Type: str | A2, A3 |
Offloading (must be used with --disable-cuda-graph)
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--cpu-offload-gb | 0 | Type: int | A2, A3 |
--offload-group-size | -1 | Type: int (DeepSeek only) | A2, A3 |
--offload-num-in-group | 1 | Type: int (DeepSeek only) | A2, A3 |
--offload-prefetch-step | 1 | Type: int (DeepSeek only) | A2, A3 |
--offload-mode | cpu | cpu (DeepSeek only), meta (DeepSeek only), sharded_gpu (DeepSeek only; only supports tp=1, dp>1) | A2, A3 |
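
The section heading states that the offloading arguments must be used together with `--disable-cuda-graph`. A minimal pre-launch check for that rule might look like the sketch below. `check_offload_args` is a hypothetical helper, and it flags any offload argument that is present without checking whether its value differs from the default.

```python
# Argument names copied from the offloading table above.
OFFLOAD_FLAGS = {
    "--cpu-offload-gb",
    "--offload-group-size",
    "--offload-num-in-group",
    "--offload-prefetch-step",
    "--offload-mode",
}


def check_offload_args(argv):
    """Return a list of problems found in an argv-style list."""
    # Accept both "--flag value" and "--flag=value" spellings.
    names = [arg.split("=", 1)[0] for arg in argv]
    used = [name for name in names if name in OFFLOAD_FLAGS]
    problems = []
    if used and "--disable-cuda-graph" not in names:
        problems.append(
            f"{', '.join(used)} used without --disable-cuda-graph"
        )
    return problems


ok = check_offload_args(["--cpu-offload-gb", "8", "--disable-cuda-graph"])
bad = check_offload_args(["--offload-mode", "cpu"])
print(ok)
print(bad)
```
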
Optimization/debug options
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--disable-radix-cache | False | bool flag (set to enable) | A2, A3 |
--cuda-graph-max-bs | None | Type: int | A2, A3 |
--cuda-graph-bs | None | List[int] | A2, A3 |
--disable-cuda-graph | False | bool flag (set to enable) | A2, A3 |
--disable-cuda-graph-padding | False | bool flag (set to enable) | A2, A3 |
--enable-profile-cuda-graph | False | bool flag (set to enable) | A2, A3 |
--enable-cudagraph-gc | False | bool flag (set to enable) | A2, A3 |
--enable-nccl-nvls | False | bool flag (set to enable) | Special for GPU |
--enable-symm-mem | False | bool flag (set to enable) | Special for GPU |
--disable-flashinfer-cutlass-moe-fp4-allgather | False | bool flag (set to enable) | Special for GPU |
--enable-tokenizer-batch-encode | False | bool flag (set to enable) | A2, A3 |
--disable-tokenizer-batch-decode | False | bool flag (set to enable) | A2, A3 |
--disable-custom-all-reduce | False | bool flag (set to enable) | Special for GPU |
--enable-mscclpp | False | bool flag (set to enable) | Special for GPU |
--enable-torch-symm-mem | False | bool flag (set to enable) | Special for GPU |
--disable-overlap-schedule | False | bool flag (set to enable) | A2, A3 |
--enable-mixed-chunk | False | bool flag (set to enable) | A2, A3 |
--enable-dp-attention | False | bool flag (set to enable) | A2, A3 |
--enable-dp-attention-local-control-broadcast | False | bool flag (set to enable) | A2, A3 |
--enable-dp-lm-head | False | bool flag (set to enable) | A2, A3 |
--enable-two-batch-overlap | False | bool flag (set to enable) | Planned |
--enable-single-batch-overlap | False | bool flag (set to enable) | A2, A3 |
--tbo-token-distribution-threshold | 0.48 | Type: float | Planned |
--enable-torch-compile | False | bool flag (set to enable) | A2, A3 |
--enable-torch-compile-debug-mode | False | bool flag (set to enable) | A2, A3 |
--enforce-piecewise-cuda-graph | False | bool flag (set to enable); Currently, Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct models are supported. | A2, A3 |
--piecewise-cuda-graph-tokens | None | Type: JSON list | A2, A3 |
--piecewise-cuda-graph-compiler | eager | eager | A2, A3 |
--torch-compile-max-bs | 32 | Type: int | A2, A3 |
--piecewise-cuda-graph-max-tokens | None | Type: int | A2, A3 |
--torchao-config | "" | Type: str | Special for GPU |
--enable-nan-detection | False | bool flag (set to enable) | A2, A3 |
--enable-p2p-check | False | bool flag (set to enable) | Special for GPU |
--triton-attention-reduce-in-fp32 | False | bool flag (set to enable) | Special for GPU |
--triton-attention-num-kv-splits | 8 | Type: int | Special for GPU |
--triton-attention-split-tile-size | None | Type: int | Special for GPU |
--delete-ckpt-after-loading | False | bool flag (set to enable) | A2, A3 |
--enable-memory-saver | False | bool flag (set to enable) | A2, A3 |
--enable-weights-cpu-backup | False | bool flag (set to enable) | A2, A3 |
--enable-draft-weights-cpu-backup | False | bool flag (set to enable) | A2, A3 |
--allow-auto-truncate | False | bool flag (set to enable) | A2, A3 |
--enable-custom-logit-processor | False | bool flag (set to enable) | A2, A3 |
--flashinfer-mla-disable-ragged | False | bool flag (set to enable) | Special for GPU |
--disable-shared-experts-fusion | True | bool flag (set to enable) | A2, A3 |
--enforce-shared-experts-fusion | False | bool flag (set to enable) | A2, A3 |
--disable-chunked-prefix-cache | True | bool flag (set to enable) | A2, A3 |
--disable-fast-image-processor | False | bool flag (set to enable) | A2, A3 |
--keep-mm-feature-on-device | False | bool flag (set to enable) | A2, A3 |
--enable-return-hidden-states | False | bool flag (set to enable) | A2, A3 |
--enable-return-routed-experts | False | bool flag (set to enable) | A2, A3 |
--scheduler-recv-interval | 1 | Type: int | A2, A3 |
--numa-node | None | List[int] | A2, A3 |
--enable-deterministic-inference | False | bool flag (set to enable) | Planned |
--rl-on-policy-target | None | fsdp | Planned |
--enable-layerwise-nvtx-marker | False | bool flag (set to enable) | Special for GPU |
--enable-attn-tp-input-scattered | False | bool flag (set to enable) | Experimental |
--enable-dsa-prefill-context-parallel | False | bool flag (set to enable) | A2, A3 |
--enable-prefill-context-parallel | False | bool flag (set to enable) | A2, A3 |
--prefill-cp-mode | in-seq-split | Type: str | A2, A3 |
--enable-fused-qk-norm-rope | False | bool flag (set to enable) | Special for GPU |
--enable-precise-embedding-interpolation | False | bool flag (set to enable) | A2, A3 |
--gc-threshold | None | List[int] | A2, A3 |
Dynamic batch tokenizer
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--enable-dynamic-batch-tokenizer | False | bool flag (set to enable) | A2, A3 |
--dynamic-batch-tokenizer-batch-size | 32 | Type: int | A2, A3 |
--dynamic-batch-tokenizer-batch-timeout | 0.002 | Type: float | A2, A3 |
Debug tensor dumps
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--debug-tensor-dump-output-folder | None | Type: str | A2, A3 |
--debug-tensor-dump-layers | None | List[int] | A2, A3 |
--debug-tensor-dump-input-file | None | Type: str | A2, A3 |
PD disaggregation
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--disaggregation-mode | null | null,prefill,decode | A2, A3 |
--disaggregation-transfer-backend | mooncake | ascend | A2, A3 |
--disaggregation-bootstrap-port | 8998 | Type: int | A2, A3 |
--disaggregation-ib-device | None | Type: str | Special for GPU |
--disaggregation-decode-enable-offload-kvcache | False | False | A2, A3 |
--num-reserved-decode-tokens | 512 | Type: int | A2, A3 |
--disaggregation-decode-polling-interval | 1 | Type: int | A2, A3 |
Encode prefill disaggregation
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--enable-adaptive-dispatch-to-encoder | False | bool flag (set to enable adaptive dispatch) | A2, A3 |
--encoder-only | False | bool flag (set to launch an encoder-only server) | A2, A3 |
--language-only | False | bool flag (set to load weights for the language model only) | A2, A3 |
--encoder-transfer-backend | zmq_to_scheduler | zmq_to_scheduler, zmq_to_tokenizer, mooncake | A2, A3 |
--encoder-urls | [] | List[str] (List of encoder server urls) | A2, A3 |
Custom weight loader
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--custom-weight-loader | None | List[str] | A2, A3 |
--weight-loader-disable-mmap | False | bool flag (set to enable) | A2, A3 |
--weight-loader-prefetch-checkpoints | False | bool flag (set to enable) | A2, A3 |
--weight-loader-prefetch-num-threads | 4 | Type: int | A2, A3 |
--remote-instance-weight-loader-seed-instance-ip | None | Type: str | Special for GPU |
--remote-instance-weight-loader-seed-instance-service-port | None | Type: int | Special for GPU |
--remote-instance-weight-loader-send-weights-group-ports | None | Type: JSON list | Special for GPU |
--remote-instance-weight-loader-backend | nccl | transfer_engine, nccl | Special for GPU |
--remote-instance-weight-loader-start-seed-via-transfer-engine | False | bool flag (set to enable) | Special for GPU |
For PD-Multiplexing
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--enable-pdmux | False | bool flag (set to enable) | Special for GPU |
--pdmux-config-path | None | Type: str | Special for GPU |
--sm-group-num | 8 | Type: int | Special for GPU |
For Multi-Modal
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--enable-broadcast-mm-inputs-process | False | bool flag (set to enable) | A2, A3 |
--mm-process-config | None | Type: JSON / Dict | A2, A3 |
--mm-enable-dp-encoder | False | bool flag (set to enable) | A2, A3 |
--limit-mm-data-per-request | None | Type: JSON / Dict | A2, A3 |
For checkpoint decryption
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--decrypted-config-file | None | Type: str | A2, A3 |
--decrypted-draft-config-file | None | Type: str | A2, A3 |
--enable-prefix-mm-cache | False | bool flag (set to enable) | A2, A3 |
Forward hooks
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--forward-hooks | None | Type: JSON list | A2, A3 |
Configuration file support
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
--config | None | Type: str | A2, A3 |
Other Params
The following parameters are not supported because the third-party components they depend on, such as Ktransformer and checkpoint-engine, are not compatible with the NPU.
| Argument | Defaults | Options |
|---|---|---|
--checkpoint-engine-wait-weights-before-ready | False | bool flag (set to enable) |
--kt-weight-path | None | Type: str |
--kt-method | AMXINT4 | Type: str |
--kt-cpuinfer | None | Type: int |
--kt-threadpool-count | 2 | Type: int |
--kt-num-gpu-experts | None | Type: int |
--kt-max-deferred-experts-per-token | None | Type: int |
| Argument | Defaults | Options |
|---|---|---|
--tool-server | None | Type: str |
