This section describes the basic functions and features supported on Ascend NPUs. If you encounter issues or have questions, please open an issue. For the meaning and usage of each parameter, see Server Arguments.

Model and tokenizer

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --model-path / --model | None | Type: str | A2, A3 |
| --tokenizer-path | None | Type: str | A2, A3 |
| --tokenizer-mode | auto | auto, slow | A2, A3 |
| --tokenizer-worker-num | 1 | Type: int | A2, A3 |
| --skip-tokenizer-init | False | bool flag (set to enable) | A2, A3 |
| --load-format | auto | auto, safetensors | A2, A3 |
| --model-loader-extra-config | | Type: str | A2, A3 |
| --trust-remote-code | False | bool flag (set to enable) | A2, A3 |
| --context-length | None | Type: int | A2, A3 |
| --is-embedding | False | bool flag (set to enable) | A2, A3 |
| --enable-multimodal | None | bool flag (set to enable) | A2, A3 |
| --revision | None | Type: str | A2, A3 |
| --model-impl | auto | auto, sglang, transformers | A2, A3 |
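A minimal launch sketch using the flags above. The checkpoint path is a placeholder; substitute your own model directory.

```shell
# Launch an SGLang server on an Ascend NPU with explicit model/tokenizer options.
# /path/to/Qwen2.5-7B-Instruct is a placeholder checkpoint directory.
python -m sglang.launch_server \
  --model-path /path/to/Qwen2.5-7B-Instruct \
  --tokenizer-mode auto \
  --trust-remote-code \
  --context-length 8192
```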

HTTP server

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --host | 127.0.0.1 | Type: str | A2, A3 |
| --port | 30000 | Type: int | A2, A3 |
| --skip-server-warmup | False | bool flag (set to enable) | A2, A3 |
| --warmups | None | Type: str | A2, A3 |
| --nccl-port | None | Type: int | A2, A3 |
| --fastapi-root-path | None | Type: str | A2, A3 |
| --grpc-mode | False | False | Planned |

Quantization and data type

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --dtype | auto | auto, float16, bfloat16 | A2, A3 |
| --quantization | None | modelslim | A2, A3 |
| --quantization-param-path | None | Type: str | Special for GPU |
| --kv-cache-dtype | auto | auto | A2, A3 |
| --enable-fp32-lm-head | False | bool flag (set to enable) | A2, A3 |
| --modelopt-quant | None | Type: str | Special for GPU |
| --modelopt-checkpoint-restore-path | None | Type: str | Special for GPU |
| --modelopt-checkpoint-save-path | None | Type: str | Special for GPU |
| --modelopt-export-path | None | Type: str | Special for GPU |
| --quantize-and-serve | False | bool flag (set to enable) | Special for GPU |
| --rl-quant-profile | None | Type: str | Special for GPU |

Memory and scheduling

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --mem-fraction-static | None | Type: float | A2, A3 |
| --max-running-requests | None | Type: int | A2, A3 |
| --prefill-max-requests | None | Type: int | A2, A3 |
| --max-queued-requests | None | Type: int | A2, A3 |
| --max-total-tokens | None | Type: int | A2, A3 |
| --chunked-prefill-size | None | Type: int | A2, A3 |
| --max-prefill-tokens | 16384 | Type: int | A2, A3 |
| --schedule-policy | fcfs | lpm, fcfs | A2, A3 |
| --enable-priority-scheduling | False | bool flag (set to enable) | A2, A3 |
| --schedule-low-priority-values-first | False | bool flag (set to enable) | A2, A3 |
| --priority-scheduling-preemption-threshold | 10 | Type: int | A2, A3 |
| --schedule-conservativeness | 1.0 | Type: float | A2, A3 |
| --page-size | 128 | Type: int | A2, A3 |
| --swa-full-tokens-ratio | 0.8 | Type: float | Planned |
| --disable-hybrid-swa-memory | False | bool flag (set to enable) | Planned |
| --radix-eviction-policy | lru | lru, lfu | A2, A3 |
| --enable-prefill-delayer | False | bool flag (set to enable) | A2, A3 |
| --prefill-delayer-max-delay-passes | 30 | Type: int | A2, A3 |
| --prefill-delayer-token-usage-low-watermark | None | Type: float | A2, A3 |
| --prefill-delayer-forward-passes-buckets | None | List[float] | A2, A3 |
| --prefill-delayer-wait-seconds-buckets | None | List[float] | A2, A3 |
| --abort-on-priority-when-disabled | False | bool flag (set to enable) | A2, A3 |
| --enable-dynamic-chunking | False | bool flag (set to enable) | Experimental |
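A sketch combining the memory and scheduling knobs above; the values are illustrative, not tuned recommendations, and the model path is a placeholder.

```shell
# Cap KV-cache memory and request concurrency, and enable chunked prefill.
python -m sglang.launch_server \
  --model-path /path/to/model \
  --mem-fraction-static 0.8 \
  --max-running-requests 128 \
  --chunked-prefill-size 8192 \
  --schedule-policy lpm \
  --page-size 128
```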

Runtime options

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --device | None | Type: str | A2, A3 |
| --tensor-parallel-size / --tp-size | 1 | Type: int | A2, A3 |
| --pipeline-parallel-size / --pp-size | 1 | Type: int; currently 2 is not supported | Experimental |
| --attention-context-parallel-size / --attn-cp-size | 1 | Type: int; must equal --tp-size | A2, A3 |
| --moe-data-parallel-size / --moe-dp-size | 1 | Type: int | Planned |
| --pp-max-micro-batch-size | None | Type: int | Experimental |
| --pp-async-batch-depth | None | Type: int | Experimental |
| --stream-interval | 1 | Type: int | A2, A3 |
| --incremental-streaming-output | False | bool flag (set to enable) | A2, A3 |
| --random-seed | None | Type: int | A2, A3 |
| --constrained-json-whitespace-pattern | None | Type: str | A2, A3 |
| --constrained-json-disable-any-whitespace | False | bool flag (set to enable) | A2, A3 |
| --watchdog-timeout | 300 | Type: float | A2, A3 |
| --soft-watchdog-timeout | 300 | Type: float | A2, A3 |
| --dist-timeout | None | Type: int | A2, A3 |
| --download-dir | None | Type: str | A2, A3 |
| --model-checksum | None | Type: str | Planned |
| --base-gpu-id | 0 | Type: int | A2, A3 |
| --gpu-id-step | 1 | Type: int | A2, A3 |
| --sleep-on-idle | False | bool flag (set to enable) | A2, A3 |
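A single-node tensor-parallel launch sketch using the runtime options above; the NPU count and timeout value are illustrative.

```shell
# Shard the model across 4 devices with tensor parallelism.
python -m sglang.launch_server \
  --model-path /path/to/model \
  --tp-size 4 \
  --stream-interval 1 \
  --watchdog-timeout 600
```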

Logging

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --log-level | info | Type: str | A2, A3 |
| --log-level-http | None | Type: str | A2, A3 |
| --log-requests | False | bool flag (set to enable) | A2, A3 |
| --log-requests-level | 2 | 0, 1, 2, 3 | A2, A3 |
| --log-requests-format | text | text, json | A2, A3 |
| --crash-dump-folder | None | Type: str | A2, A3 |
| --enable-metrics | False | bool flag (set to enable) | A2, A3 |
| --enable-metrics-for-all-schedulers | False | bool flag (set to enable) | A2, A3 |
| --tokenizer-metrics-custom-labels-header | x-custom-labels | Type: str | A2, A3 |
| --tokenizer-metrics-allowed-custom-labels | None | List[str] | A2, A3 |
| --bucket-time-to-first-token | None | List[float] | A2, A3 |
| --bucket-inter-token-latency | None | List[float] | A2, A3 |
| --bucket-e2e-request-latency | None | List[float] | A2, A3 |
| --collect-tokens-histogram | False | bool flag (set to enable) | A2, A3 |
| --prompt-tokens-buckets | None | List[str] | A2, A3 |
| --generation-tokens-buckets | None | List[str] | A2, A3 |
| --gc-warning-threshold-secs | 0.0 | Type: float | A2, A3 |
| --decode-log-interval | 40 | Type: int | A2, A3 |
| --enable-request-time-stats-logging | False | bool flag (set to enable) | A2, A3 |
| --kv-events-config | None | Type: str | Special for GPU |
| --enable-trace | False | bool flag (set to enable) | A2, A3 |
| --oltp-traces-endpoint | localhost:4317 | Type: str | A2, A3 |
| --log-requests-target | None | Type: str | A2, A3 |
| --uvicorn-access-log-exclude-prefixes | [] | List[str] | A2, A3 |
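A sketch enabling metrics and request logging with the flags above; the model path is a placeholder.

```shell
# Expose metrics and log request details in JSON format.
python -m sglang.launch_server \
  --model-path /path/to/model \
  --enable-metrics \
  --log-requests \
  --log-requests-level 2 \
  --log-requests-format json
```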

RequestMetricsExporter configuration

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --export-metrics-to-file | False | bool flag (set to enable) | A2, A3 |
| --export-metrics-to-file-dir | None | Type: str | A2, A3 |

API configuration

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --api-key | None | Type: str | A2, A3 |
| --admin-api-key | None | Type: str | A2, A3 |
| --served-model-name | None | Type: str | A2, A3 |
| --weight-version | default | Type: str | A2, A3 |
| --chat-template | None | Type: str | A2, A3 |
| --hf-chat-template-name | None | Type: str | A2, A3 |
| --completion-template | None | Type: str | A2, A3 |
| --enable-cache-report | False | bool flag (set to enable) | A2, A3 |
| --reasoning-parser | None | deepseek-r1, deepseek-v3, glm45, gpt-oss, kimi, qwen3, qwen3-thinking, step3 | A2, A3 |
| --tool-call-parser | None | llama3, pythonic, qwen, qwen3_coder | A2, A3 |
| --sampling-defaults | model | openai, model | A2, A3 |

Data parallelism

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --data-parallel-size / --dp-size | 1 | Type: int | A2, A3 |
| --load-balance-method | auto | auto, round_robin, follow_bootstrap_room, total_requests, total_tokens | A2, A3 |

Multi-node distributed serving

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --dist-init-addr / --nccl-init-addr | None | Type: str | A2, A3 |
| --nnodes | 1 | Type: int | A2, A3 |
| --node-rank | 0 | Type: int | A2, A3 |
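A two-node launch sketch: run one command per node with the same `--dist-init-addr`, differing only in `--node-rank`. The IP, port, model path, and parallel sizes are placeholders.

```shell
# Node 0 (192.0.2.10 is a placeholder for node 0's reachable IP):
python -m sglang.launch_server --model-path /path/to/model \
  --tp-size 16 --nnodes 2 --node-rank 0 --dist-init-addr 192.0.2.10:50000

# Node 1:
python -m sglang.launch_server --model-path /path/to/model \
  --tp-size 16 --nnodes 2 --node-rank 1 --dist-init-addr 192.0.2.10:50000
```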

Model override args

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --json-model-override-args | {} | Type: str | A2, A3 |
| --preferred-sampling-params | None | Type: str | A2, A3 |

LoRA

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-lora | False | bool flag (set to enable) | A2, A3 |
| --enable-lora-overlap-loading | False | bool flag (set to enable) | A2, A3 |
| --max-lora-rank | None | Type: int | A2, A3 |
| --lora-target-modules | None | all | A2, A3 |
| --lora-paths | None | Type: List[str] / JSON objects | A2, A3 |
| --max-loras-per-batch | 8 | Type: int | A2, A3 |
| --max-loaded-loras | None | Type: int | A2, A3 |
| --lora-eviction-policy | lru | lru, fifo | A2, A3 |
| --lora-backend | csgmv | triton, csgmv, ascend, torch_native | A2, A3 |
| --max-lora-chunk-size | 16 | 16, 32, 64, 128 | Special for GPU |
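A sketch serving a base model with one LoRA adapter via the flags above; the `lora0=/path/to/adapter` name=path pair and all paths are placeholders.

```shell
# Serve a base model plus a LoRA adapter using the Ascend LoRA backend.
python -m sglang.launch_server \
  --model-path /path/to/base-model \
  --enable-lora \
  --lora-paths lora0=/path/to/adapter \
  --max-loras-per-batch 4 \
  --lora-backend ascend
```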

Kernel Backends (Attention, Sampling, Grammar, GEMM)

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --attention-backend | None | ascend | A2, A3 |
| --prefill-attention-backend | None | ascend | A2, A3 |
| --decode-attention-backend | None | ascend | A2, A3 |
| --sampling-backend | None | pytorch, ascend | A2, A3 |
| --grammar-backend | None | xgrammar | A2, A3 |
| --mm-attention-backend | None | ascend_attn | A2, A3 |
| --nsa-prefill-backend | flashmla_sparse | flashmla_sparse, flashmla_decode, fa3, tilelang, aiter | Special for GPU |
| --nsa-decode-backend | fa3 | flashmla_prefill, flashmla_kv, fa3, tilelang, aiter | Special for GPU |
| --fp8-gemm-backend | auto | auto, deep_gemm, flashinfer_trtllm, flashinfer_cutlass, flashinfer_deepgemm, cutlass, triton, aiter | Special for GPU |
| --disable-flashinfer-autotune | False | bool flag (set to enable) | Special for GPU |

Speculative decoding

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --speculative-algorithm | None | EAGLE3, NEXTN | A2, A3 |
| --speculative-draft-model-path / --speculative-draft-model | None | Type: str | A2, A3 |
| --speculative-draft-model-revision | None | Type: str (branch name, tag name, or commit id) | A2, A3 |
| --speculative-draft-load-format | auto | auto, dummy | A2, A3 |
| --speculative-num-steps | None | Type: int | A2, A3 |
| --speculative-eagle-topk | None | Type: int | A2, A3 |
| --speculative-num-draft-tokens | None | Type: int | A2, A3 |
| --speculative-accept-threshold-single | 1.0 | Type: float | Special for GPU |
| --speculative-accept-threshold-acc | 1.0 | Type: float | Special for GPU |
| --speculative-token-map | None | Type: str | A2, A3 |
| --speculative-attention-mode | prefill | prefill, decode | A2, A3 |
| --speculative-moe-runner-backend | None | auto | A2, A3 |
| --speculative-moe-a2a-backend | None | ascend_fuseep | A2, A3 |
| --speculative-draft-attention-backend | None | ascend | A2, A3 |
| --speculative-draft-model-quantization | None | unquant | A2, A3 |
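A sketch enabling EAGLE3 speculative decoding with the flags above; the draft-model path and step/token counts are illustrative placeholders, not tuned values.

```shell
# EAGLE3 speculative decoding with a separate draft model.
python -m sglang.launch_server \
  --model-path /path/to/target-model \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path /path/to/eagle3-draft \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4
```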

Ngram speculative decoding

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --speculative-ngram-min-match-window-size | 1 | Type: int | Experimental |
| --speculative-ngram-max-match-window-size | 12 | Type: int | Experimental |
| --speculative-ngram-min-bfs-breadth | 1 | Type: int | Experimental |
| --speculative-ngram-max-bfs-breadth | 10 | Type: int | Experimental |
| --speculative-ngram-match-type | BFS | BFS, PROB (BFS uses recency-based expansion; PROB uses frequency-based expansion) | Experimental |
| --speculative-ngram-max-trie-depth | 18 | Type: int | Experimental |
| --speculative-ngram-capacity | 10000000 | Type: int | Experimental |

Expert parallelism

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --expert-parallel-size / --ep-size / --ep | 1 | Type: int | A2, A3 |
| --moe-a2a-backend | none | none, deepep, ascend_fuseep (incompatible with EPLB) | A2, A3 |
| --moe-runner-backend | auto | auto, triton | A2, A3 |
| --flashinfer-mxfp4-moe-precision | default | default, bf16 | Special for GPU |
| --enable-flashinfer-allreduce-fusion | False | bool flag (set to enable) | Special for GPU |
| --deepep-mode | auto | normal, low_latency, auto | A2, A3 |
| --deepep-config | None | Type: str | Special for GPU |
| --ep-num-redundant-experts | 0 | Type: int | A2, A3 |
| --ep-dispatch-algorithm | None | static, dynamic, fake | A2, A3 |
| --init-expert-location | trivial | trivial, <path.pt>, <path.json>, <json_string> | A2, A3 |
| --enable-eplb | False | bool flag (set to enable) | A2, A3 |
| --eplb-algorithm | deepseek | auto, deepseek | A2, A3 |
| --eplb-rebalance-num-iterations | 1000 | Type: int | A2, A3 |
| --eplb-rebalance-layers-per-chunk | None | Type: int | A2, A3 |
| --eplb-min-rebalancing-utilization-threshold | 1.0 | Type: float | A2, A3 |
| --expert-distribution-recorder-mode | None | stat, stat_approx, per_pass, per_token | A2, A3 |
| --expert-distribution-recorder-buffer-size | None | Type: int | A2, A3 |
| --enable-expert-distribution-metrics | False | bool flag (set to enable) | A2, A3 |
| --moe-dense-tp-size | None | 1 | A2, A3 |
| --elastic-ep-backend | None | none, mooncake | Special for GPU |
| --mooncake-ib-device | None | Type: str | Special for GPU |
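A sketch for serving an MoE model with expert parallelism via the flags above; the parallel sizes and model path are illustrative placeholders.

```shell
# MoE serving with expert parallelism and the DeepEP all-to-all backend.
python -m sglang.launch_server \
  --model-path /path/to/moe-model \
  --tp-size 16 \
  --ep-size 16 \
  --moe-a2a-backend deepep \
  --deepep-mode auto
```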

Mamba Cache

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --max-mamba-cache-size | None | Type: int | A2, A3 |
| --mamba-ssm-dtype | float32 | float32, bfloat16, float16 | A2, A3 |
| --mamba-full-memory-ratio | 0.9 | Type: float | A2, A3 |
| --mamba-scheduler-strategy | auto | auto, no_buffer, extra_buffer | A2, A3 |
| --mamba-track-interval | 256 | Type: int | A2, A3 |

Hierarchical cache

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-hierarchical-cache | False | bool flag (set to enable); mamba cache is currently not supported | A2, A3 |
| --hicache-ratio | 2.0 | Type: float | A2, A3 |
| --hicache-size | 0 | Type: int | A2, A3 |
| --hicache-write-policy | write_through | Currently only write_back supported | A2, A3 |
| --hicache-io-backend | kernel | kernel_ascend, direct | A2, A3 |
| --hicache-mem-layout | layer_first | page_first_direct, page_first_kv_split | A2, A3 |
| --hicache-storage-backend | None | file | A2, A3 |
| --hicache-storage-prefetch-policy | best_effort | best_effort, wait_complete, timeout | Special for GPU |
| --hicache-storage-backend-extra-config | None | Type: str | Special for GPU |

LMCache

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-lmcache | False | bool flag (set to enable) | Special for GPU |

Offloading (must be used with --disable-cuda-graph)

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --cpu-offload-gb | 0 | Type: int | A2, A3 |
| --offload-group-size | -1 | Type: int (DeepSeek only) | A2, A3 |
| --offload-num-in-group | 1 | Type: int (DeepSeek only) | A2, A3 |
| --offload-prefetch-step | 1 | Type: int (DeepSeek only) | A2, A3 |
| --offload-mode | cpu | cpu, meta, sharded_gpu (all DeepSeek only) | A2, A3 |

Args for multi-item scoring

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --multi-item-scoring-delimiter | None | Type: int | A2, A3 |

Optimization/debug options

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --disable-radix-cache | False | bool flag (set to enable) | A2, A3 |
| --cuda-graph-max-bs | None | Type: int | A2, A3 |
| --cuda-graph-bs | None | List[int] | A2, A3 |
| --disable-cuda-graph | False | bool flag (set to enable) | A2, A3 |
| --disable-cuda-graph-padding | False | bool flag (set to enable) | A2, A3 |
| --enable-profile-cuda-graph | False | bool flag (set to enable) | A2, A3 |
| --enable-cudagraph-gc | False | bool flag (set to enable) | A2, A3 |
| --enable-nccl-nvls | False | bool flag (set to enable) | Special for GPU |
| --enable-symm-mem | False | bool flag (set to enable) | Special for GPU |
| --disable-flashinfer-cutlass-moe-fp4-allgather | False | bool flag (set to enable) | Special for GPU |
| --enable-tokenizer-batch-encode | False | bool flag (set to enable) | A2, A3 |
| --disable-tokenizer-batch-decode | False | bool flag (set to enable) | A2, A3 |
| --disable-custom-all-reduce | False | bool flag (set to enable) | Special for GPU |
| --enable-mscclpp | False | bool flag (set to enable) | Special for GPU |
| --enable-torch-symm-mem | False | bool flag (set to enable) | Special for GPU |
| --disable-overlap-schedule | False | bool flag (set to enable) | A2, A3 |
| --enable-mixed-chunk | False | bool flag (set to enable) | A2, A3 |
| --enable-dp-attention | False | bool flag (set to enable) | A2, A3 |
| --enable-dp-lm-head | False | bool flag (set to enable) | A2, A3 |
| --enable-two-batch-overlap | False | bool flag (set to enable) | Planned |
| --enable-single-batch-overlap | False | bool flag (set to enable) | A2, A3 |
| --tbo-token-distribution-threshold | 0.48 | Type: float | Planned |
| --enable-torch-compile | False | bool flag (set to enable) | A2, A3 |
| --enable-torch-compile-debug-mode | False | bool flag (set to enable) | A2, A3 |
| --enforce-piecewise-cuda-graph | False | bool flag (set to enable); currently only Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct are supported | A2, A3 |
| --piecewise-cuda-graph-tokens | None | Type: JSON list | A2, A3 |
| --piecewise-cuda-graph-compiler | eager | eager | A2, A3 |
| --torch-compile-max-bs | 32 | Type: int | A2, A3 |
| --piecewise-cuda-graph-max-tokens | None | Type: int | A2, A3 |
| --torchao-config | | Type: str | Special for GPU |
| --enable-nan-detection | False | bool flag (set to enable) | A2, A3 |
| --enable-p2p-check | False | bool flag (set to enable) | Special for GPU |
| --triton-attention-reduce-in-fp32 | False | bool flag (set to enable) | Special for GPU |
| --triton-attention-num-kv-splits | 8 | Type: int | Special for GPU |
| --triton-attention-split-tile-size | None | Type: int | Special for GPU |
| --delete-ckpt-after-loading | False | bool flag (set to enable) | A2, A3 |
| --enable-memory-saver | False | bool flag (set to enable) | A2, A3 |
| --enable-weights-cpu-backup | False | bool flag (set to enable) | A2, A3 |
| --enable-draft-weights-cpu-backup | False | bool flag (set to enable) | A2, A3 |
| --allow-auto-truncate | False | bool flag (set to enable) | A2, A3 |
| --enable-custom-logit-processor | False | bool flag (set to enable) | A2, A3 |
| --flashinfer-mla-disable-ragged | False | bool flag (set to enable) | Special for GPU |
| --disable-shared-experts-fusion | True | bool flag (set to enable) | A2, A3 |
| --disable-chunked-prefix-cache | True | bool flag (set to enable) | A2, A3 |
| --disable-fast-image-processor | False | bool flag (set to enable) | A2, A3 |
| --keep-mm-feature-on-device | False | bool flag (set to enable) | A2, A3 |
| --enable-return-hidden-states | False | bool flag (set to enable) | A2, A3 |
| --enable-return-routed-experts | False | bool flag (set to enable) | A2, A3 |
| --scheduler-recv-interval | 1 | Type: int | A2, A3 |
| --numa-node | None | List[int] | A2, A3 |
| --enable-deterministic-inference | False | bool flag (set to enable) | Planned |
| --rl-on-policy-target | None | fsdp | Planned |
| --enable-layerwise-nvtx-marker | False | bool flag (set to enable) | Special for GPU |
| --enable-attn-tp-input-scattered | False | bool flag (set to enable) | Experimental |
| --enable-nsa-prefill-context-parallel | False | bool flag (set to enable) | A2, A3 |
| --enable-fused-qk-norm-rope | False | bool flag (set to enable) | Special for GPU |

Dynamic batch tokenizer

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-dynamic-batch-tokenizer | False | bool flag (set to enable) | A2, A3 |
| --dynamic-batch-tokenizer-batch-size | 32 | Type: int | A2, A3 |
| --dynamic-batch-tokenizer-batch-timeout | 0.002 | Type: float | A2, A3 |

Debug tensor dumps

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --debug-tensor-dump-output-folder | None | Type: str | A2, A3 |
| --debug-tensor-dump-layers | None | List[int] | A2, A3 |
| --debug-tensor-dump-input-file | None | Type: str | A2, A3 |

PD disaggregation

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --disaggregation-mode | null | null, prefill, decode | A2, A3 |
| --disaggregation-transfer-backend | mooncake | ascend | A2, A3 |
| --disaggregation-bootstrap-port | 8998 | Type: int | A2, A3 |
| --disaggregation-ib-device | None | Type: str | Special for GPU |
| --disaggregation-decode-enable-offload-kvcache | False | False | A2, A3 |
| --num-reserved-decode-tokens | 512 | Type: int | A2, A3 |
| --disaggregation-decode-polling-interval | 1 | Type: int | A2, A3 |
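A prefill/decode disaggregation sketch using the flags above: one instance per role, sharing the same bootstrap port. Paths and host placement are placeholders.

```shell
# Prefill instance:
python -m sglang.launch_server --model-path /path/to/model \
  --disaggregation-mode prefill --disaggregation-transfer-backend ascend \
  --disaggregation-bootstrap-port 8998

# Decode instance:
python -m sglang.launch_server --model-path /path/to/model \
  --disaggregation-mode decode --disaggregation-transfer-backend ascend \
  --disaggregation-bootstrap-port 8998
```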

Encode prefill disaggregation

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-adaptive-dispatch-to-encoder | False | bool flag (set to enable adaptive dispatch) | A2, A3 |
| --encoder-only | False | bool flag (set to launch an encoder-only server) | A2, A3 |
| --language-only | False | bool flag (set to load weights for the language model only) | A2, A3 |
| --encoder-transfer-backend | zmq_to_scheduler | zmq_to_scheduler, zmq_to_tokenizer, mooncake | A2, A3 |
| --encoder-urls | [] | List[str] (list of encoder server URLs) | A2, A3 |

Custom weight loader

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --custom-weight-loader | None | List[str] | A2, A3 |
| --weight-loader-disable-mmap | False | bool flag (set to enable) | A2, A3 |
| --remote-instance-weight-loader-seed-instance-ip | None | Type: str | A2, A3 |
| --remote-instance-weight-loader-seed-instance-service-port | None | Type: int | A2, A3 |
| --remote-instance-weight-loader-send-weights-group-ports | None | Type: JSON list | A2, A3 |
| --remote-instance-weight-loader-backend | nccl | transfer_engine, nccl | A2, A3 |
| --remote-instance-weight-loader-start-seed-via-transfer-engine | False | bool flag (set to enable) | Special for GPU |

For PD-Multiplexing

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-pdmux | False | bool flag (set to enable) | Special for GPU |
| --pdmux-config-path | None | Type: str | Special for GPU |
| --sm-group-num | 8 | Type: int | Special for GPU |

For Multi-Modal

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-broadcast-mm-inputs-process | False | bool flag (set to enable) | A2, A3 |
| --mm-process-config | None | Type: JSON / dict | A2, A3 |
| --mm-enable-dp-encoder | False | bool flag (set to enable) | A2, A3 |
| --limit-mm-data-per-request | None | Type: JSON / dict | A2, A3 |

For checkpoint decryption

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --decrypted-config-file | None | Type: str | A2, A3 |
| --decrypted-draft-config-file | None | Type: str | A2, A3 |
| --enable-prefix-mm-cache | False | bool flag (set to enable) | A2, A3 |

Forward hooks

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --forward-hooks | None | Type: JSON list | A2, A3 |

Configuration file support

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --config | None | Type: str | A2, A3 |

Other Params

The following parameters are not supported because the third-party components they depend on (e.g. KTransformers, checkpoint-engine) are not compatible with the NPU.

| Argument | Defaults | Options |
| --- | --- | --- |
| --checkpoint-engine-wait-weights-before-ready | False | bool flag (set to enable) |
| --kt-weight-path | None | Type: str |
| --kt-method | AMXINT4 | Type: str |
| --kt-cpuinfer | None | Type: int |
| --kt-threadpool-count | 2 | Type: int |
| --kt-num-gpu-experts | None | Type: int |
| --kt-max-deferred-experts-per-token | None | Type: int |

The following parameters have known functional deficiencies in the upstream community version:

| Argument | Defaults | Options |
| --- | --- | --- |
| --tool-server | None | Type: str |