Server Arguments#
This page provides a list of server arguments used in the command line to configure the behavior
and performance of the language model server during deployment. These arguments enable users to
customize key aspects of the server, including model selection, parallelism policies,
memory management, and optimization techniques.
You can find all arguments by running `python3 -m sglang.launch_server --help`.
Common launch commands#
To use a configuration file, create a YAML file with your server arguments and specify it with `--config`. CLI arguments will override config file values.

    # Create config.yaml
    cat > config.yaml << EOF
    model-path: meta-llama/Meta-Llama-3-8B-Instruct
    host: 0.0.0.0
    port: 30000
    tensor-parallel-size: 2
    enable-metrics: true
    log-requests: true
    EOF

    # Launch server with config file
    python -m sglang.launch_server --config config.yaml
To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.

    python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total. We recommend the SGLang Model Gateway (formerly the Router) for data parallelism.

    python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.

    python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
See the hyperparameter tuning guide for guidance on tuning hyperparameters for better performance.
For Docker and Kubernetes runs, you need to set up shared memory, which is used for communication between processes. See `--shm-size` for Docker and the `/dev/shm` size update for Kubernetes manifests.

If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.

    python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
To enable FP8 weight quantization, add `--quantization fp8` on an FP16 checkpoint, or directly load an FP8 checkpoint without specifying any arguments (a combined example is shown at the end of this section).

To enable FP8 KV cache quantization, add `--kv-cache-dtype fp8_e4m3` or `--kv-cache-dtype fp8_e5m2`.

To enable deterministic inference and batch-invariant operations, add `--enable-deterministic-inference`. More details can be found in the deterministic inference document.

If the model does not have a chat template in the Hugging Face tokenizer, you can specify a custom chat template. If the tokenizer has multiple named templates (e.g., 'default', 'tool_use'), you can select one using `--hf-chat-template-name tool_use`.

To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port; you can then use the following commands. If you meet a deadlock, please try to add `--disable-cuda-graph` (note: this feature is out of maintenance and might cause errors).

    # Node 0
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --tp 4 \
      --dist-init-addr sgl-dev-0:50000 \
      --nnodes 2 \
      --node-rank 0

    # Node 1
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --tp 4 \
      --dist-init-addr sgl-dev-0:50000 \
      --nnodes 2 \
      --node-rank 1

To enable `torch.compile` acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. By default, the cache path is located at `/tmp/torchinductor_root`; you can customize it using the environment variable `TORCHINDUCTOR_CACHE_DIR`. For more details, please refer to the PyTorch official documentation and Enabling cache for torch.compile.
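Putting the quantization flags above together, a single launch command could combine FP8 weight quantization with an FP8 KV cache. This is only a sketch that reuses the flags described in this section, with an illustrative model path:

    # FP8 weight quantization plus FP8 (e4m3) KV cache on an FP16 checkpoint
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --quantization fp8 \
      --kv-cache-dtype fp8_e4m3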
Please consult the documentation below and server_args.py to learn more about the arguments you may provide when launching a server.
Model and tokenizer#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
The path of the model weights. This can be a local folder or a Hugging Face repo ID. |
|
Type: str |
|
The path of the tokenizer. |
|
Type: str |
|
Tokenizer mode. ‘auto’ will use the fast tokenizer if available, and ‘slow’ will always use the slow tokenizer. |
|
|
|
The number of workers for the tokenizer manager. |
|
Type: int |
|
If set, skip initializing the tokenizer and pass input_ids directly in the generate request. |
|
bool flag (set to enable) |
|
The format of the model weights to load. “auto” will try to load the weights in the safetensors format and fall back to the pytorch bin format if safetensors format is not available. “pt” will load the weights in the pytorch bin format. “safetensors” will load the weights in the safetensors format. “npcache” will load the weights in pytorch format and store a numpy cache to speed up the loading. “dummy” will initialize the weights with random values, which is mainly for profiling. “gguf” will load the weights in the gguf format. “bitsandbytes” will load the weights using bitsandbytes quantization. “layered” loads weights layer by layer so that one can quantize a layer before loading another to make the peak memory envelope smaller. “flash_rl” will load the weights in flash_rl format. “fastsafetensors” and “private” are also supported. |
|
|
|
Extra config for model loader. This will be passed to the model loader corresponding to the chosen load_format. |
|
Type: str |
|
Whether or not to allow for custom models defined on the Hub in their own modeling files. |
|
bool flag (set to enable) |
|
The model’s maximum context length. Defaults to None (will use the value from the model’s config.json instead). |
|
Type: int |
|
Whether to use a CausalLM as an embedding model. |
|
bool flag (set to enable) |
|
Enable the multimodal functionality for the served model. If the model being served is not multimodal, nothing will happen |
|
bool flag (set to enable) |
|
The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version. |
|
Type: str |
|
Which implementation of the model to use. * “auto” will try to use the SGLang implementation if it exists and fall back to the Transformers implementation if no SGLang implementation is available. * “sglang” will use the SGLang model implementation. * “transformers” will use the Transformers model implementation. |
|
Type: str |
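As a rough illustration of these options, the sketch below overrides the tokenizer path and context length at launch. The flag spellings (--tokenizer-path, --context-length, --trust-remote-code) are assumptions here and should be verified with `python3 -m sglang.launch_server --help`.

    # Assumed flag names; verify with --help for your SGLang version
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --tokenizer-path meta-llama/Meta-Llama-3-8B-Instruct \
      --context-length 8192 \
      --trust-remote-code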
HTTP server#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
The host of the HTTP server. |
|
Type: str |
|
The port of the HTTP server. |
|
Type: int |
|
App is behind a path based routing proxy. |
|
Type: str |
|
If set, use gRPC server instead of HTTP server. |
|
bool flag (set to enable) |
|
If set, skip warmup. |
|
bool flag (set to enable) |
|
Specify custom warmup functions (csv) to run before the server starts, e.g., --warmups=warmup_name1,warmup_name2 will run those functions. |
|
Type: str |
|
The port for NCCL distributed environment setup. Defaults to a random port. |
|
Type: int |
|
If set, the server will wait for initial weights to be loaded via checkpoint-engine or other update methods before serving inference requests. |
|
bool flag (set to enable) |
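A minimal sketch of the HTTP server options: bind to all interfaces on a fixed port, then query the /get_model_info endpoint mentioned later on this page as a basic sanity check. The curl call is an assumption about your deployment, not a server argument.

    # Bind the HTTP server to all interfaces on port 30000
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --host 0.0.0.0 \
      --port 30000

    # Basic sanity check once the server is up
    curl http://localhost:30000/get_model_info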
Quantization and data type#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
Data type for model weights and activations. * “auto” will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. * “half” for FP16. Recommended for AWQ quantization. * “float16” is the same as “half”. * “bfloat16” for a balance between precision and range. * “float” is shorthand for FP32 precision. * “float32” for FP32 precision. |
|
|
|
The quantization method. |
|
|
|
Path to the JSON file containing the KV cache scaling factors. This should generally be supplied, when KV cache dtype is FP8. Otherwise, KV cache scaling factors default to 1.0, which may cause accuracy issues. |
|
Type: Optional[str] |
|
Data type for kv cache storage. “auto” will use model data type. “bf16” or “bfloat16” for BF16 KV cache. “fp8_e5m2” and “fp8_e4m3” are supported for CUDA 11.8+. “fp4_e2m1” (only mxfp4) is supported for CUDA 12.8+ and PyTorch 2.8.0+ |
|
|
|
If set, the LM head outputs (logits) are in FP32. |
|
bool flag (set to enable) |
|
The ModelOpt quantization configuration. Supported values: ‘fp8’, ‘int4_awq’, ‘w4a8_awq’, ‘nvfp4’, ‘nvfp4_awq’. This requires the NVIDIA Model Optimizer library to be installed: pip install nvidia-modelopt |
|
Type: str |
|
Path to restore a previously saved ModelOpt quantized checkpoint. If provided, the quantization process will be skipped and the model will be loaded from this checkpoint. |
|
Type: str |
|
Path to save the ModelOpt quantized checkpoint after quantization. This allows reusing the quantized model in future runs. |
|
Type: str |
|
Path to export the quantized model in HuggingFace format after ModelOpt quantization. The exported model can then be used directly with SGLang for inference. If not provided, the model will not be exported. |
|
Type: str |
|
Quantize the model with ModelOpt and immediately serve it without exporting. This is useful for development and prototyping. For production, it’s recommended to use separate quantization and deployment steps. |
|
bool flag (set to enable) |
|
Path to the FlashRL quantization profile. Required when using --load-format flash_rl. |
|
Type: str |
Memory and scheduling#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
The fraction of the memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors. |
|
Type: float |
|
The maximum number of running requests. |
|
Type: int |
|
The maximum number of queued requests. This option is ignored when using disaggregation-mode. |
|
Type: int |
|
The maximum number of tokens in the memory pool. If not specified, it will be automatically calculated based on the memory usage fraction. This option is typically used for development and debugging purposes. |
|
Type: int |
|
The maximum number of tokens in a chunk for the chunked prefill. Setting this to -1 means disabling chunked prefill. |
|
Type: int |
|
The maximum number of requests in a prefill batch. If not specified, there is no limit. |
|
Type: int |
|
Enable dynamic chunk size adjustment for pipeline parallelism. When enabled, chunk sizes are dynamically calculated based on fitted function to maintain consistent execution time across chunks. |
|
bool flag (set to enable) |
|
The maximum number of tokens in a prefill batch. The real bound will be the maximum of this value and the model’s maximum context length. |
|
Type: int |
|
The scheduling policy of the requests. |
|
|
|
Enable priority scheduling. Requests with higher priority integer values will be scheduled first by default. |
|
bool flag (set to enable) |
|
If set, abort requests that specify a priority when priority scheduling is disabled. |
|
bool flag (set to enable) |
|
If specified with –enable-priority-scheduling, the scheduler will schedule requests with lower priority integer values first. |
|
bool flag (set to enable) |
|
Minimum difference in priorities for an incoming request to have to preempt running request(s). |
|
Type: int |
|
How conservative the schedule policy is. A larger value means more conservative scheduling. Use a larger value if you see requests being retracted frequently. |
|
Type: float |
|
The number of tokens in a page. |
|
Type: int |
|
The ratio of SWA layer KV tokens / full layer KV tokens, regardless of the number of swa:full layers. It should be between 0 and 1. E.g. 0.5 means if each swa layer has 50 tokens, then each full layer has 100 tokens. |
|
Type: float |
|
Disable the hybrid SWA memory. |
|
bool flag (set to enable) |
|
The eviction policy of radix trees. ‘lru’ stands for Least Recently Used, ‘lfu’ stands for Least Frequently Used. |
|
|
|
Enable prefill delayer for DP attention to reduce idle time. |
|
bool flag (set to enable) |
|
Maximum forward passes to delay prefill. |
|
Type: int |
|
Token usage low watermark for prefill delayer. |
|
Type: float |
|
Custom buckets for prefill delayer forward passes histogram. 0 and max_delay_passes-1 will be auto-added. |
|
List[float] |
|
Custom buckets for prefill delayer wait seconds histogram. 0 will be auto-added. |
|
List[float] |
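A hedged example of combining the memory and scheduling knobs above; --mem-fraction-static and --chunked-prefill-size appear in the launch commands earlier on this page, while --max-running-requests is an assumed spelling to verify with --help.

    # Reduce static memory, cap prefill chunks, and limit concurrent requests
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --mem-fraction-static 0.8 \
      --chunked-prefill-size 4096 \
      --max-running-requests 256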
Runtime options#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
The device to use (‘cuda’, ‘xpu’, ‘hpu’, ‘npu’, ‘cpu’). Defaults to auto-detection if not specified. |
|
Type: str |
|
The tensor parallelism size. |
|
Type: int |
|
The pipeline parallelism size. |
|
Type: int |
|
The attention context parallelism size. |
|
Type: int |
|
The moe data parallelism size. |
|
Type: int |
|
The maximum micro batch size in pipeline parallelism. |
|
Type: int |
|
The async batch depth of pipeline parallelism. |
|
Type: int |
|
The interval (or buffer size) for streaming in terms of the token length. A smaller value makes streaming smoother, while a larger value makes the throughput higher |
|
Type: int |
|
Whether to output as a sequence of disjoint segments. |
|
bool flag (set to enable) |
|
The random seed. |
|
Type: int |
|
(outlines and llguidance backends only) Regex pattern for syntactic whitespaces allowed in JSON constrained output. For example, to allow the model to generate consecutive whitespaces, set the pattern to [\n\t ]* |
|
Type: str |
|
(xgrammar and llguidance backends only) Enforce compact representation in JSON constrained output. |
|
bool flag (set to enable) |
|
Set watchdog timeout in seconds. If a forward batch takes longer than this, the server will crash to prevent hanging. |
|
Type: float |
|
Set soft watchdog timeout in seconds. If a forward batch takes longer than this, the server will dump information for debugging. |
|
Type: float |
|
Set timeout for torch.distributed initialization. |
|
Type: int |
|
Model download directory for huggingface. |
|
Type: str |
|
Model file integrity verification. If provided without value, uses model-path as HF repo ID. Otherwise, provide checksums JSON file path or HuggingFace repo ID. |
|
Type: str |
|
The base GPU ID to start allocating GPUs from. Useful when running multiple instances on the same machine. |
|
Type: int |
|
The delta between consecutive GPU IDs that are used. For example, setting it to 2 will use GPU 0,2,4,… |
|
Type: int |
|
Reduce CPU usage when sglang is idle. |
|
bool flag (set to enable) |
|
Register a custom sigquit handler so you can do additional cleanup after the server is shutdown. This is only available for Engine, not for CLI. |
|
Type: str |
Logging#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
The logging level of all loggers. |
|
Type: str |
|
The logging level of the HTTP server. If not set, reuses --log-level by default. |
|
Type: str |
|
Log metadata, inputs, outputs of all requests. The verbosity is decided by --log-requests-level |
|
bool flag (set to enable) |
|
0: Log metadata (no sampling parameters). 1: Log metadata and sampling parameters. 2: Log metadata, sampling parameters and partial input/output. 3: Log every input/output. |
|
|
|
Format for request logging: ‘text’ (human-readable) or ‘json’ (structured) |
|
|
|
Target(s) for request logging: ‘stdout’ and/or directory path(s) for file output. Can specify multiple targets, e.g., ‘--log-requests-target stdout /my/path’. |
|
List[str] |
|
Exclude uvicorn access logs whose request path starts with any of these prefixes. Defaults to empty (disabled). |
|
List[str] |
|
Folder path to dump requests from the last 5 min before a crash (if any). If not specified, crash dumping is disabled. |
|
Type: str |
|
Show time cost of custom marks. |
|
bool flag (set to enable) |
|
Enable logging of Prometheus metrics. |
|
bool flag (set to enable) |
|
Set --enable-metrics-for-all-schedulers when you want schedulers on all TP ranks (not just TP 0) to record request metrics separately. This is especially useful when dp_attention is enabled, as otherwise all metrics appear to come from TP 0. |
|
bool flag (set to enable) |
|
Specify the HTTP header for passing custom labels for tokenizer metrics. |
|
Type: str |
|
The custom labels allowed for tokenizer metrics. The labels are specified via a dict in ‘–tokenizer-metrics-custom-labels-header’ field in HTTP requests, e.g., {‘label1’: ‘value1’, ‘label2’: ‘value2’} is allowed if ‘–tokenizer-metrics-allowed-custom-labels label1 label2’ is set. |
|
List[str] |
|
The buckets of time to first token, specified as a list of floats. |
|
List[float] |
|
The buckets of inter-token latency, specified as a list of floats. |
|
List[float] |
|
The buckets of end-to-end request latency, specified as a list of floats. |
|
List[float] |
|
Collect prompt/generation tokens histogram. |
|
bool flag (set to enable) |
|
The buckets rule of prompt tokens. Supports 3 rule types: ‘default’ uses predefined buckets; ‘tse |
|
List[str] |
|
The buckets rule for generation tokens histogram. Supports 3 rule types: ‘default’ uses predefined buckets; ‘tse |
|
List[str] |
|
The threshold for long GC warning. If a GC takes longer than this, a warning will be logged. Set to 0 to disable. |
|
Type: float |
|
The log interval of decode batch. |
|
Type: int |
|
Enable per request time stats logging |
|
bool flag (set to enable) |
|
Config in json format for NVIDIA dynamo KV event publishing. Publishing will be enabled if this flag is used. |
|
Type: str |
|
Enable opentelemetry trace |
|
bool flag (set to enable) |
|
Configure the OpenTelemetry collector endpoint if --enable-trace is set. Format: |
|
Type: str |
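A sketch of a typical observability setup using the flags listed above; the /metrics path for Prometheus scraping is an assumption about the default metrics endpoint.

    # Enable Prometheus metrics and request logging
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --enable-metrics \
      --log-requests \
      --log-requests-level 1

    # Scrape the metrics endpoint (path assumed)
    curl http://localhost:30000/metrics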
RequestMetricsExporter configuration#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
Export performance metrics for each request to local file (e.g. for forwarding to external systems). |
|
bool flag (set to enable) |
|
Directory path for writing performance metrics files (required when --export-metrics-to-file is enabled). |
|
Type: str |
Data parallelism#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
The data parallelism size. |
|
Type: int |
|
The load balancing strategy for data parallelism. The |
|
|
Multi-node distributed serving#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
The host address for initializing distributed backend (e.g., |
|
Type: str |
|
The number of nodes. |
|
Type: int |
|
The node rank. |
|
Type: int |
Model override args#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
A dictionary in JSON string format used to override default model configurations. |
|
Type: str |
|
json-formatted sampling settings that will be returned in /get_model_info |
|
Type: str |
LoRA#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
Enable LoRA support for the model. This argument is automatically set to |
|
Bool flag (set to enable) |
|
Enable asynchronous LoRA weight loading in order to overlap H2D transfers with GPU compute. This should be enabled if you find that your LoRA workloads are bottlenecked by adapter weight loading, for example when frequently loading large LoRA adapters. |
|
Bool flag (set to enable) |
|
The maximum LoRA rank that should be supported. If not specified, it will be automatically inferred from the adapters provided in |
|
Type: int |
|
The union set of all target modules where LoRA should be applied (e.g., |
|
|
|
The list of LoRA adapters to load. Each adapter must be specified in one of the following formats: |
|
Type: List[str] / JSON objects |
|
Maximum number of adapters for a running batch, including base-only requests. |
|
Type: int |
|
If specified, limits the maximum number of LoRA adapters loaded in CPU memory at a time. Must be ≥ |
|
Type: int |
|
LoRA adapter eviction policy when the GPU memory pool is full. |
|
|
|
Choose the kernel backend for multi-LoRA serving. |
|
|
|
Maximum chunk size for the ChunkedSGMV LoRA backend. Only used when |
|
|
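A minimal LoRA serving sketch, assuming the flag spellings --enable-lora, --lora-paths (in name=path form), and --max-loras-per-batch; the adapter path is illustrative.

    # Assumed flag names and adapter path; verify with --help for your SGLang version
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --enable-lora \
      --lora-paths my_adapter=/path/to/lora_adapter \
      --max-loras-per-batch 4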
Kernel Backends (Attention, Sampling, Grammar, GEMM)#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
Choose the kernels for attention layers. |
|
|
|
Choose the kernels for prefill attention layers (takes priority over --attention-backend). |
|
|
|
Choose the kernels for decode attention layers (takes priority over --attention-backend). |
|
|
|
Choose the kernels for sampling layers. |
|
|
|
Choose the backend for grammar-guided decoding. |
|
|
|
Set multimodal attention backend. |
|
|
|
Choose the NSA backend for the prefill stage (overrides |
|
|
|
Choose the NSA backend for the decode stage when running DeepSeek NSA-style attention. Overrides |
|
|
|
Choose the runner backend for Blockwise FP8 GEMM operations. Options: ‘auto’ (default, auto-selects based on hardware), ‘deep_gemm’ (JIT-compiled; enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) when DeepGEMM is installed), ‘flashinfer_trtllm’ (optimal for Blackwell and low-latency), ‘flashinfer_deepgemm’ (Hopper SM90 only; uses swapAB optimization for small M dimensions in decoding), ‘cutlass’ (optimal for Hopper/Blackwell GPUs and high-throughput), ‘triton’ (fallback, widely compatible), ‘aiter’ (ROCm only). NOTE: This replaces the deprecated environment variables SGLANG_ENABLE_FLASHINFER_FP8_GEMM and SGLANG_SUPPORT_CUTLASS_BLOCK_FP8. |
|
|
|
Choose the runner backend for NVFP4 GEMM operations. Options: ‘flashinfer_cutlass’ (default), ‘auto’ (auto-selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version), ‘flashinfer_cudnn’ (FlashInfer cuDNN backend, optimal on CUDA 13+ with cuDNN 9.15+), ‘flashinfer_trtllm’ (FlashInfer TensorRT-LLM backend, requires different weight preparation with shuffling). All backends are from FlashInfer; when FlashInfer is unavailable, sgl-kernel CUTLASS is used as an automatic fallback. NOTE: This replaces the deprecated environment variable SGLANG_FLASHINFER_FP4_GEMM_BACKEND. |
|
|
|
Flashinfer autotune is enabled by default. Set this flag to disable the autotune. |
|
bool flag (set to enable) |
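A hedged example of selecting kernel backends explicitly; the flag names follow the table above, and the backend values (flashinfer, xgrammar) are assumptions to check against --help on your build.

    # Assumed flag names and backend values; verify with --help
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --attention-backend flashinfer \
      --grammar-backend xgrammar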
Speculative decoding#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
Speculative algorithm. |
|
|
|
The path of the draft model weights. This can be a local folder or a Hugging Face repo ID. |
|
Type: str |
|
The specific draft model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version. |
|
Type: str |
|
The format of the draft model weights to load. If not specified, will use the same format as --load-format. Use ‘dummy’ to initialize draft model weights with random values for profiling. |
|
Same as --load-format options |
|
The number of steps sampled from draft model in Speculative Decoding. |
|
Type: int |
|
The number of tokens sampled from the draft model in eagle2 each step. |
|
Type: int |
|
The number of tokens sampled from the draft model in Speculative Decoding. |
|
Type: int |
|
Accept a draft token if its probability in the target model is greater than this threshold. |
|
Type: float |
|
The accept probability of a draft token is raised from its target probability p to min(1, p / threshold_acc). |
|
Type: float |
|
The path of the draft model’s small vocab table. |
|
Type: str |
|
Attention backend for speculative decoding operations (both target verify and draft extend). Can be one of ‘prefill’ (default) or ‘decode’. |
|
|
|
Attention backend for speculative decoding drafting. |
|
Same as attention backend options |
|
MOE backend for EAGLE speculative decoding, see --moe-runner-backend for options. Same as moe runner backend if unset. |
|
Same as --moe-runner-backend options |
|
MOE A2A backend for EAGLE speculative decoding, see --moe-a2a-backend for options. Same as moe a2a backend if unset. |
|
Same as --moe-a2a-backend options |
|
The quantization method for speculative model. |
|
Same as --quantization options |
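A sketch of an EAGLE speculative decoding launch using the arguments above; the flag spellings and the draft model repository are assumptions for illustration only.

    # Assumed flag names and draft model path; verify with --help
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --speculative-algorithm EAGLE \
      --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B \
      --speculative-num-steps 3 \
      --speculative-eagle-topk 4 \
      --speculative-num-draft-tokens 8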
Ngram speculative decoding#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
The minimum window size for pattern matching in ngram speculative decoding. |
|
Type: int |
|
The maximum window size for pattern matching in ngram speculative decoding. |
|
Type: int |
|
The minimum breadth for BFS (Breadth-First Search) in ngram speculative decoding. |
|
Type: int |
|
The maximum breadth for BFS (Breadth-First Search) in ngram speculative decoding. |
|
Type: int |
|
The match type for cache tree. |
|
|
|
The branch length for ngram speculative decoding. |
|
Type: int |
|
The cache capacity for ngram speculative decoding. |
|
Type: int |
Multi-layer Eagle speculative decoding#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
Enable multi-layer Eagle speculative decoding. |
|
bool flag (set to enable) |
MoE#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
The expert parallelism size. |
|
Type: int |
|
Select the backend for all-to-all communication for expert parallelism. |
|
|
|
Choose the runner backend for MoE. |
|
|
|
Choose the computation precision of flashinfer mxfp4 moe |
|
|
|
Enable FlashInfer allreduce fusion with Residual RMSNorm. |
|
bool flag (set to enable) |
|
Select the mode used when DeepEP MoE is enabled. It can be |
|
|
|
Allocate this number of redundant experts in expert parallel. |
|
Type: int |
|
The algorithm to choose ranks for redundant experts in expert parallel. |
|
Type: str |
|
Initial location of EP experts. |
|
Type: str |
|
Enable EPLB algorithm |
|
bool flag (set to enable) |
|
Chosen EPLB algorithm |
|
Type: str |
|
Number of iterations to automatically trigger an EPLB re-balance. |
|
Type: int |
|
Number of layers to rebalance per forward pass. |
|
Type: int |
|
Minimum threshold for GPU average utilization to trigger EPLB rebalancing. Must be in the range [0.0, 1.0]. |
|
Type: float |
|
Mode of expert distribution recorder. |
|
Type: str |
|
Circular buffer size of expert distribution recorder. Set to -1 to denote infinite buffer. |
|
Type: int |
|
Enable logging metrics for expert balancedness |
|
bool flag (set to enable) |
|
Tuned DeepEP config suitable for your own cluster. It can be either a string with JSON content or a file path. |
|
Type: str |
|
TP size for MoE dense MLP layers. This flag is useful when, with large TP size, there are errors caused by weights in MLP layers having dimension smaller than the min dimension GEMM supports. |
|
Type: int |
|
Specify the collective communication backend for elastic EP. Currently supports ‘mooncake’. |
|
|
|
The InfiniBand devices for Mooncake Backend transfer, accepts multiple comma-separated devices (e.g., --mooncake-ib-device mlx5_0,mlx5_1). Default is None, which triggers automatic device detection when Mooncake Backend is enabled. |
|
Type: str |
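A hedged expert-parallel sketch; --moe-a2a-backend is referenced in the speculative decoding table above, while --ep-size and the deepep value are assumed spellings to verify with --help.

    # Assumed flag names and values; verify with --help
    python -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-V3 \
      --tp 8 \
      --ep-size 8 \
      --moe-a2a-backend deepep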
Mamba Cache#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
The maximum size of the mamba cache. |
|
Type: int |
|
The data type of the SSM states in mamba cache. |
|
|
|
The ratio of mamba state memory to full kv cache memory. |
|
Type: float |
|
The strategy to use for mamba scheduler. |
|
|
|
The interval (in tokens) to track the mamba state during decode. Only used when |
|
Type: int |
Hierarchical cache#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
Enable hierarchical cache |
|
bool flag (set to enable) |
|
The ratio of the size of host KV cache memory pool to the size of device pool. |
|
Type: float |
|
The size of host KV cache memory pool in gigabytes, which will override the hicache_ratio if set. |
|
Type: int |
|
The write policy of hierarchical cache. |
|
|
|
The IO backend for KV cache transfer between CPU and GPU |
|
|
|
The layout of host memory pool for hierarchical cache. |
|
|
|
The storage backend for hierarchical KV cache. Built-in backends: file, mooncake, hf3fs, nixl, aibrix. For dynamic backend, use --hicache-storage-backend-extra-config to specify: backend_name (custom name), module_path (Python module path), class_name (backend class name). |
|
|
|
Control when prefetching from the storage backend should stop. |
|
|
|
A dictionary in JSON string format, or a string starting with a |
|
Type: str |
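A minimal hierarchical cache sketch, assuming the flag spellings --enable-hierarchical-cache and --hicache-ratio that correspond to the options above.

    # Assumed flag names; verify with --help
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --enable-hierarchical-cache \
      --hicache-ratio 2.0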
Hierarchical sparse attention#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
A dictionary in JSON string format for hierarchical sparse attention configuration. Required fields: |
|
Type: str |
LMCache#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
Using LMCache as an alternative hierarchical cache solution |
|
bool flag (set to enable) |
Ktransformers#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
[ktransformers parameter] The path of the quantized expert weights for amx kernel. A local folder. |
|
Type: str |
|
[ktransformers parameter] Quantization formats for CPU execution. |
|
Type: str |
|
[ktransformers parameter] The number of CPUInfer threads. |
|
Type: int |
|
[ktransformers parameter] One-to-one with the number of NUMA nodes (one thread pool per NUMA). |
|
Type: int |
|
[ktransformers parameter] The number of GPU experts. |
|
Type: int |
|
[ktransformers parameter] Maximum number of experts deferred to CPU per token. All MoE layers except the final one use this value; the final layer always uses 0. |
|
Type: int |
Diffusion LLM#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
The diffusion LLM algorithm, such as LowConfidence. |
|
Type: str |
|
The diffusion LLM algorithm configurations. Must be a YAML file. |
|
Type: str |
Double Sparsity#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
Enable double sparsity attention |
|
bool flag (set to enable) |
|
The path of the double sparsity channel config |
|
Type: str |
|
The number of heavy channels in double sparsity attention |
|
Type: int |
|
The number of heavy tokens in double sparsity attention |
|
Type: int |
|
The type of heavy channels in double sparsity attention |
|
Type: str |
|
The minimum decode sequence length required before the double-sparsity backend switches from the dense fallback to the sparse decode kernel. |
|
Type: int |
Offloading#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
How many GBs of RAM to reserve for CPU offloading. |
|
Type: int |
|
Number of layers per group in offloading. |
|
Type: int |
|
Number of layers to be offloaded within a group. |
|
Type: int |
|
Steps to prefetch in offloading. |
|
Type: int |
|
Mode of offloading. |
|
Type: str |
Args for multi-item scoring#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
Delimiter token ID for multi-item scoring. Used to combine Query and Items into a single sequence: Query |
|
Type: int |
Optimization/debug options#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
Disable RadixAttention for prefix caching. |
|
bool flag (set to enable) |
|
Set the maximum batch size for cuda graph. It will extend the cuda graph capture batch size to this value. |
|
Type: int |
|
Set the list of batch sizes for cuda graph. |
|
List[int] |
|
Disable cuda graph. |
|
bool flag (set to enable) |
|
Disable cuda graph when padding is needed. Still uses cuda graph when padding is not needed. |
|
bool flag (set to enable) |
|
Enable profiling of cuda graph capture. |
|
bool flag (set to enable) |
|
Enable garbage collection during CUDA graph capture. If disabled (default), GC is frozen during capture to speed up the process. |
|
bool flag (set to enable) |
|
Enable layerwise NVTX profiling annotations for the model. This adds NVTX markers to every layer for detailed per-layer performance analysis with Nsight Systems. |
|
bool flag (set to enable) |
|
Enable NCCL NVLS for prefill heavy requests when available. |
|
bool flag (set to enable) |
|
Enable NCCL symmetric memory for fast collectives. |
|
bool flag (set to enable) |
|
Disables quantization before all-gather for flashinfer cutlass moe. |
|
bool flag (set to enable) |
|
Enable batch tokenization for improved performance when processing multiple text inputs. Do not use with image inputs, pre-tokenized input_ids, or input_embeds. |
|
bool flag (set to enable) |
|
Disable batch decoding when decoding multiple completions. |
|
bool flag (set to enable) |
|
Disable disk cache of outlines to avoid possible crashes related to file system or high concurrency. |
|
bool flag (set to enable) |
|
Disable the custom all-reduce kernel and fall back to NCCL. |
|
bool flag (set to enable) |
|
Enable using mscclpp for small messages for all-reduce kernel and fall back to NCCL. |
|
bool flag (set to enable) |
|
Enable using torch symm mem for all-reduce kernel and fall back to NCCL. Only supports CUDA device SM90 and above. SM90 supports world size 4, 6, 8. SM10 supports world size 6, 8. |
|
bool flag (set to enable) |
|
Disable the overlap scheduler, which overlaps the CPU scheduler with GPU model worker. |
|
bool flag (set to enable) |
|
Enable mixing prefill and decode in a batch when using chunked prefill. |
|
bool flag (set to enable) |
|
Enable data parallelism for attention and tensor parallelism for FFN. The dp size should be equal to the tp size. Currently, DeepSeek-V2 and Qwen 2/3 MoE models are supported. |
|
bool flag (set to enable) |
|
Enable vocabulary parallel across the attention TP group to avoid all-gather across DP groups, optimizing performance under DP attention. |
|
bool flag (set to enable) |
|
Enable two micro batches to overlap. |
|
bool flag (set to enable) |
|
Let computation and communication overlap within one micro batch. |
|
bool flag (set to enable) |
|
The threshold of token distribution between two batches in micro-batch overlap; determines whether to use two-batch-overlap or two-chunk-overlap. Set to 0 to disable two-chunk-overlap. |
|
Type: float |
|
Optimize the model with torch.compile. Experimental feature. |
|
bool flag (set to enable) |
|
Enable debug mode for torch compile. |
|
bool flag (set to enable) |
|
Optimize the model with piecewise cuda graph for extend/prefill only. Experimental feature. |
|
bool flag (set to enable) |
|
Set the list of tokens when using piecewise cuda graph. |
|
Type: JSON list |
|
Set the compiler for piecewise cuda graph. Choices are: eager, inductor. |
|
|
|
Set the maximum batch size when using torch compile. |
|
Type: int |
|
Set the maximum tokens when using piecewise cuda graph. |
|
Type: int |
|
Optimize the model with torchao. Experimental feature. Current choices are: int8dq, int8wo, int4wo-<group_size>, fp8wo, fp8dq-per_tensor, fp8dq-per_row |
|
Type: str |
|
Enable the NaN detection for debugging purposes. |
|
bool flag (set to enable) |
|
Enable the P2P check for GPU access; otherwise, P2P access is allowed by default. |
|
bool flag (set to enable) |
|
Cast the intermediate attention results to fp32 to avoid possible crashes related to fp16. This only affects Triton attention kernels. |
|
bool flag (set to enable) |
|
The number of KV splits in flash decoding Triton kernel. Larger value is better in longer context scenarios. The default value is 8. |
|
Type: int |
|
The size of split KV tile in flash decoding Triton kernel. Used for deterministic inference. |
|
Type: int |
|
Run multiple continuous decoding steps to reduce scheduling overhead. This can potentially increase throughput but may also increase time-to-first-token latency. The default value is 1, meaning only run one decoding step at a time. |
|
Type: int |
|
Delete the model checkpoint after loading the model. |
|
bool flag (set to enable) |
|
Allow saving memory using release_memory_occupation and resume_memory_occupation |
|
bool flag (set to enable) |
|
Save model weights to CPU memory during release_weights_occupation and resume_weights_occupation |
|
bool flag (set to enable) |
|
Save draft model weights to CPU memory during release_weights_occupation and resume_weights_occupation |
|
bool flag (set to enable) |
|
Allow automatically truncating requests that exceed the maximum input length instead of returning an error. |
|
bool flag (set to enable) |
|
Enable users to pass custom logit processors to the server (disabled by default for security) |
|
bool flag (set to enable) |
|
Do not use the ragged prefill wrapper when running FlashInfer MLA. |
|
bool flag (set to enable) |
|
Disable shared experts fusion optimization for deepseek v3/r1. |
|
bool flag (set to enable) |
|
Disable chunked prefix cache feature for deepseek, which should save overhead for short sequences. |
|
bool flag (set to enable) |
|
Adopt base image processor instead of fast image processor. |
|
bool flag (set to enable) |
|
Keep multimodal feature tensors on device after processing to save D2H copy. |
|
bool flag (set to enable) |
|
Enable returning hidden states with responses. |
|
bool flag (set to enable) |
|
Enable returning routed experts of each layer with responses. |
|
bool flag (set to enable) |
|
The interval to poll requests in scheduler. Can be set to >1 to reduce the overhead of this. |
|
Type: int |
|
Sets the NUMA node for the subprocesses. The i-th element corresponds to the i-th subprocess. |
|
List[int] |
|
Enable deterministic inference mode with batch invariant ops. |
|
bool flag (set to enable) |
|
The training system that SGLang needs to match for true on-policy. |
|
|
|
Allow input of attention to be scattered when only using tensor parallelism, to reduce the computational load of operations such as qkv latent. |
|
bool flag (set to enable) |
|
Enable context parallelism used in the long sequence prefill phase of DeepSeek v3.2. |
|
bool flag (set to enable) |
|
Token splitting mode for the prefill phase of DeepSeek v3.2 under context parallelism. Optional values: |
|
|
|
Enable fused qk normalization and rope rotary embedding. |
|
bool flag (set to enable) |
|
Enable corner alignment for resize of embeddings grid to ensure more accurate (but slower) evaluation of interpolated embedding values. |
|
bool flag (set to enable) |
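A hedged example combining a few optimization flags from the table above; --enable-torch-compile appears earlier on this page, while --cuda-graph-max-bs and --num-continuous-decode-steps are assumed spellings to verify with --help.

    # Assumed flag names; verify with --help
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --enable-torch-compile \
      --cuda-graph-max-bs 16 \
      --num-continuous-decode-steps 2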
Dynamic batch tokenizer#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
Enable async dynamic batch tokenizer for improved performance when multiple requests arrive concurrently. |
|
bool flag (set to enable) |
|
[Only used if --enable-dynamic-batch-tokenizer is set] Maximum batch size for dynamic batch tokenizer. |
|
Type: int |
|
[Only used if --enable-dynamic-batch-tokenizer is set] Timeout in seconds for batching tokenization requests. |
|
Type: float |
Debug tensor dumps#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
The output folder for dumping tensors. |
|
Type: str |
|
The layer ids to dump. Dump all layers if not specified. |
|
Type: JSON list |
|
The input filename for dumping tensors |
|
Type: str |
|
Inject the outputs from jax as the input of every layer. |
|
Type: str |
PD disaggregation#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
Only used for PD disaggregation. “prefill” for a prefill-only server, and “decode” for a decode-only server. If not specified, the server is not PD disaggregated. |
|
|
|
The backend for disaggregation transfer. Default is mooncake. |
|
|
|
Bootstrap server port on the prefill server. Default is 8998. |
|
Type: int |
|
Decode tp size. If not set, it matches the tp size of the current engine. This is only set on the prefill server. |
|
Type: int |
|
Decode dp size. If not set, it matches the dp size of the current engine. This is only set on the prefill server. |
|
Type: int |
|
Prefill pp size. If not set, it defaults to 1. This is only set on the decode server. |
|
Type: int |
|
The InfiniBand devices for disaggregation transfer, accepts single device (e.g., --disaggregation-ib-device mlx5_0) or multiple comma-separated devices (e.g., --disaggregation-ib-device mlx5_0,mlx5_1). Default is None, which triggers automatic device detection when mooncake backend is enabled. |
|
Type: str |
|
Enable async KV cache offloading on decode server (PD mode). |
|
bool flag (set to enable) |
|
Number of decode tokens that will have memory reserved when adding new request to the running batch. |
|
Type: int |
|
The interval to poll requests in decode server. Can be set to >1 to reduce the overhead of this. |
|
Type: int |
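A rough PD disaggregation sketch with one prefill-only and one decode-only server; --disaggregation-ib-device is listed above, while --disaggregation-mode and --disaggregation-transfer-backend are assumed spellings, and the InfiniBand device name is illustrative.

    # Prefill-only server (assumed flag names; verify with --help)
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --disaggregation-mode prefill \
      --disaggregation-transfer-backend mooncake \
      --disaggregation-ib-device mlx5_0

    # Decode-only server
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --disaggregation-mode decode \
      --disaggregation-transfer-backend mooncake \
      --disaggregation-ib-device mlx5_0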
Encode prefill disaggregation#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
For MLLM with an encoder, launch an encoder-only server |
|
bool flag (set to enable) |
|
For VLM, load weights for the language model only. |
|
bool flag (set to enable) |
|
The backend for encoder disaggregation transfer. Default is zmq_to_scheduler. |
|
|
|
List of encoder server urls. |
|
Type: JSON list |
Custom weight loader#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
The custom data loader used to update the model. Should be set with a valid import path, such as my_package.weight_load_func |
|
List[str] |
|
Disable mmap while loading weight using safetensors. |
|
bool flag (set to enable) |
|
The ip of the seed instance for loading weights from remote instance. |
|
Type: str |
|
The service port of the seed instance for loading weights from remote instance. |
|
Type: int |
|
The communication group ports for loading weights from remote instance. |
|
Type: JSON list |
|
The backend for loading weights from remote instance. Can be ‘transfer_engine’ or ‘nccl’. Default is ‘nccl’. |
|
|
|
Start seed server via transfer engine backend for remote instance weight loader. |
|
bool flag (set to enable) |
For PD-Multiplexing#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
Enable PD-Multiplexing, PD running on greenctx stream. |
|
bool flag (set to enable) |
|
The path of the PD-Multiplexing config file. |
|
Type: str |
|
Number of sm partition groups. |
|
Type: int |
Configuration file support#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
Read CLI options from a config file. Must be a YAML file with configuration options. |
|
Type: str |
For Multi-Modal#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
The max concurrent calls for async mm data processing. |
|
Type: int |
|
The timeout for each multi-modal request in seconds. |
|
Type: int |
|
Enable broadcast mm-inputs process in scheduler. |
|
bool flag (set to enable) |
|
Multimodal preprocessing config, a JSON config containing keys: |
|
Type: JSON / Dict |
|
Enabling data parallelism for mm encoder. The dp size will be set to the tp size automatically. |
|
bool flag (set to enable) |
|
Limit the number of multimodal inputs per request. e.g. ‘{“image”: 1, “video”: 1, “audio”: 1}’ |
|
Type: JSON / Dict |
For checkpoint decryption#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
The path of the decrypted config file. |
|
Type: str |
|
The path of the decrypted draft config file. |
|
Type: str |
|
Enable prefix multimodal cache. Currently only supports mm-only. |
|
bool flag (set to enable) |
Forward hooks#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
JSON-formatted list of forward hook specifications. Each element must include |
|
Type: JSON list |
Deprecated arguments#
Argument |
Description |
Defaults |
Options |
|---|---|---|---|
|
NOTE: --enable-ep-moe is deprecated. Please set |
|
N/A |
|
NOTE: --enable-deepep-moe is deprecated. Please set |
|
N/A |
|
NOTE: --prefill-round-robin-balance is deprecated now. |
|
N/A |
|
NOTE: --enable-flashinfer-cutlass-moe is deprecated. Please set |
|
N/A |
|
NOTE: --enable-flashinfer-cutedsl-moe is deprecated. Please set |
|
N/A |
|
NOTE: --enable-flashinfer-trtllm-moe is deprecated. Please set |
|
N/A |
|
NOTE: --enable-triton-kernel-moe is deprecated. Please set |
|
N/A |
|
NOTE: --enable-flashinfer-mxfp4-moe is deprecated. Please set |
|
N/A |
|
Crash the server on nan logprobs. |
|
Type: str |
|
Mix ratio in [0,1] between uniform and hybrid kv buffers (0.0 = pure uniform: swa_size / full_size = 1)(1.0 = pure hybrid: swa_size / full_size = local_attention_size / context_length) |
|
Optional[float] |
|
The interval of load watching in seconds. |
|
Type: float |
|
Choose the NSA backend for the prefill stage (overrides |
|
|
|
Choose the NSA backend for the decode stage when running DeepSeek NSA-style attention. Overrides |
|
|