SGLang supports various environment variables that configure its runtime behavior. This document provides a comprehensive list and aims to stay up to date over time. Note: SGLang uses two prefixes for environment variables, SGL_ and SGLANG_, likely for historical reasons. Both are currently supported for different settings, but future versions may consolidate them.

General Configuration

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_USE_MODELSCOPE | Enable using models from ModelScope | false |
| SGLANG_HOST_IP | Host IP address for the server | 0.0.0.0 |
| SGLANG_PORT | Port for the server | auto-detected |
| SGLANG_LOGGING_CONFIG_PATH | Custom logging configuration path | Not set |
| SGLANG_DISABLE_REQUEST_LOGGING | Disable request logging | false |
| SGLANG_LOG_REQUEST_HEADERS | Comma-separated list of additional HTTP headers to log when --log-requests is enabled. Appended to the default x-smg-routing-key. | Not set |
| SGLANG_HEALTH_CHECK_TIMEOUT | Timeout for health checks in seconds | 20 |
| SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL | Interval (in passes) for collecting the selected-count metric of physical experts on each layer and GPU rank. 0 disables collection. | 0 |
| SGLANG_FORWARD_UNKNOWN_TOOLS | Forward unknown tool calls to clients instead of dropping them | false (drop unknown tools) |
| SGLANG_REQ_WAITING_TIMEOUT | Timeout (in seconds) for requests waiting in the queue before being scheduled | -1 |
| SGLANG_REQ_RUNNING_TIMEOUT | Timeout (in seconds) for requests running in the decode batch | -1 |
| SGLANG_CACHE_DIR | Cache directory for model weights and other data | ~/.cache/sglang |
| SGLANG_PREFETCH_BLOCK_SIZE_MB | Block size (in MB) for sequential checkpoint prefetch reads that warm the OS page cache before workers load weights via mmap | 16 |
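
The two request-timeout variables above use -1 to mean "disabled". A minimal sketch of that convention (get_timeout is a hypothetical helper, not SGLang's actual parsing code):

```python
import os

def get_timeout(name: str, default: float = -1.0):
    """Hypothetical helper: read a timeout env var where any negative
    value (the documented default of -1) means the timeout is disabled."""
    raw = os.environ.get(name, str(default))
    try:
        value = float(raw)
    except ValueError:
        return None
    return value if value >= 0 else None

# A request waits at most 30 s in the queue; the running timeout stays disabled.
os.environ["SGLANG_REQ_WAITING_TIMEOUT"] = "30"
```

With the variable set as above, `get_timeout("SGLANG_REQ_WAITING_TIMEOUT")` returns 30.0, while the unset `SGLANG_REQ_RUNNING_TIMEOUT` resolves to None (no timeout).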

Performance Tuning

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_ENABLE_TORCH_INFERENCE_MODE | Control whether to use torch.inference_mode | false |
| SGLANG_ENABLE_TORCH_COMPILE | Enable torch.compile | false |
| SGLANG_SET_CPU_AFFINITY | Enable CPU affinity setting (often set to 1 in Docker builds) | false |
| SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN | Allow the scheduler to overwrite longer context length requests (often set to 1 in Docker builds) | false |
| SGLANG_IS_FLASHINFER_AVAILABLE | Control the FlashInfer availability check | true |
| SGLANG_SKIP_P2P_CHECK | Skip the P2P (peer-to-peer) access check | false |
| SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD | Threshold for enabling chunked prefix caching | 8192 |
| SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION | Enable RoPE fusion in fused MLA (Multi-head Latent Attention) | 1 |
| SGLANG_DISABLE_CONSECUTIVE_PREFILL_OVERLAP | Disable the overlap schedule for consecutive prefill batches | false |
| SGLANG_SCHEDULER_MAX_RECV_PER_POLL | Maximum number of requests received per poll; a negative value means no limit | -1 |
| SGLANG_DISABLE_FA4_WARMUP | Disable FlashAttention 4 warmup passes (set to 1, true, yes, or on to disable) | false |
| SGLANG_DATA_PARALLEL_BUDGET_INTERVAL | Interval for DPBudget updates | 1 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DEFAULT | Default weight for the scheduler recv-skipper counter (used when the forward mode does not match a specific mode). Only active when --scheduler-recv-interval > 1. The counter accumulates weights and triggers request polling when it reaches the interval threshold. | 1000 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DECODE | Weight increment for the decode forward mode in the scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency during the decode phase. | 1 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_TARGET_VERIFY | Weight increment for the target-verify forward mode in the scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency during the verification phase. | 1 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_NONE | Weight increment when the forward mode is None in the scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency when no specific forward mode is active. | 1 |
| SGLANG_MM_BUFFER_SIZE_MB | Size (in MB) of the preallocated GPU buffer for multi-modal feature hashing. When set to a positive value, features are temporarily moved to the GPU for faster hash computation, then moved back to the CPU to save GPU memory. Larger features benefit more from GPU hashing. Set to 0 to disable. | 0 |
| SGLANG_MM_PRECOMPUTE_HASH | Enable precomputing hash values for MultimodalDataItem | false |
| SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH | Use NCCL for gathering when preparing the MLP sync batch under the overlap scheduler (without this flag, Gloo is used) | false |
| SGLANG_SYMM_MEM_PREALLOC_GB_SIZE | Size (in GB) of the preallocated GPU buffer for the NCCL symmetric memory pool, to limit memory fragmentation. Only has an effect when the server argument --enable-symm-mem is set. | -1 |
| SGLANG_CUSTOM_ALLREDUCE_ALGO | Algorithm for custom all-reduce. Set to oneshot or 1stage to force one-shot; set to twoshot or 2stage to force two-shot. | Not set |
| SGLANG_SKIP_SOFTMAX_PREFILL_THRESHOLD_SCALE_FACTOR | Skip-softmax threshold scale factor for TRT-LLM prefill attention in FlashInfer. None means standard attention. See https://arxiv.org/abs/2512.12087 | None |
| SGLANG_SKIP_SOFTMAX_DECODE_THRESHOLD_SCALE_FACTOR | Skip-softmax threshold scale factor for TRT-LLM decode attention in FlashInfer. None means standard attention. See https://arxiv.org/abs/2512.12087 | None |
| SGLANG_USE_SGL_FA3_KERNEL | Use the sgl-kernel implementation for FlashAttention v3 | true |
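
The four SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_* variables describe a weighted counter that throttles request polling. A hypothetical sketch of the mechanism (the class name and the reset-on-trigger behavior are assumptions of this sketch, not SGLang's implementation):

```python
class RecvSkipper:
    """Sketch of the recv-skipper idea: each scheduler step adds a
    per-forward-mode weight to a counter, and new requests are polled
    only when the counter reaches --scheduler-recv-interval."""

    def __init__(self, interval: int, weights: dict, default_weight: int = 1000):
        self.interval = interval          # --scheduler-recv-interval
        self.weights = weights            # e.g. {"decode": 1, "target_verify": 1}
        self.default_weight = default_weight  # SGLANG_..._WEIGHT_DEFAULT
        self.counter = 0

    def should_poll(self, forward_mode) -> bool:
        # Accumulate the weight for this step's forward mode.
        self.counter += self.weights.get(forward_mode, self.default_weight)
        if self.counter >= self.interval:
            self.counter = 0
            return True
        return False
```

With interval 4 and a decode weight of 1, polling happens every fourth decode step; an unmatched forward mode uses the large default weight (1000) and therefore triggers a poll immediately.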

DeepGEMM Configuration (Advanced Optimization)

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_ENABLE_JIT_DEEPGEMM | Enable just-in-time compilation of DeepGEMM kernels (enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) GPUs when the DeepGEMM package is installed; set to "0" to disable) | "true" |
| SGLANG_JIT_DEEPGEMM_PRECOMPILE | Enable precompilation of DeepGEMM kernels | "true" |
| SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS | Number of workers for parallel DeepGEMM kernel compilation | 4 |
| SGLANG_IN_DEEPGEMM_PRECOMPILE_STAGE | Indicator flag used during the DeepGEMM precompile script | "false" |
| SGLANG_DG_CACHE_DIR | Directory for caching compiled DeepGEMM kernels | ~/.cache/deep_gemm |
| SGLANG_DG_USE_NVRTC | Use NVRTC (instead of Triton) for JIT compilation (experimental) | "false" |
| SGLANG_USE_DEEPGEMM_BMM | Use DeepGEMM for batched matrix multiplication (BMM) operations | "false" |
| SGLANG_JIT_DEEPGEMM_FAST_WARMUP | Precompile fewer kernels during warmup, reducing warmup time from about 30 minutes to under 3 minutes. May cause performance degradation at runtime. | "false" |

DeepEP Configuration

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_DEEPEP_BF16_DISPATCH | Use bfloat16 for dispatch | "false" |
| SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK | Maximum number of dispatched tokens on each GPU | "128" |
| SGLANG_FLASHINFER_NUM_MAX_DISPATCH_TOKENS_PER_RANK | Maximum number of dispatched tokens on each GPU for --moe-a2a-backend=flashinfer | "1024" |
| SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS | Number of SMs used for DeepEP combine when single-batch overlap is enabled | "32" |
| SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO | Run shared experts on an alternate stream when single-batch overlap is enabled on GB200. When this flag is not set, shared experts and the down GEMM are overlapped with DeepEP combine. | "false" |

MORI Configuration

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_MORI_DISPATCH_DTYPE | Override the MoRI-EP dispatch quantization type. auto detects the type from the weight dtype; bf16/fp8/fp4 force the specified type for all layers. | "auto" |
| SGLANG_MORI_FP8_COMB | Use FP8 for combine | "false" |
| SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK | Maximum number of dispatch tokens per rank for MORI-EP buffer allocation | 4096 |
| SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD | Threshold for switching between the InterNodeV1 and InterNodeV1LL kernel types. InterNodeV1LL is used if SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK is less than or equal to this threshold; otherwise InterNodeV1 is used. | 256 |
| SGLANG_MORI_PREALLOC_MAX_RECV_TOKENS | Customizes the number of receive tokens preallocated per rank (otherwise derived from SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK). Valid range is 1 to world_size * SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK; the default of 0 means maximum. Smaller values reduce memory footprint, but a value that is too small can cause buffer overflow. | 0 |
| SGLANG_MORI_MOE_MAX_INPUT_TOKENS | Truncate the dispatch buffer to this many rows before MoE computation, reducing kernel overhead on padding tokens. The value must be >= the actual number of received tokens (totalRecvTokenNum); setting it too small causes incorrect results. 0 disables truncation (uses the full buffer). | 0 |
| SGLANG_MORI_QP_PER_TRANSFER | Number of RDMA queue pairs (QPs) used per transfer operation | 1 |
| SGLANG_MORI_POST_BATCH_SIZE | Number of RDMA work requests posted in a single batch to each QP | -1 |
| SGLANG_MORI_NUM_WORKERS | Number of worker threads in the RDMA executor thread pool | 1 |
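
The kernel-type switch described for SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD can be written out as a small sketch (select_mori_dispatch_kernel is a hypothetical helper, not SGLang's code):

```python
import os

def select_mori_dispatch_kernel() -> str:
    """Hypothetical helper: pick the low-latency InterNodeV1LL kernel when
    the per-rank dispatch token budget fits under the switch threshold."""
    max_tokens = int(os.environ.get(
        "SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK", "4096"))
    threshold = int(os.environ.get(
        "SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD", "256"))
    return "InterNodeV1LL" if max_tokens <= threshold else "InterNodeV1"
```

Under the documented defaults (4096 tokens per rank, threshold 256), the standard InterNodeV1 kernel is selected; lowering the token budget to 256 or below switches to InterNodeV1LL.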

NSA Backend Configuration (For DeepSeek V3.2)

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_NSA_FUSE_TOPK | Fuse the operations of picking top-k logits and picking top-k indices from the page table | true |
| SGLANG_NSA_ENABLE_MTP_PRECOMPUTE_METADATA | Precompute metadata that can be shared among different draft steps when MTP is enabled | true |
| SGLANG_USE_FUSED_METADATA_COPY | Control whether to use the fused metadata copy kernel for CUDA graph replay | true |
| SGLANG_NSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD | When the maximum KV length in the current prefill batch exceeds this value, the sparse MLA kernel is applied; otherwise it falls back to the dense MHA implementation. Defaults to the model's index top-k (2048 for DeepSeek V3.2). | 2048 |

Memory Management

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_DEBUG_MEMORY_POOL | Enable memory pool debugging | false |
| SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION | Clip the max-new-tokens estimation used for memory planning | 4096 |
| SGLANG_DETOKENIZER_MAX_STATES | Maximum number of states for the detokenizer | System-dependent |
| SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK | Enable checks for memory imbalance across tensor-parallel ranks | true |
| SGLANG_MOONCAKE_CUSTOM_MEM_POOL | Configure the custom memory pool type for Mooncake. Supports NVLINK, BAREX, and INTRA_NODE_NVLINK. If set to true, defaults to NVLINK. | None |

Model-Specific Options

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_USE_AITER | Use the AITER optimized implementation | false |
| SGLANG_MOE_PADDING | Enable MoE padding (sets the padding size to 128 if the value is 1; often set to 1 in Docker builds) | false |
| SGLANG_CUTLASS_MOE (deprecated) | Use the Cutlass FP8 MoE kernel on Blackwell GPUs (deprecated; use --moe-runner-backend=cutlass) | false |

Quantization

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_INT4_WEIGHT | Enable INT4 weight quantization | false |
| SGLANG_FORCE_FP8_MARLIN | Force FP8 MARLIN kernels even if other FP8 kernels are available | false |
| SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN | Quantize q_b_proj from BF16 to FP8 when launching a DeepSeek NVFP4 checkpoint | false |
| SGLANG_MOE_NVFP4_DISPATCH | Use NVFP4 for MoE dispatch (with the flashinfer_cutlass or flashinfer_cutedsl MoE runner backend) | "false" |
| SGLANG_NVFP4_CKPT_FP8_NEXTN_MOE | Quantize the MoE of the NextN layer from BF16 to FP8 when launching a DeepSeek NVFP4 checkpoint | false |
| SGLANG_QUANT_ALLOW_DOWNCASTING | Allow weight dtype downcasting during loading (e.g., fp32 → fp16). By default, SGLang rejects this kind of downcasting when using quantization. | false |
| SGLANG_FP8_IGNORED_LAYERS | Comma-separated list of layer names to ignore during FP8 quantization. For example: model.layers.0,model.layers.1.,qkv_proj | "" |
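
A small sketch of how the comma-separated SGLANG_FP8_IGNORED_LAYERS list might be applied (is_fp8_ignored is a hypothetical helper; whether SGLang matches by substring or strict prefix is an assumption of this sketch):

```python
import os

def is_fp8_ignored(layer_name: str) -> bool:
    """Hypothetical helper: skip FP8 quantization for any layer whose
    name contains one of the comma-separated fragments."""
    raw = os.environ.get("SGLANG_FP8_IGNORED_LAYERS", "")
    patterns = [p.strip() for p in raw.split(",") if p.strip()]
    return any(p in layer_name for p in patterns)

# With SGLANG_FP8_IGNORED_LAYERS="model.layers.0,qkv_proj", every
# sub-module of layer 0 and every qkv_proj stays unquantized.
os.environ["SGLANG_FP8_IGNORED_LAYERS"] = "model.layers.0,qkv_proj"
```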

Distributed Computing

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_BLOCK_NONZERO_RANK_CHILDREN | Control blocking of non-zero-rank child processes | 1 |
| SGLANG_IS_FIRST_RANK_ON_NODE | Indicates whether the current process is the first rank on its node | "true" |
| SGLANG_PP_LAYER_PARTITION | Pipeline-parallel layer partition specification | Not set |
| SGLANG_ONE_VISIBLE_DEVICE_PER_PROCESS | Expose one visible device per process for distributed computing | false |

PD Disaggregation — Staging Buffer (Heterogeneous TP)

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_DISAGG_STAGING_BUFFER | Enable the GPU staging buffer for heterogeneous-TP KV transfer. Required when prefill and decode use different TP/attention-TP sizes. Only for non-MLA models (e.g., GQA, MHA). | false |
| SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB | Prefill-side per-worker staging buffer size in MB. Used for gathering KV head slices before the bulk RDMA transfer. | 64 |
| SGLANG_DISAGG_STAGING_POOL_SIZE_MB | Decode-side ring-buffer pool total size in MB. A shared buffer receiving RDMA data from all prefill ranks; larger values support higher concurrency. | 4096 |
| SGLANG_STAGING_USE_TORCH | Force the PyTorch gather/scatter fallback instead of the Triton fused kernels for staging operations. Useful for debugging. | false |

Testing & Debugging (Internal/CI)

These variables are primarily used for internal testing, continuous integration, or debugging.
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_IS_IN_CI | Indicates running in a CI environment | false |
| SGLANG_IS_IN_CI_AMD | Indicates running in an AMD CI environment | false |
| SGLANG_TEST_RETRACT | Enable retract-decode testing | false |
| SGLANG_TEST_RETRACT_NO_PREFILL_BS | When SGLANG_TEST_RETRACT is enabled, no prefill is performed if the batch size exceeds this value. | 2 ** 31 |
| SGLANG_RECORD_STEP_TIME | Record step time for profiling | false |
| SGLANG_TEST_REQUEST_TIME_STATS | Test request time statistics | false |
| SGLANG_DEBUG_SYMM_MEM | Enable debug checks verifying that tensors passed to NCCL communication ops are allocated in the symmetric memory pool. Logs warnings (rank 0 only) with stack traces for any tensor not in the pool. | false |
| SGLANG_KERNEL_API_LOGLEVEL | Controls crash-debug kernel API logging: 0 disables logging, 1 logs API names, 3 logs tensor metadata, 5 adds tensor statistics, and 10 also writes pre-call dump snapshots. | 0 |
| SGLANG_KERNEL_API_LOGDEST | Destination for crash-debug kernel API logs. Use stdout, stderr, or a file path; %i is replaced with the process PID. | stdout |
| SGLANG_KERNEL_API_DUMP_DIR | Output directory for level-10 kernel API input/output dumps; %i is replaced with the process PID. | sglang_kernel_api_dumps |
| SGLANG_KERNEL_API_DUMP_INCLUDE | Comma-separated wildcard patterns for kernel API names to include in level-10 dumps | Not set |
| SGLANG_KERNEL_API_DUMP_EXCLUDE | Comma-separated wildcard patterns for kernel API names to exclude from level-10 dumps | Not set |
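
The %i substitution used by SGLANG_KERNEL_API_LOGDEST and SGLANG_KERNEL_API_DUMP_DIR can be sketched as follows (open_kernel_api_log is a hypothetical helper, not SGLang's implementation):

```python
import os
import sys

def open_kernel_api_log():
    """Hypothetical helper: resolve SGLANG_KERNEL_API_LOGDEST, substituting
    the process PID for %i so concurrent ranks write to separate files."""
    dest = os.environ.get("SGLANG_KERNEL_API_LOGDEST", "stdout")
    if dest == "stdout":
        return sys.stdout
    if dest == "stderr":
        return sys.stderr
    # File path: expand %i to the PID and append.
    return open(dest.replace("%i", str(os.getpid())), "a")
```

Setting the variable to, say, a path ending in kapi.%i.log yields one log file per process, while the default falls through to stdout.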

Profiling & Benchmarking

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_TORCH_PROFILER_DIR | Directory for PyTorch profiler output | /tmp |
| SGLANG_PROFILE_WITH_STACK | Set the with_stack option (bool) for the PyTorch profiler (capture stack traces) | true |
| SGLANG_PROFILE_RECORD_SHAPES | Set the record_shapes option (bool) for the PyTorch profiler (record tensor shapes) | true |
| SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS | Configure BatchSpanProcessor.schedule_delay_millis when tracing is enabled | 500 |
| SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE | Configure BatchSpanProcessor.max_export_batch_size when tracing is enabled | 64 |

Storage & Caching

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_WAIT_WEIGHTS_READY_TIMEOUT | Timeout period for waiting on weights | 120 |
| SGLANG_DISABLE_OUTLINES_DISK_CACHE | Disable the Outlines disk cache | false |
| SGLANG_USE_CUSTOM_TRITON_KERNEL_CACHE | Use SGLang's custom Triton kernel cache implementation for lower overhead (automatically enabled on CUDA) | false |
| SGLANG_HICACHE_DECODE_OFFLOAD_STRIDE | Decode-side incremental KV cache offload stride. Rounded down to a multiple of --page-size (minimum --page-size). If unset, invalid, or <= 0, falls back to --page-size. | Not set (uses --page-size) |
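
The rounding and fallback rule for SGLANG_HICACHE_DECODE_OFFLOAD_STRIDE can be written out as a small sketch (resolve_offload_stride is a hypothetical helper):

```python
def resolve_offload_stride(raw, page_size: int) -> int:
    """Hypothetical helper: round the stride down to a multiple of
    --page-size, never below --page-size; an unset, invalid, or <= 0
    value falls back to --page-size."""
    try:
        stride = int(raw) if raw is not None else 0
    except (TypeError, ValueError):
        stride = 0
    if stride <= 0:
        return page_size          # fallback: use --page-size
    return max(page_size, (stride // page_size) * page_size)
```

For example, with a page size of 32, a configured stride of 100 is rounded down to 96, while a stride of 10 is clamped up to the page size.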

Function Calling / Tool Use

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_TOOL_STRICT_LEVEL | Controls the strictness of tool-call parsing and validation. <br>Level 0: Off - no strict validation. <br>Level 1: Function strict - enables structural tag constraints for all tools (even if none have strict=True set). <br>Level 2: Parameter strict - enforces strict parameter validation for all tools, treating them as if they all had strict=True set. | 0 |
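
The three levels map naturally onto an integer enum. A minimal sketch (the enum and member names are assumptions of this sketch; only the numeric values come from the documentation above):

```python
import os
from enum import IntEnum

class ToolStrictLevel(IntEnum):
    """Hypothetical names for the documented levels 0-2."""
    OFF = 0               # Level 0: no strict validation
    FUNCTION_STRICT = 1   # Level 1: structural tag constraints for all tools
    PARAMETER_STRICT = 2  # Level 2: strict parameter validation for all tools

def current_tool_strict_level() -> ToolStrictLevel:
    return ToolStrictLevel(int(os.environ.get("SGLANG_TOOL_STRICT_LEVEL", "0")))
```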