SGLang supports various environment variables that configure its runtime behavior. This document provides a comprehensive list and aims to stay updated over time.

Note: The canonical prefix for all SGLang environment variables is SGLANG_. The legacy SGL_ prefix is deprecated: any SGL_* variable set in the environment is automatically rewritten to its SGLANG_* equivalent at import time with a deprecation warning, and the alias will be removed in a future release. A few variables keep an upstream/vendor prefix (e.g. MOONCAKE_*, ASCEND_*) because that is their canonical name.
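
The prefix migration described above can be pictured with a short sketch. This is not SGLang's actual implementation: the function name `rewrite_legacy_prefix` is hypothetical, and two behaviors are assumptions made for illustration (an explicitly set SGLANG_* value wins over the legacy one, and the legacy key is left in place).

```python
import warnings

LEGACY_PREFIX = "SGL_"
CANONICAL_PREFIX = "SGLANG_"


def rewrite_legacy_prefix(environ):
    """Mirror SGL_* entries under SGLANG_* names; return the (old, new) pairs."""
    renamed = []
    for name in list(environ):
        if not name.startswith(LEGACY_PREFIX):
            continue
        new_name = CANONICAL_PREFIX + name[len(LEGACY_PREFIX):]
        # Assumption: a value already set under the canonical name is kept.
        if new_name not in environ:
            environ[new_name] = environ[name]
        warnings.warn(
            f"{name} is deprecated; use {new_name} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        renamed.append((name, new_name))
    return renamed
```

Passing a plain dict (or `os.environ`) makes the mapping easy to check before relying on it in a launch script.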

General Configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_USE_MODELSCOPE | Enable using models from ModelScope | false |
| SGLANG_HOST_IP | Host IP address for the server | 0.0.0.0 |
| SGLANG_PORT | Port for the server | auto-detected |
| SGLANG_LOGGING_CONFIG_PATH | Custom logging configuration path | Not set |
| SGLANG_LOG_REQUEST_HEADERS | Comma-separated list of additional HTTP headers to log when --log-requests is enabled. Appends to the default x-smg-routing-key. | Not set |
| SGLANG_HEALTH_CHECK_TIMEOUT | Timeout for health check in seconds | 20 |
| SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL | The interval of passes to collect the metric of selected count of physical experts on each layer and GPU rank. 0 means disabled. | 0 |
| SGLANG_EPLB_ROCM_P2P_BATCH_CHUNK_SIZE | Number of logical expert IDs per batch when submitting P2P ops during EPLB rebalance on ROCm. Smaller values prevent RCCL GPU-side accumulation hangs but increase overhead. | 32 |
| SGLANG_FORWARD_UNKNOWN_TOOLS | Forward unknown tool calls to clients instead of dropping them | false (drop unknown tools) |
| SGLANG_REQ_WAITING_TIMEOUT | Timeout (in seconds) for requests waiting in the queue before being scheduled | -1 |
| SGLANG_REQ_RUNNING_TIMEOUT | Timeout (in seconds) for requests running in the decode batch | -1 |
| SGLANG_CACHE_DIR | Cache directory for model weights and other data | ~/.cache/sglang |
| SGLANG_PREFETCH_BLOCK_SIZE_MB | Block size (in MB) for sequential checkpoint prefetch reads that warm the OS page cache before workers load weights via mmap | 16 |
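
Many of the variables in this document are boolean flags with a default of false or true. A launch script that wants to inspect such a flag can use a small helper like the one below. This is a sketch only: the helper name `env_flag` is hypothetical, and the accepted spellings (`1`/`true` and `0`/`false`, case-insensitive) are an assumption rather than a statement of what SGLang itself accepts.

```python
import os

TRUE_VALUES = {"1", "true"}
FALSE_VALUES = {"0", "false"}


def env_flag(name, default=False, environ=os.environ):
    """Read a boolean flag from the environment, falling back to `default`."""
    raw = environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    # Assumption: unknown spellings are rejected instead of silently ignored.
    raise ValueError(f"{name}={raw!r} is not a recognized boolean value")
```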

Performance Tuning

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_ENABLE_TORCH_INFERENCE_MODE | Control whether to use torch.inference_mode | false |
| SGLANG_ENABLE_TORCH_COMPILE | Enable torch.compile | false |
| SGLANG_SET_CPU_AFFINITY | Enable CPU affinity setting (often set to 1 in Docker builds) | false |
| SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN | Allows the scheduler to overwrite longer context length requests (often set to 1 in Docker builds) | false |
| SGLANG_IS_FLASHINFER_AVAILABLE | Control FlashInfer availability check | true |
| SGLANG_FLASHINFER_AUTOTUNE_CACHE | Reuse persisted FlashInfer autotune results from SGLANG_CACHE_DIR across runs. Set to 0 to force re-autotuning on every startup; the fresh result is written to a `runs/<rank>.<timestamp>.json` sibling file (the canonical cache is left untouched). | true |
| SGLANG_SKIP_P2P_CHECK | Skip P2P (peer-to-peer) access check | false |
| SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD | Sets the threshold for enabling chunked prefix caching | 8192 |
| SGLANG_MAX_KV_CHUNK_CAPACITY | Maximum number of tokens in each KV chunk for DeepSeek MHA chunked prefix cache | 131072 |
| SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION | Enable RoPE fusion in Fused Multi-Layer Attention | 1 |
| SGLANG_DISABLE_CONSECUTIVE_PREFILL_OVERLAP | Disable overlap schedule for consecutive prefill batches | false |
| SGLANG_SCHEDULER_MAX_RECV_PER_POLL | Set the maximum number of requests per poll, with a negative value indicating no limit | -1 |
| SGLANG_DATA_PARALLEL_BUDGET_INTERVAL | Interval for DPBudget updates | 1 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DEFAULT | Default weight value for scheduler recv skipper counter (used when forward mode doesn’t match specific modes). Only active when --scheduler-recv-interval > 1. The counter accumulates weights and triggers request polling when reaching the interval threshold. | 1000 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DECODE | Weight increment for decode forward mode in scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency during decode phase. | 1 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_TARGET_VERIFY | Weight increment for target verify forward mode in scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency during verification phase. | 1 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_NONE | Weight increment when forward mode is None in scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency when no specific forward mode is active. | 1 |
| SGLANG_MM_BUFFER_SIZE_MB | Size of preallocated GPU buffer (in MB) for multi-modal feature hashing optimization. When set to a positive value, temporarily moves features to GPU for faster hash computation, then moves them back to CPU to save GPU memory. Larger features benefit more from GPU hashing. Set to 0 to disable. | 0 |
| SGLANG_MM_PRECOMPUTE_HASH | Enable precomputing of hash values for MultimodalDataItem | false |
| SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH | Enable NCCL for gathering when preparing mlp sync batch under overlap scheduler (without this flag gloo is used for gathering) | false |
| SGLANG_SYMM_MEM_PREALLOC_GB_SIZE | Size of preallocated GPU buffer (in GB) for NCCL symmetric memory pool to limit memory fragmentation. Only has an effect when server arg --enable-symm-mem is set. | -1 |
| SGLANG_SKIP_SOFTMAX_PREFILL_THRESHOLD_SCALE_FACTOR | Skip-softmax threshold scale factor for TRT-LLM prefill attention in flashinfer. None means standard attention. See https://arxiv.org/abs/2512.12087 | None |
| SGLANG_SKIP_SOFTMAX_DECODE_THRESHOLD_SCALE_FACTOR | Skip-softmax threshold scale factor for TRT-LLM decode attention in flashinfer. None means standard attention. See https://arxiv.org/abs/2512.12087 | None |
| SGLANG_USE_SGL_FA3_KERNEL | Use sgl-kernel implementation for FlashAttention v3 | true |
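
The SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_* entries describe a counter that accumulates a per-mode weight and triggers request polling once it reaches the interval. A minimal sketch of that idea follows. The class name, the string mode keys, and the reset-to-zero behavior after a poll are assumptions for illustration, not SGLang's code.

```python
class RecvSkipper:
    """Accumulate per-forward-mode weights; poll once the interval is reached."""

    def __init__(self, interval, weights, default_weight=1000):
        self.interval = interval              # value of --scheduler-recv-interval
        self.weights = weights                # e.g. {"decode": 1, "target_verify": 1}
        self.default_weight = default_weight  # used for modes without an entry
        self.counter = 0

    def should_poll(self, forward_mode):
        self.counter += self.weights.get(forward_mode, self.default_weight)
        if self.counter >= self.interval:
            # Assumption: the counter restarts from zero after each poll.
            self.counter = 0
            return True
        return False
```

With an interval of 3 and a decode weight of 1, every third decode step polls for new requests, while a mode that falls back to the default weight of 1000 polls immediately.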

DeepGEMM Configuration (Advanced Optimization)

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_ENABLE_JIT_DEEPGEMM | Enable Just-In-Time compilation of DeepGEMM kernels (enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) GPUs when the DeepGEMM package is installed; set to "0" to disable) | "true" |
| SGLANG_JIT_DEEPGEMM_PRECOMPILE | Enable precompilation of DeepGEMM kernels | "true" |
| SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS | Number of workers for parallel DeepGEMM kernel compilation | 4 |
| SGLANG_IN_DEEPGEMM_PRECOMPILE_STAGE | Indicator flag used during the DeepGEMM precompile script | "false" |
| SGLANG_DG_CACHE_DIR | Directory for caching compiled DeepGEMM kernels | ~/.cache/deep_gemm |
| SGLANG_DG_USE_NVRTC | Use NVRTC (instead of Triton) for JIT compilation (Experimental) | "false" |
| SGLANG_USE_DEEPGEMM_BMM | Use DeepGEMM for Batched Matrix Multiplication (BMM) operations | "false" |
| SGLANG_JIT_DEEPGEMM_FAST_WARMUP | Precompile fewer kernels during warmup, which reduces the warmup time from 30 min to less than 3 min. Might cause performance degradation at runtime. | "false" |

DeepEP Configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK | The maximum number of dispatched tokens on each GPU | "128" |
| SGLANG_FLASHINFER_NUM_MAX_DISPATCH_TOKENS_PER_RANK | The maximum number of dispatched tokens on each GPU for --moe-a2a-backend=flashinfer | "1024" |
| SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS | Number of SMs used for DeepEP combine when single batch overlap is enabled | "32" |
| SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO | Run shared experts on an alternate stream when single batch overlap is enabled on GB200. When this flag is not set, shared experts and down gemm are overlapped together with DeepEP combine. | "false" |
| SGLANG_DISABLED_MODEL_ARCHS | Comma-separated list of model architectures to disable from auto-registration. | Not set |
| SGLANG_SORT_WEIGHT_FILES | Controls weight-file ordering for load-time I/O optimization. -1 disables sorting/staggering (original order); 0 sorts files only; a value k greater than 0 sorts and staggers per-rank order with factor k for better multi-rank I/O concurrency. | 0 |
| SGLANG_RETURN_ORIGINAL_LOGPROB | Return the original (pre-temperature) logprobs instead of the post-sampling values. | false |
| SGLANG_ENABLE_COLOCATED_BATCH_GEN | Enable colocated batch generation. | false |
| SGLANG_ENABLE_MOE_DEFERRED_FINALIZE | Defer the MoE finalize step to overlap it with other work. | false |
| SGLANG_PATCH_TOKENIZER | Patch the tokenizer to cache all_special_tokens/all_special_ids (notably for Kimi tiktoken, where ITL can otherwise regress under high batch size). | true |
| SGLANG_ENABLE_LOGITS_PROCESSER_CHUNK | Process logits in chunks to reduce peak memory. | false |
| SGLANG_LOGITS_PROCESSER_CHUNK_SIZE | Chunk size (in tokens) used when logits-processor chunking is enabled. | 2048 |
| SGLANG_FLASHINFER_USE_PAGED | Use the paged FlashInfer attention path. | false |
| SGLANG_FLASHINFER_WORKSPACE_SIZE | FlashInfer workspace size in bytes (default ≈ 384 MiB). | 402653184 |
| SGLANG_PREP_IN_CUDA_GRAPH | Capture input preparation inside the CUDA graph. | true |
| SGLANG_EAGER_INPUT_NO_COPY | In eager forward, wrap the ForwardBatch’s own tensors instead of copying them into the CUDA graph buffer registry (skips a per-iter device-to-device copy). | false |
| SGLANG_DEEPGEMM_SANITY_CHECK | Run extra sanity checks on DeepGEMM kernels. | false |
| SGLANG_DEEPGEMM_PDL | Enable Programmatic Dependent Launch (PDL) for DeepGEMM kernels. | true |
| SGLANG_PP_PARALLEL_DEEPGEMM_WARMUP | Run DeepGEMM warmup in parallel across pipeline-parallel ranks. | false |
| SGLANG_DISABLE_STATIC_WATERFILL | Force dynamic DeepEP waterfill with runtime EP all-reduce instead of the default static local-batch path. | false |
| SGLANG_NIXL_EP_BF16_DISPATCH | Use BF16 for NIXL-EP dispatch. | false |
| SGLANG_NIXL_EP_NUM_MAX_DISPATCH_TOKENS_PER_RANK | Maximum number of dispatched tokens per GPU for NIXL-EP. | 128 |

MORI Configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_MORI_DISPATCH_DTYPE | Override MoRI-EP dispatch quantization type. auto uses auto-detection from weight dtype; bf16/fp8/fp4 forces the specified type for all layers | "auto" |
| SGLANG_MORI_FP8_COMB | Use FP8 for combine | "false" |
| MORI_DISABLE_AUTO_XGMI | Set to 0 to allow Mori to automatically use XGMI for same-node PD disaggregation when no active RDMA device is available. | unset |
| SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK | Maximum number of dispatch tokens per rank for MORI-EP buffer allocation | 4096 |
| SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD | Threshold for switching between InterNodeV1 and InterNodeV1LL kernel types. InterNodeV1LL is used if SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK is less than or equal to this threshold; otherwise, InterNodeV1 is used. | 256 |
| SGLANG_MORI_PREALLOC_MAX_RECV_TOKENS | Custom number of tokens preallocated for a rank, bounded by SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK: the valid range is 1 to world_size*SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK, and the default 0 means the maximum. A smaller value reduces the memory footprint, but a value that is too small can cause buffer overflow. | 0 |
| SGLANG_MORI_MOE_MAX_INPUT_TOKENS | Truncate the dispatch buffer to this many rows before MoE computation, reducing kernel overhead on padding tokens. The value must be >= the actual number of received tokens (totalRecvTokenNum); setting it too small causes incorrect results. 0 disables truncation (use full buffer). | 0 |
| SGLANG_MORI_QP_PER_TRANSFER | Number of RDMA Queue Pairs (QPs) used per transfer operation | 1 |
| SGLANG_MORI_POST_BATCH_SIZE | Number of RDMA work requests posted in a single batch to each QP | -1 |
| SGLANG_MORI_NUM_WORKERS | Number of worker threads in the RDMA executor thread pool | 1 |
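
The kernel-switch rule stated for SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD reduces to a single comparison. The sketch below restates it with the documented defaults. The function name is hypothetical and the kernel types are returned as plain strings for illustration only.

```python
def select_inter_node_kernel(num_max_dispatch_tokens_per_rank=4096, switch_threshold=256):
    """Pick the kernel type from the per-rank dispatch limit and the threshold."""
    if num_max_dispatch_tokens_per_rank <= switch_threshold:
        return "InterNodeV1LL"
    return "InterNodeV1"
```

With both defaults in place (4096 tokens against a threshold of 256), the rule selects InterNodeV1. Lowering the per-rank limit to 256 or less switches to InterNodeV1LL.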

DSA Backend Configuration (For DeepSeek V3.2)

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_DSA_FUSE_TOPK | Fuse the operation of picking topk logits and picking topk indices from page table. SGLANG_NSA_FUSE_TOPK is a deprecated alias. | true |
| SGLANG_DSA_TOPK_FLASHINFER_DETERMINISTIC | Use deterministic FlashInfer topk kernels when --dsa-topk-backend=flashinfer. | false |
| SGLANG_DSA_TOPK_FLASHINFER_TIE_BREAK | Tie-break mode for FlashInfer DSA topk when --dsa-topk-backend=flashinfer: unset disables explicit tie-breaking, small prefers the smaller candidate index for equal scores, and large prefers the larger candidate index for equal scores. Setting this variable makes FlashInfer use deterministic topk. | unset |
| SGLANG_DSA_ENABLE_MTP_PRECOMPUTE_METADATA | Precompute metadata that can be shared among different draft steps when MTP is enabled. SGLANG_NSA_ENABLE_MTP_PRECOMPUTE_METADATA is a deprecated alias. | true |
| SGLANG_USE_FUSED_METADATA_COPY | Control whether to use fused metadata copy kernel for cuda graph replay | true |
| SGLANG_DSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD | When the maximum kv len in the current prefill batch exceeds this value, the sparse mla kernel is applied; otherwise it falls back to the dense MHA implementation. Defaults to the index topk of the model (2048 for DeepSeek V3.2). SGLANG_NSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD is a deprecated alias. | 2048 |
| SGLANG_DSA_TOPK_BROADCAST | Experimental. When enabled, broadcast the finalized NSA/DSA indexer top-k result from attention TP rank 0 to the other attention TP ranks. This can mitigate top-k mismatches in TP attention runs at the cost of some speed. | false |
| SGLANG_MORI_SEND_AUX_RDMA | Send CPU-resident AUX data via RDMA instead of ZMQ TCP. | false |
| SGLANG_MORI_TRANSFER_SHARDS | Number of sharded synchronous worker threads draining KV transfers; also bounds outstanding transfers (primary RDMA send-queue throttle). | 8 |
| SGLANG_MORI_WAIT_POLL_MS | Poll cadence (ms) at which a transfer worker wakes to check the SLA while waiting for completion. | 1000 |
| SGLANG_MORI_TRANSFER_TIMEOUT_MS | Per-transfer SLA (ms) before a KV transfer is failed; 0 disables the SLA. | 0 |
| SGLANG_DSA_HIP_DISABLE_PRESHUFFLE | Disable weight pre-shuffle on the HIP DSA path. SGLANG_NSA_HIP_DISABLE_PRESHUFFLE is a deprecated alias. | false |
| SGLANG_DSA_MQA_LOGITS_FREE_MEM_FRACTION | Fraction of free memory the MQA-logits step may use on the DSA path. | 0.2 |

Memory Management

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_DEBUG_MEMORY_POOL | Enable memory pool debugging | false |
| SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION | Clip max new tokens estimation for memory planning | 4096 |
| SGLANG_DETOKENIZER_MAX_STATES | Maximum states for detokenizer | Default value based on system |
| SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK | Enable checks for memory imbalance across Tensor Parallel ranks | true |
| SGLANG_MOONCAKE_CUSTOM_MEM_POOL | Configure the custom memory pool type for Mooncake. Supports NVLINK, BAREX, INTRA_NODE_NVLINK. If set to true, it defaults to NVLINK. | None |

Model-Specific Options

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_USE_AITER | Use the AITER optimized implementation | false |
| SGLANG_ROCM_USE_MULTI_STREAM | Allocate alt CUDA/HIP stream on ROCm/AITER to overlap shared and routed experts in DeepseekV2 MoE. Requires the HIP env GPU_MAX_HW_QUEUES>=5 (default 4, the cap on HSA/ROCr HW queues HIP creates) so the alt stream gets its own queue instead of serializing with the main stream. Best paired with --deepep-mode low_latency so Mori’s AsyncLL kernel offloads dispatch/combine to copy engines and frees CUs. | false |
| SGLANG_MOE_PADDING | Enable MoE padding (sets padding size to 128 if value is 1, often set to 1 in Docker builds) | false |
| SGLANG_CUTLASS_MOE (deprecated) | Use Cutlass FP8 MoE kernel on Blackwell GPUs (deprecated, use --moe-runner-backend=cutlass) | false |
| SGLANG_USE_FUSED_PARALLEL_QKNORM | Use the fused parallel QK RMSNorm kernel for MiniMax-M2.x on CUDA when attention TP size > 1 | false |
| SGLANG_ENABLE_STRICT_MEM_CHECK_DURING_BUSY | Enable strict memory checks while the scheduler is busy. | 0 |
| SGLANG_ENABLE_STRICT_MEM_CHECK_DURING_IDLE | Enable strict memory checks while the scheduler is idle. | true |
| SGLANG_NATIVE_MOVE_KV_CACHE | Use the native implementation to move KV cache entries. | false |
| SGLANG_USE_BREAKABLE_CUDA_GRAPH | Use a breakable CUDA graph so it can be interrupted/rebuilt at runtime. | false |
| SGLANG_MEMORY_SAVER_CUDA_GRAPH | Allow CUDA graphs under the release/resume memory saver. | false |
| SGLANG_GEMMA_OUT_OF_PLACE_POSITION_MUTATION | Use out-of-place position mutation for Gemma models. | false |
| SGLANG_MAMBA_CONV_DTYPE | dtype for the Mamba convolution state. | bfloat16 |
| SGLANG_MAMBA_SSM_DTYPE | dtype for the Mamba SSM state (defaults to the model dtype when unset). | Not set |
| SGLANG_EMBEDDINGS_SPARSE_HEAD | Name of the sparse-embeddings head to expose for embedding models. | Not set |
| SGLANG_DSV4_FP4_EXPERTS | Whether DeepSeek V4 experts use FP4. Set to false when using an FP4-to-FP8 converted DeepSeek V4 checkpoint. | true |
| SGLANG_DSV4_REASONING_EFFORT | Default reasoning_effort for the DeepSeek V4 chat encoder when a request does not set it (accepts max, high; empty means unset). | "" |

Quantization

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_INT4_WEIGHT | Enable INT4 weight quantization | false |
| SGLANG_FORCE_FP8_MARLIN | Force using FP8 MARLIN kernels even if other FP8 kernels are available | false |
| SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN | Quantize q_b_proj from BF16 to FP8 when launching DeepSeek NVFP4 checkpoint | false |
| SGLANG_MOE_NVFP4_DISPATCH | Use nvfp4 for moe dispatch (on flashinfer_cutlass or flashinfer_cutedsl moe runner backend) | "false" |
| SGLANG_FLASHINFER_NVFP4_PER_TOKEN_ACTIVATION | Enable FlashInfer TRTLLM NVFP4 per-token activation scaling; ignores checkpoint activation FP32 scale by treating it as 1 | false |
| FLASHINFER_NVFP4_4OVER6 | Enable FlashInfer NVFP4 4over6 scaling for NVFP4 per-token activation and online NVFP4 MoE weight quantization paths | false |
| FLASHINFER_NVFP4_4OVER6_E4M3_USE_256 | Use 256 as the E4M3 scale maximum for FlashInfer NVFP4 4over6 scaling; otherwise uses 448 | false |
| SGLANG_NVFP4_CKPT_FP8_NEXTN_MOE | Quantize moe of nextn layer from BF16 to FP8 when launching DeepSeek NVFP4 checkpoint | false |
| SGLANG_QUANT_ALLOW_DOWNCASTING | Allow weight dtype downcasting during loading (e.g., fp32 → fp16). By default, SGLang rejects this kind of downcasting when using quantization. | false |
| SGLANG_FP8_IGNORED_LAYERS | A comma-separated list of layer names to ignore during FP8 quantization. For example: model.layers.0,model.layers.1.,qkv_proj. | "" |
| SGLANG_FP4_IGNORED_LAYERS | A comma-separated list of layer names to keep out of FP4 online quantization, including nvfp4_online. For example: model.layers.40,model.layers.41. | "" |

Distributed Computing

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_BLOCK_NONZERO_RANK_CHILDREN | Control blocking of non-zero rank children processes | 1 |
| SGLANG_IS_FIRST_RANK_ON_NODE | Indicates if the current process is the first rank on its node | "true" |
| SGLANG_PP_LAYER_PARTITION | Pipeline parallel layer partition specification | Not set |
| SGLANG_ONE_VISIBLE_DEVICE_PER_PROCESS | Set one visible device per process for distributed computing | false |
| SGLANG_RAY_BUNDLE_INDICES | Comma-separated bundle indices for Ray actor placement (e.g., "0,1,2,3"). Must match world_size. Enables fine-grained GPU assignment in custom placement groups. | Not set |
| SGLANG_CPU_QUANTIZATION | Enable CPU-side quantization. | false |
| SGLANG_USE_DYNAMIC_MXFP4_LINEAR | Use dynamic MXFP4 quantization for linear layers. | false |
| USE_TRITON_W8A8_FP8_KERNEL | Use the Triton W8A8 FP8 kernel. | false |
| SGLANG_USE_MESSAGE_QUEUE_BROADCASTER | Use the shared-memory message-queue broadcaster for inter-process tensor broadcast. | true |
| SGLANG_DISTRIBUTED_INIT_METHOD_OVERRIDE | Override the init method used by torch.distributed.init_process_group. Set to env:// to use an externally-created TCPStore via MASTER_ADDR/MASTER_PORT. | Not set |
| SGLANG_TCP_STORE_PORT | Port for the torch.distributed TCPStore. | 29600 |
| SGLANG_SYNC_TOKEN_IDS_ACROSS_TP | Synchronize sampled token ids across tensor-parallel ranks. | false |

PD Disaggregation — Staging Buffer (Heterogeneous TP)

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_DISAGG_STAGING_BUFFER | Enable GPU staging buffer for heterogeneous TP KV transfer. Required when prefill and decode use different TP/attention-TP sizes. Only for non-MLA models (e.g. GQA, MHA). | false |
| SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB | Prefill-side per-worker staging buffer size in MB. Used for gathering KV head slices before bulk RDMA transfer. | 64 |
| SGLANG_DISAGG_STAGING_POOL_SIZE_MB | Decode-side ring buffer pool total size in MB. Shared buffer receiving RDMA data from all prefill ranks. Larger values support higher concurrency. | 4096 |
| SGLANG_STAGING_USE_TORCH | Force using PyTorch gather/scatter fallback instead of Triton fused kernels for staging operations. Useful for debugging. | false |

Testing & Debugging (Internal/CI)

These variables are primarily used for internal testing, continuous integration, or debugging.

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_IS_IN_CI | Indicates if running in CI environment | false |
| SGLANG_IS_IN_CI_AMD | Indicates running in AMD CI environment | false |
| SGLANG_TEST_RETRACT | Enable retract decode testing | false |
| SGLANG_TEST_RETRACT_NO_PREFILL_BS | When SGLANG_TEST_RETRACT is enabled, no prefill is performed if the batch size exceeds SGLANG_TEST_RETRACT_NO_PREFILL_BS. | 2 ** 31 |
| SGLANG_RECORD_STEP_TIME | Record step time for profiling | false |
| SGLANG_TEST_REQUEST_TIME_STATS | Test request time statistics | false |
| SGLANG_DEBUG_SYMM_MEM | Enable debug checks that verify tensors passed to NCCL communication ops are allocated in the symmetric memory pool. Logs warnings (rank 0 only) with stack traces for any tensor not in the pool. | false |
| SGLANG_KERNEL_API_LOGLEVEL | Controls crash-debug kernel API logging. 0 disables logging, 1 logs API names, 3 logs tensor metadata, 5 adds tensor statistics, and 10 also writes pre-call dump snapshots. | 0 |
| SGLANG_KERNEL_API_LOGDEST | Destination for crash-debug kernel API logs. Use stdout, stderr, or a file path. %i is replaced with the process PID. | stdout |
| SGLANG_KERNEL_API_DUMP_DIR | Output directory for level-10 kernel API input/output dumps. %i is replaced with the process PID. | sglang_kernel_api_dumps |
| SGLANG_KERNEL_API_DUMP_INCLUDE | Comma-separated wildcard patterns for kernel API names to include in level-10 dumps. | Not set |
| SGLANG_KERNEL_API_DUMP_EXCLUDE | Comma-separated wildcard patterns for kernel API names to exclude from level-10 dumps. | Not set |
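
The %i placeholder in SGLANG_KERNEL_API_LOGDEST and SGLANG_KERNEL_API_DUMP_DIR gives each process its own file or directory. A small sketch of the substitution is shown below. The helper name `resolve_log_dest` is hypothetical, and passing stdout/stderr through unchanged is an assumption about how the special values are handled.

```python
import os


def resolve_log_dest(dest, pid=None):
    """Expand %i to the process PID; leave the stdout/stderr keywords untouched."""
    pid = os.getpid() if pid is None else pid
    if dest in ("stdout", "stderr"):
        return dest
    return dest.replace("%i", str(pid))
```

For example, a destination of `/tmp/kernel_api_%i.log` keeps the logs of different worker processes from interleaving in one file.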

Profiling & Benchmarking

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_TORCH_PROFILER_DIR | Directory for PyTorch profiler output | /tmp |
| SGLANG_PROFILE_WITH_STACK | Set with_stack option (bool) for PyTorch profiler (capture stack trace) | true |
| SGLANG_PROFILE_RECORD_SHAPES | Set record_shapes option (bool) for PyTorch profiler (record shapes) | true |
| SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS | Config BatchSpanProcessor.schedule_delay_millis if tracing is enabled | 500 |
| SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE | Config BatchSpanProcessor.max_export_batch_size if tracing is enabled | 64 |
| SGLANG_PROFILE_V2 | Use the v2 profiler implementation. | false |
| SGLANG_DETECT_SLOW_RANK | Detect and report ranks that fall behind during collective ops. | false |
| SGLANG_FORCE_SHUTDOWN | Force an immediate process-group shutdown on exit. | false |
| SGLANG_PYSPY_DUMP_BEFORE_CRASH | Capture a py-spy stack dump of all processes before crashing. | true |
| SGLANG_CUDA_COREDUMP | Enable CUDA coredump generation (auto-injects the required CUDA_* env vars). | false |
| SGLANG_CUDA_COREDUMP_DIR | Directory for CUDA coredumps. If unset, resolves to RUNNER_TEMP in CI, else /tmp. | Not set |
| SGLANG_CUDA_COREDUMP_BEFORE_CRASH | Trigger a CUDA coredump before crashing. | true |
| SGLANG_CUDA_COREDUMP_BEFORE_CRASH_WAIT_SECS | Seconds to wait for the CUDA coredump to finish before exiting. | 60.0 |

Storage & Caching

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_WAIT_WEIGHTS_READY_TIMEOUT | Timeout period for waiting on weights | 120 |
| SGLANG_DISABLE_OUTLINES_DISK_CACHE | Disable Outlines disk cache | false |
| SGLANG_USE_CUSTOM_TRITON_KERNEL_CACHE | Use SGLang’s custom Triton kernel cache implementation for lower overheads (automatically enabled on CUDA) | false |
| SGLANG_HICACHE_DECODE_OFFLOAD_STRIDE | Decode-side incremental KV cache offload stride. Rounded down to a multiple of --page-size (min is --page-size). If unset/invalid/<=0, it falls back to --page-size. | Not set (uses --page-size) |
| SGLANG_HICACHE_NIXL_USE_DIRECT_IO | Enable O_DIRECT for any file-based NIXL backend (POSIX, GDS, GDS_MT, 3FS) when opening cache files (bypasses the OS page cache, reducing memory pressure and improving throughput on NVMe). Can also be disabled via --hicache-storage-backend-extra-config. Falls back to buffered I/O with a warning when O_DIRECT is unavailable on the current OS. | true |
| SGLANG_HUGEPAGE_SIZE | Use huge pages for host KV cache allocations (HiCache / disaggregation offload). Valid values: 2MB (2 MiB pages via MAP_HUGE_2MB) or 1GB (1 GiB pages via MAP_HUGE_1GB). Requires huge pages to be pre-allocated on the host OS (/proc/sys/vm/nr_hugepages or /sys/kernel/mm/hugepages). If the allocation fails, the allocator logs a warning and falls back to regular page-size mmap automatically. | Not set (uses OS default page size) |
| SGLANG_HICACHE_HF3FS_CONFIG_PATH | Path to the HiCache HF3FS backend config file. | Not set |
| SGLANG_HICACHE_FILE_BACKEND_STORAGE_DIR | Storage directory for the HiCache file backend. | Not set |
| SGLANG_HICACHE_FILE_BACKEND_MAX_SIZE | Max size for HiCache file-backend LRU eviction (accepts SI/IEC suffixes; 0 disables eviction). | Not set (eviction off) |
| SGLANG_HICACHE_FILE_BACKEND_EVICTION_RATIO | Target fraction to evict down to when the file-backend max size is reached. | 0.9 |
| SGLANG_HICACHE_FILE_BACKEND_MIN_FREE_SPACE | Minimum free space to keep on the file-backend volume (accepts SI/IEC suffixes). | 0 |
| SGLANG_HICACHE_NIXL_BACKEND_STORAGE_DIR | Storage directory for the HiCache NIXL backend. | Not set |
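
The rounding rule given for SGLANG_HICACHE_DECODE_OFFLOAD_STRIDE (round down to a multiple of the page size, never below one page, fall back to the page size when unset, invalid, or non-positive) can be written out as a short function. This is a sketch of the documented rule only, and the function name `resolve_offload_stride` is hypothetical.

```python
def resolve_offload_stride(raw, page_size):
    """Apply the documented stride rule to a raw environment value."""
    try:
        stride = int(raw) if raw is not None else 0
    except ValueError:
        stride = 0  # an invalid value is treated like an unset one
    if stride <= 0:
        return page_size
    # Round down to a multiple of page_size, but never below one page.
    return max(page_size, (stride // page_size) * page_size)
```

With a page size of 16, a requested stride of 100 becomes 96, and a requested stride of 10 becomes 16.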

Function Calling / Tool Use

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_TOOL_STRICT_LEVEL | Controls the strictness level of tool call parsing and validation. <br>Level 0: Off - No strict validation <br>Level 1: Function strict - Enables structural tag constraints for all tools (even if none have strict=True set) <br>Level 2: Parameter strict - Enforces strict parameter validation for all tools, treating them as if they all have strict=True set | 0 |
| SGLANG_DEFAULT_THINKING | Enable model thinking/reasoning output by default. | false |
| SGLANG_MAX_THINK_TOKENS | Cap on thinking tokens. Negative means unlimited; 0 or greater caps the count. | -1 |

Logging & Observability

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_LOG_GC | Log Python garbage-collection pauses. | false |
| SGLANG_LOG_FORWARD_ITERS | Log each forward iteration. | false |
| SGLANG_LOG_MS | Log per-step timing in milliseconds. | false |
| SGLANG_LOG_REQUEST_EXCEEDED_MS | Log requests whose processing time exceeds this many milliseconds. -1 disables. | -1 |
| SGLANG_LOG_SCHEDULER_STATUS_TARGET | Target (e.g. a file path) for periodic scheduler-status logging. | "" |
| SGLANG_LOG_SCHEDULER_STATUS_INTERVAL | Interval (seconds) between scheduler-status log lines. | 60.0 |

Constrained Decoding (Grammar)

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_GRAMMAR_POLL_INTERVAL | Poll interval (seconds) for asynchronous grammar compilation. | 0.005 |
| SGLANG_GRAMMAR_MAX_POLL_ITERATIONS | Maximum poll iterations before grammar compilation is treated as stuck. | 10000 |

Scheduler & Batching

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_INIT_NEW_TOKEN_RATIO | Initial new-token ratio used for memory planning. | 0.7 |
| SGLANG_MIN_NEW_TOKEN_RATIO_FACTOR | Floor factor for the new-token ratio after decay. | 0.14 |
| SGLANG_NEW_TOKEN_RATIO_DECAY_STEPS | Number of steps over which the new-token ratio decays. | 600 |
| SGLANG_RETRACT_DECODE_STEPS | Number of decode steps to look ahead when deciding to retract. | 20 |
| SGLANG_EMPTY_CACHE_INTERVAL | Interval (seconds) at which to empty the device cache; set this if memory accumulates over a long serving period. -1 disables. | -1 |
| SGLANG_FORCE_STREAM_INTERVAL | For non-streaming requests, flush intermediate output batches to the tokenizer manager every N decoded tokens (lower to 1 for accurate TTFT benchmarking). | 50 |
| SGLANG_DYNAMIC_CHUNKING_SMOOTH_FACTOR | Smoothing factor for dynamic prefill chunking. | 0.75 |
| SGLANG_SWA_EVICTION_INTERVAL_MULTIPLIER | Multiplier applied to the sliding-window-attention eviction interval. | 1.0 |
| SGLANG_ENABLE_UNIFIED_RADIX_TREE | Use the unified radix-tree cache implementation. | false |
| SGLANG_EXPERIMENTAL_CPP_RADIX_TREE | Use the experimental C++ radix-tree implementation. | false |
| SGLANG_RADIX_FORCE_MISS | Force radix-cache misses (debugging/benchmarking). | false |
| SGLANG_SCHEDULER_SKIP_ALL_GATHER | Skip the scheduler all-gather step. | false |
| SGLANG_ENABLE_WAR_BARRIER | Force-enable the write-after-read barrier for the overlap scheduler even when CUDA is not detected (e.g. AMD/ROCm). On CUDA the barrier is always enabled. | false |
| SGLANG_PP_SKIP_PURE_CHUNKED_OUTPUT_COMM | In pipeline parallel, skip output send/recv when a batch is entirely non-final chunked-prefill requests. | false |
| SGLANG_KILLPG_ON_SCHEDULER_EXCEPTION | Kill the whole process group when the scheduler raises an exception. | false |
| SGLANG_REQUEST_STATE_WAIT_TIMEOUT | Tokenizer-manager request-state wait timeout (seconds). | 4 |
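
The three new-token-ratio variables (initial ratio, floor factor, decay steps) interact, and a worked sketch makes the default numbers concrete. The code assumes a linear decay from the initial ratio down to `initial * floor factor` over the configured number of steps. That decay shape and the function name are assumptions for illustration, and the scheduler may apply further scaling that is not modeled here.

```python
def new_token_ratio_schedule(init_ratio=0.7, min_factor=0.14, decay_steps=600):
    """Return (floor, per-step decay, ratio_at) under an assumed linear decay."""
    min_ratio = init_ratio * min_factor
    decay = (init_ratio - min_ratio) / decay_steps

    def ratio_at(step):
        # The ratio never drops below the floor, however many steps elapse.
        return max(init_ratio - decay * step, min_ratio)

    return min_ratio, decay, ratio_at
```

With the defaults, the floor is about 0.098 and the ratio shrinks by roughly 0.001 per step until it reaches that floor after 600 steps.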

PD Disaggregation (Runtime)

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_DISAGGREGATION_THREAD_POOL_SIZE | Thread-pool size for KV transfers. Defaults to a value computed from the CPU count at runtime. | Not set (computed at runtime) |
| SGLANG_DISAGGREGATION_QUEUE_SIZE | Disaggregation transfer queue size. | 4 |
| SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT | Timeout (seconds) for the disaggregation bootstrap handshake. | 300 |
| SGLANG_DISAGGREGATION_WAITING_TIMEOUT | Timeout (seconds) for a request waiting on KV transfer. | 300 |
| SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL | Interval (seconds) between disaggregation heartbeats. | 5.0 |
| SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE | Consecutive heartbeat failures tolerated before a peer is considered dead. | 2 |
| SGLANG_DISAGGREGATION_BOOTSTRAP_ENTRY_CLEANUP_INTERVAL | Interval (seconds) for cleaning up stale bootstrap entries. | 120 |
| SGLANG_DISAGGREGATION_NIXL_BACKEND | NIXL transport backend for disaggregation. | UCX |
| SGLANG_DISAGGREGATION_NIXL_BACKEND_PARAMS | JSON parameters passed to the NIXL backend. | |
| SGLANG_DISAGGREGATION_ALL_CP_RANKS_TRANSFER | Have all context-parallel ranks participate in KV transfer. | false |
| SGLANG_DISAGGREGATION_FORCE_QUERY_PREFILL_DP_RANK | Force querying the prefill DP rank for routing. | false |
| SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS | Extra slots in req_to_token_pool for decode workers (effective when max_num_reqs greater than 32), letting more KV transfers overlap decode. | 0 |

Mooncake KV Store & Transfer

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_HICACHE_MOONCAKE_CONFIG_PATH | Path to the HiCache Mooncake store config file. | Not set |
| SGLANG_HICACHE_MOONCAKE_REUSE_TE | Reuse the Mooncake transfer engine across HiCache operations. | true |
| SGLANG_MOONCAKE_SEND_AUX_TCP | Send Mooncake AUX data over TCP. | false |
| SGLANG_ENABLE_FAILED_SESSION_PROBE | Probe failed Mooncake sessions for recovery. | false |
| SGLANG_FAILED_SESSION_PROBE_INTERVAL_S | Interval (seconds) between failed-session probes. | 30.0 |
| MOONCAKE_MASTER | Address of the Mooncake master. | Not set |
| MOONCAKE_CLIENT | Mooncake client identifier. | Not set |
| MOONCAKE_LOCAL_HOSTNAME | Local hostname advertised to Mooncake. | localhost |
| MOONCAKE_TE_META_DATA_SERVER | Mooncake transfer-engine metadata server. | P2PHANDSHAKE |
| MOONCAKE_GLOBAL_SEGMENT_SIZE | Mooncake global segment size. | 4gb |
| MOONCAKE_PROTOCOL | Mooncake transport protocol. | rdma |
| MOONCAKE_DEVICE | Mooncake RDMA device(s). | "" |
| MOONCAKE_MASTER_METRICS_PORT | Port for Mooncake master metrics. | 9003 |
| MOONCAKE_CHECK_SERVER | Check connectivity to the Mooncake server on startup. | false |
| MOONCAKE_STANDALONE_STORAGE | Run Mooncake in standalone storage mode. | false |
| MOONCAKE_ENABLE_SSD_OFFLOAD | Enable SSD offload in Mooncake. | false |
| MOONCAKE_OFFLOAD_FILE_STORAGE_PATH | File storage path for Mooncake SSD offload. | Not set |
| ENABLE_ASCEND_TRANSFER_WITH_MOONCAKE | Enable Ascend NPU transfers via Mooncake. | false |
| ASCEND_NPU_PHY_ID | Physical Ascend NPU id used for Mooncake transfers. -1 auto-detects. | -1 |

Attention & Kernels

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_TRITON_DECODE_ATTN_STATIC_KV_SPLITS | Use static KV splits in the Triton decode-attention kernel. | false |
| SGLANG_MUSA_FA3_FORCE_UPDATE_METADATA | Force FA3 metadata updates on MThreads MUSA. | false |
| SGLANG_SKIP_SGL_KERNEL_VERSION_CHECK | Skip the sgl-kernel version compatibility check. | false |

Deterministic Inference

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_ENABLE_DETERMINISTIC_INFERENCE | Enable deterministic inference (fixed reduction/accumulation order). | false |
| SGLANG_USE_1STAGE_ALLREDUCE | Use the 1-stage all-reduce kernel on AMD (deterministic, fixed accumulation order). If unset, it is auto-enabled when deterministic inference is on. | false |
| SGLANG_FLASHINFER_PREFILL_SPLIT_TILE_SIZE | FlashInfer prefill split-tile size for deterministic attention. | 4096 |
| SGLANG_FLASHINFER_DECODE_SPLIT_TILE_SIZE | FlashInfer decode split-tile size for deterministic attention. | 2048 |
| SGLANG_TRITON_PREFILL_TRUNCATION_ALIGN_SIZE | Triton prefill truncation alignment size for deterministic attention. | 4096 |
| SGLANG_TRITON_DECODE_SPLIT_TILE_SIZE | Triton decode split-tile size for deterministic attention. | 256 |
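
The descriptions above refer to a fixed reduction/accumulation order. The reason order matters is that floating-point addition is not associative, so combining the same partial sums in a different order can change the low bits of the result. The snippet below is a plain-Python illustration of that effect, not SGLang code:

```python
# Floating-point addition is not associative: the same three values
# summed in two different orders give two different doubles.
values = [0.1, 0.2, 0.3]

left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)  # False on IEEE-754 doubles

# Repeating the *same* order always reproduces the same result,
# which is what pinning the accumulation order buys.
again = (values[0] + values[1]) + values[2]
print(left_to_right == again)  # True
```

Split-tile sizes plausibly play the same role for attention kernels: keeping the tiling fixed keeps the order of partial reductions fixed, though the exact mechanism is not described in this table.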

Speculative Decoding & Overlap

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_ENABLE_OVERLAP_PLAN_STREAM | Plan the next step on a separate stream to overlap with the current step (Overlap Spec V2). | false |
| SGLANG_SPEC_ENABLE_STRICT_FILTER_CHECK | Enable strict filter checks in speculative decoding. | true |
| SGLANG_SPEC_SKIP_ZERO_STEP_DRAFT_EXTEND | Skip draft_extend while adaptive spec is at steps=0; saves a draft forward but the draft KV goes stale. | false |
| SGLANG_NGRAM_FORCE_GREEDY_VERIFY | Force greedy verification for the n-gram speculative path. | false |
| SGLANG_SANITIZE_NAN_LOGITS | Sanitize NaN logits before sampling kernels and emit a throttled warning. | true |

EPLB (Expert Parallel Load Balancing)

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR | Output directory for the expert-distribution recorder. | /tmp |
| SGLANG_ENABLE_EPLB_BALANCEDNESS_METRIC | Emit an EPLB balancedness metric. | false |
| SGLANG_LOG_EXPERT_LOCATION_METADATA | Log expert-location metadata. | false |
| SGLANG_EXPERT_LOCATION_UPDATER_LOG_INPUT | Log inputs to the expert-location updater. | false |
| SGLANG_EXPERT_LOCATION_UPDATER_LOG_METRICS | Log metrics from the expert-location updater. | false |

AMD & ROCm

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_USE_AITER_AG | Use the AITER all-gather implementation. | true |
| SGLANG_USE_AITER_UNIFIED_ATTN | Use the AITER unified attention kernel. | false |
| SGLANG_USE_AITER_FP8_PER_TOKEN | Use AITER FP8 per-token quantization. | false |
| SGLANG_USE_AITER_MOE_GU_ITLV | Select the AITER MoE gate/up tile layout: true interleaves, false uses the separated layout. | true |
| SGLANG_AITER_FUSE_RMSNORM_PAD | Fuse the residual-add + RMSNorm + zero-pad triplet before the MoE block via the AITER Triton kernel (TP=1, post-attention layernorm path only). | false |
| SGLANG_AITER_KV_CACHE_LAYOUT | Physical layout for the MHA KV cache on AITER: nhd or vectorized_5d (SHUFFLE layout enabling pa_decode_gluon). | nhd |
| SGLANG_ROCM_FUSED_DECODE_MLA | Use the fused decode MLA kernel on ROCm. | false |
| SGLANG_ROCM_DISABLE_LINEARQUANT | Disable linear-layer quantization on ROCm. | false |

NPU (Ascend)

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_NPU_DISABLE_ACL_FORMAT_WEIGHT | Disable ACL-format weight conversion on NPU. | false |
| SGLANG_NPU_USE_MULTI_STREAM | Use multiple streams on NPU. | false |
| SGLANG_NPU_USE_MLAPO | Use the MLAPO path on NPU. | false |
| SGLANG_NPU_FUSED_MOE_MODE | Fused MoE mode selector for NPU. | 1 |
| SGLANG_NPU_FORWARD_NATIVE_GELUTANH | Use the native gelu-tanh activation forward (for Skywork-Reward-Gemma-2-27B-v0.2). | false |
| SGLANG_NPU_FORWARD_NATIVE_GEMMA_RMS_NORM | Use the native Gemma RMSNorm forward (for Skywork-Reward-Gemma-2-27B-v0.2). | false |
| SGLANG_USE_AG_AFTER_QLORA | Delay all-gather until after QLoRA for better DeepSeek V3.2 performance. | false |
| SGLANG_EXPERIMENTAL_LORA_OPTI | Master switch for the experimental TRT-LLM LoRA fast path. When off, all fine-grained opt switches read false. | false |
| SGLANG_ZBAL_LOCAL_MEM_SIZE | Local memory size for the ZBAL (zero-buffer accelerate library) path (NPU only). | 0 |
| SGLANG_ZBAL_BOOTSTRAP_URL | Bootstrap URL for the ZBAL path (NPU only). | "" |

Apple Silicon (MLX / MPS)

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_USE_MLX | Use the MLX backend on Apple Silicon. | false |
| SGLANG_MLX_USE_CUSTOM_ROPE | Use the custom RoPE kernel on MLX. | false |
| SGLANG_MLX_FUSE_SWIGLU | Fuse the SwiGLU activation on MLX. | false |
| SGLANG_MLX_CLEAR_CACHE_STEPS | Number of decode steps between mx.clear_cache() calls. 0 disables cache clearing. | 256 |
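
The "every N decode steps, 0 disables" behavior described for SGLANG_MLX_CLEAR_CACHE_STEPS can be sketched as a simple step counter. This is an illustration only: `make_cache_clearer` is a hypothetical helper, and a stand-in callback replaces the real cache-clearing call so the snippet runs without MLX installed.

```python
from typing import Callable


def make_cache_clearer(clear_cache: Callable[[], None],
                       every_n_steps: int) -> Callable[[], None]:
    """Return a per-step hook that calls clear_cache every `every_n_steps` steps.

    A value of 0 disables clearing, mirroring the documented semantics.
    """
    step = 0

    def on_decode_step() -> None:
        nonlocal step
        step += 1
        if every_n_steps > 0 and step % every_n_steps == 0:
            clear_cache()

    return on_decode_step


calls = []
hook = make_cache_clearer(lambda: calls.append("cleared"), every_n_steps=256)
for _ in range(1024):
    hook()
print(len(calls))  # 4

disabled_calls = []
disabled = make_cache_clearer(lambda: disabled_calls.append("cleared"), every_n_steps=0)
for _ in range(1024):
    disabled()
print(len(disabled_calls))  # 0
```

A smaller interval trades more frequent clearing overhead for a lower cache footprint; the right value depends on the workload.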

Multimodal (VLM)

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_VLM_CACHE_SIZE_MB | Size (MB) of the VLM feature cache. | 100 |
| SGLANG_IMAGE_MAX_PIXELS | Maximum number of pixels per image before resizing. | 12845056 |
| SGLANG_RESIZE_RESAMPLE | Resampling filter used when resizing images (e.g. bilinear, bicubic). | "" |
| SGLANG_MM_SKIP_COMPUTE_HASH | Skip computing multimodal-item hashes. | false |
| SGLANG_MM_AVOID_RETOKENIZE | For pre-tokenized (list[int]) multimodal prompts, preserve the user’s original tokens to avoid retokenization drift. | true |
| SGLANG_VIT_ENABLE_CUDA_GRAPH | Capture the vision encoder (ViT) in a CUDA graph. | false |
| SGLANG_USE_CUDA_IPC_TRANSPORT | Use CUDA IPC transport for multimodal-item tensors. | false |
| SGLANG_USE_IPC_POOL_HANDLE_CACHE | Cache CUDA IPC pool handles. | false |
| SGLANG_MM_FEATURE_CACHE_MB | Size (MB) of the multimodal feature cache. | 1024 |
| SGLANG_MM_ITEM_MEM_POOL_RECYCLE_INTERVAL_SEC | Interval (seconds) for recycling the multimodal-item memory pool. | 0.05 |
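
To give a feel for what a pixel budget like SGLANG_IMAGE_MAX_PIXELS implies, the sketch below scales an image down uniformly until its area fits. The helper `fit_within_max_pixels` is hypothetical; SGLang's actual resize logic is not described here and may, for example, also round dimensions to model-specific multiples. The default value happens to equal 16384 × 28 × 28.

```python
import math

MAX_PIXELS = 12845056  # documented default; equals 16384 * 28 * 28


def fit_within_max_pixels(width, height, max_pixels=MAX_PIXELS):
    """Return (w, h) scaled down uniformly so that w * h <= max_pixels."""
    if width * height <= max_pixels:
        return width, height
    # Scaling both sides by sqrt(budget / area) scales the area by budget / area.
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


print(fit_within_max_pixels(1920, 1080))  # (1920, 1080), already under budget

w, h = fit_within_max_pixels(8000, 6000)
print(w * h <= MAX_PIXELS)  # True
```

Flooring both sides keeps the result under the budget at the cost of a sub-pixel change in aspect ratio.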

Encoder / EPD (Multimodal Disaggregation)

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_ENCODER_MM_RECEIVER_MODE | Encoder receiver selection used by EPD paths: http or grpc. | http |
| SGLANG_ENCODER_GRPC_TIMEOUT_SECS | gRPC timeout (seconds) for encoder communication. | 60 |
| SGLANG_ENCODER_RECV_TIMEOUT | Encoder receive timeout (seconds). | 180.0 |
| SGLANG_ENCODER_SEND_TIMEOUT | Encoder send timeout (seconds). | 180.0 |
| SGLANG_ENCODER_HTTP_TIMEOUT | Encoder HTTP timeout (seconds). | 1800.0 |
| SGLANG_ENCODER_REQ_TIMEOUT | Encoder per-request timeout (seconds). | 180.0 |
| SGLANG_ENCODER_DISPATCH_MIN_ITEMS | Minimum items before the encoder dispatches a batch. | 2 |
| SGLANG_ENCODER_MAX_BATCH_SIZE | Maximum encoder batch size. | 8 |
| SGLANG_ENCODER_PREPROC_WORKERS | Number of encoder preprocessing workers. | 8 |
| SGLANG_ENCODER_IMAGE_PROCESSOR_USE_GPU | Run the image processor on the GPU. | false |
| SGLANG_ENCODER_BOOTSTRAP_HEALTH_CHECK_INTERVAL | EncoderBootstrapServer health-check interval (seconds). 0 disables it. | 10.0 |
| SGLANG_ENCODER_BOOTSTRAP_HEALTH_CHECK_TIMEOUT | EncoderBootstrapServer health-check timeout (seconds). | 2.0 |
| SGLANG_EMBEDDING_POOL_SIZE_MB | Persistent receiver-side GPU embedding pool size (MB) for Mooncake EPD transport. 0 disables (per-request register/deregister). | 4096 |
| SGLANG_ENCODER_DP_WORKER_MAX_INFLIGHT | Maximum in-flight requests per encoder DP worker. | 64 |
| SGLANG_BACKUP_PORT_BASE | Base port for elastic-EP backup ports. | 10000 |
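
With five separate encoder timeouts, it can help to print the effective values before launching. The sketch below overlays environment overrides on the defaults documented in the table; `effective_encoder_timeouts` is a hypothetical helper for inspection and does not reflect how SGLang itself reads these variables.

```python
import os

# Documented defaults (seconds), copied from the table above.
ENCODER_TIMEOUT_DEFAULTS = {
    "SGLANG_ENCODER_GRPC_TIMEOUT_SECS": 60.0,
    "SGLANG_ENCODER_RECV_TIMEOUT": 180.0,
    "SGLANG_ENCODER_SEND_TIMEOUT": 180.0,
    "SGLANG_ENCODER_HTTP_TIMEOUT": 1800.0,
    "SGLANG_ENCODER_REQ_TIMEOUT": 180.0,
}


def effective_encoder_timeouts(environ=os.environ):
    """Hypothetical helper: overlay environment overrides on the documented defaults."""
    return {
        name: float(environ.get(name, default))
        for name, default in ENCODER_TIMEOUT_DEFAULTS.items()
    }


# Example with an explicit mapping instead of the real process environment.
timeouts = effective_encoder_timeouts({"SGLANG_ENCODER_REQ_TIMEOUT": "300"})
print(timeouts["SGLANG_ENCODER_REQ_TIMEOUT"])   # 300.0
print(timeouts["SGLANG_ENCODER_HTTP_TIMEOUT"])  # 1800.0
```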

HTTP & gRPC Server

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_TIMEOUT_KEEP_ALIVE | HTTP keep-alive timeout (seconds). | 5 |
| SGLANG_UVICORN_WORKER_HEALTHCHECK_TIMEOUT | Uvicorn multiprocess supervisor per-worker health-check interval (seconds). | 10 |
| SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION | Have the /health endpoint run a generation as part of the check. | true |
| SGLANG_WARMUP_TIMEOUT | If a warmup forward batch takes longer than this many seconds, the server crashes to avoid hanging. -1 disables; increase (e.g. to 1800) to accommodate kernel JIT precompile. | -1 |
| SGLANG_ENABLE_GRPC | Enable the native gRPC server (internal, not yet user-facing). | false |
| SGLANG_GRPC_PORT | Port for the native gRPC server. | Not set |
| SGLANG_GRANIAN_PARENT_PID | Parent PID for the Granian HTTP/2 worker supervisor. | Not set |
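
When the server is started from a Python wrapper, these variables can be set on a copy of the environment rather than globally. The following is a minimal sketch: `build_server_env` is a hypothetical helper, and the values shown (1800 for warmup, 5 for keep-alive) are taken from the table above.

```python
import os


def build_server_env(base_env, warmup_timeout_s=1800, keep_alive_s=5):
    """Hypothetical helper: copy base_env and set two documented variables."""
    env = dict(base_env)
    # Environment values must be strings.
    env["SGLANG_WARMUP_TIMEOUT"] = str(warmup_timeout_s)
    env["SGLANG_TIMEOUT_KEEP_ALIVE"] = str(keep_alive_s)
    return env


env = build_server_env(os.environ)
print(env["SGLANG_WARMUP_TIMEOUT"])      # 1800
print(env["SGLANG_TIMEOUT_KEEP_ALIVE"])  # 5
```

The resulting mapping can be passed as the `env` argument of `subprocess.run` or `subprocess.Popen` together with whatever launch command is in use, which leaves the parent process environment unchanged.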

NUMA & CPU

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_NUMA_BIND_V2 | Use the v2 NUMA binding implementation. | true |
| SGLANG_AUTO_NUMA_BIND | Automatically bind processes to NUMA nodes. | false |
| SGLANG_CRASH_ON_NUMA_BIND_FAILURE | Crash if NUMA binding fails instead of warning. | false |
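
The "crash instead of warning" choice behind SGLANG_CRASH_ON_NUMA_BIND_FAILURE is a common fail-fast versus fail-soft pattern. The sketch below shows the general shape under stated assumptions: `bind_or_report` and the injected `bind_fn` are hypothetical, and the use of `OSError` as the failure type is an assumption rather than SGLang's actual behavior.

```python
import logging


def bind_or_report(bind_fn, crash_on_failure):
    """Run bind_fn; on OSError either raise (fail fast) or warn and continue.

    Returns True if binding succeeded, False if it failed softly.
    """
    try:
        bind_fn()
        return True
    except OSError as exc:
        if crash_on_failure:
            raise RuntimeError(f"NUMA binding failed: {exc}") from exc
        logging.warning("NUMA binding failed, continuing unbound: %s", exc)
        return False


def failing_bind():
    # Stand-in for a real binding call that cannot be satisfied.
    raise OSError("no such NUMA node")


print(bind_or_report(failing_bind, crash_on_failure=False))  # False
print(bind_or_report(lambda: None, crash_on_failure=True))   # True
```

Failing fast is useful in benchmarking or production setups where an unbound process would silently run slower; the soft mode suits development machines without NUMA support.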

Metrics

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_ENABLE_METRICS_DEVICE_TIMER | Enable device-timer-based metrics. | false |
| SGLANG_ENABLE_METRICS_DP_ATTENTION | Enable data-parallel attention metrics. | false |

External Models

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_EXTERNAL_MODEL_PACKAGE | Python package providing external model implementations. | "" |
| SGLANG_EXTERNAL_MM_MODEL_ARCH | External multimodal model architecture name. | "" |
| SGLANG_EXTERNAL_MM_PROCESSOR_PACKAGE | Python package providing the external multimodal processor. | "" |

Plugin System

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_PLATFORM | Platform plugin name to load. | "" |
| SGLANG_PLUGINS | Comma-separated list of plugins to load. | "" |
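
A comma-separated value such as SGLANG_PLUGINS is typically split and trimmed before use. The sketch below shows one tolerant way to do that; `parse_plugin_list` and the plugin names are hypothetical, and SGLang's own parsing may treat whitespace or empty entries differently.

```python
def parse_plugin_list(raw):
    """Split a comma-separated string, dropping whitespace and empty entries."""
    return [name.strip() for name in raw.split(",") if name.strip()]


print(parse_plugin_list("plugin_a, plugin_b,,"))  # ['plugin_a', 'plugin_b']
print(parse_plugin_list(""))                      # []
```

With this approach the empty default ("") naturally yields an empty list, so no plugins are loaded unless the variable is set.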