SGLang supports various environment variables that configure its runtime behavior. This document provides a comprehensive list and aims to stay up to date over time. Note: SGLang uses two prefixes for environment variables, SGL_ and SGLANG_, likely for historical reasons. Both are currently supported for different settings, but future versions may consolidate them.

General Configuration

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_USE_MODELSCOPE | Enable using models from ModelScope | false |
| SGLANG_HOST_IP | Host IP address for the server | 0.0.0.0 |
| SGLANG_PORT | Port for the server | auto-detected |
| SGLANG_LOGGING_CONFIG_PATH | Custom logging configuration path | Not set |
| SGLANG_DISABLE_REQUEST_LOGGING | Disable request logging | false |
| SGLANG_LOG_REQUEST_HEADERS | Comma-separated list of additional HTTP headers to log when --log-requests is enabled. Appended to the default x-smg-routing-key. | Not set |
| SGLANG_HEALTH_CHECK_TIMEOUT | Timeout for health checks in seconds | 20 |
| SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL | Interval (in passes) for collecting the selected-count metric of physical experts on each layer and GPU rank. 0 disables collection. | 0 |
| SGLANG_FORWARD_UNKNOWN_TOOLS | Forward unknown tool calls to clients instead of dropping them | false (drop unknown tools) |
| SGLANG_REQ_WAITING_TIMEOUT | Timeout (in seconds) for requests waiting in the queue before being scheduled | -1 |
| SGLANG_REQ_RUNNING_TIMEOUT | Timeout (in seconds) for requests running in the decode batch | -1 |
| SGLANG_CACHE_DIR | Cache directory for model weights and other data | ~/.cache/sglang |
| SGLANG_PREFETCH_BLOCK_SIZE_MB | Block size (in MB) for sequential checkpoint prefetch reads that warm the OS page cache before workers load weights via mmap | 16 |
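
The two request-timeout variables above use -1 to mean "disabled". A minimal sketch of that convention (get_timeout is a hypothetical helper, not SGLang's actual parsing code):

```python
import os

def get_timeout(name: str, default: float = -1.0):
    """Hypothetical helper: read a timeout env var where any negative
    value (the documented default of -1) means the timeout is disabled."""
    raw = os.environ.get(name, str(default))
    try:
        value = float(raw)
    except ValueError:
        return None
    return value if value >= 0 else None

# A request waits at most 30 s in the queue; the running timeout stays disabled.
os.environ["SGLANG_REQ_WAITING_TIMEOUT"] = "30"
```

With the variable set as above, `get_timeout("SGLANG_REQ_WAITING_TIMEOUT")` returns 30.0, while the unset `SGLANG_REQ_RUNNING_TIMEOUT` resolves to None (no timeout).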

Performance Tuning

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_ENABLE_TORCH_INFERENCE_MODE | Control whether to use torch.inference_mode | false |
| SGLANG_ENABLE_TORCH_COMPILE | Enable torch.compile | false |
| SGLANG_SET_CPU_AFFINITY | Enable CPU affinity setting (often set to 1 in Docker builds) | false |
| SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN | Allow the scheduler to overwrite longer context length requests (often set to 1 in Docker builds) | false |
| SGLANG_IS_FLASHINFER_AVAILABLE | Control the FlashInfer availability check | true |
| SGLANG_SKIP_P2P_CHECK | Skip the P2P (peer-to-peer) access check | false |
| SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD | Threshold for enabling chunked prefix caching | 8192 |
| SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION | Enable RoPE fusion in fused MLA (Multi-head Latent Attention) | 1 |
| SGLANG_DISABLE_CONSECUTIVE_PREFILL_OVERLAP | Disable the overlap schedule for consecutive prefill batches | false |
| SGLANG_SCHEDULER_MAX_RECV_PER_POLL | Maximum number of requests received per poll; a negative value means no limit | -1 |
| SGLANG_DISABLE_FA4_WARMUP | Disable FlashAttention 4 warmup passes (set to 1, true, yes, or on to disable) | false |
| SGLANG_DATA_PARALLEL_BUDGET_INTERVAL | Interval for DPBudget updates | 1 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DEFAULT | Default weight for the scheduler recv-skipper counter (used when the forward mode does not match a specific mode). Only active when --scheduler-recv-interval > 1. The counter accumulates weights and triggers request polling when it reaches the interval threshold. | 1000 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DECODE | Weight increment for the decode forward mode in the scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency during the decode phase. | 1 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_TARGET_VERIFY | Weight increment for the target-verify forward mode in the scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency during the verification phase. | 1 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_NONE | Weight increment when the forward mode is None in the scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency when no specific forward mode is active. | 1 |
| SGLANG_MM_BUFFER_SIZE_MB | Size (in MB) of the preallocated GPU buffer for multi-modal feature hashing. When set to a positive value, features are temporarily moved to the GPU for faster hash computation, then moved back to the CPU to save GPU memory. Larger features benefit more from GPU hashing. Set to 0 to disable. | 0 |
| SGLANG_MM_PRECOMPUTE_HASH | Enable precomputing hash values for MultimodalDataItem | false |
| SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH | Use NCCL for gathering when preparing the MLP sync batch under the overlap scheduler (without this flag, Gloo is used) | false |
| SGLANG_SYMM_MEM_PREALLOC_GB_SIZE | Size (in GB) of the preallocated GPU buffer for the NCCL symmetric memory pool, to limit memory fragmentation. Only has an effect when the server argument --enable-symm-mem is set. | -1 |
| SGLANG_CUSTOM_ALLREDUCE_ALGO | Algorithm for custom all-reduce. Set to oneshot or 1stage to force one-shot; set to twoshot or 2stage to force two-shot. | Not set |
| SGLANG_SKIP_SOFTMAX_PREFILL_THRESHOLD_SCALE_FACTOR | Skip-softmax threshold scale factor for TRT-LLM prefill attention in FlashInfer. None means standard attention. See https://arxiv.org/abs/2512.12087 | None |
| SGLANG_SKIP_SOFTMAX_DECODE_THRESHOLD_SCALE_FACTOR | Skip-softmax threshold scale factor for TRT-LLM decode attention in FlashInfer. None means standard attention. See https://arxiv.org/abs/2512.12087 | None |
| SGLANG_USE_SGL_FA3_KERNEL | Use the sgl-kernel implementation for FlashAttention v3 | true |
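
The four SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_* variables describe a weighted counter that throttles request polling. A hypothetical sketch of the mechanism (the class name and the reset-on-trigger behavior are assumptions of this sketch, not SGLang's implementation):

```python
class RecvSkipper:
    """Sketch of the recv-skipper idea: each scheduler step adds a
    per-forward-mode weight to a counter, and new requests are polled
    only when the counter reaches --scheduler-recv-interval."""

    def __init__(self, interval: int, weights: dict, default_weight: int = 1000):
        self.interval = interval          # --scheduler-recv-interval
        self.weights = weights            # e.g. {"decode": 1, "target_verify": 1}
        self.default_weight = default_weight  # SGLANG_..._WEIGHT_DEFAULT
        self.counter = 0

    def should_poll(self, forward_mode) -> bool:
        # Accumulate the weight for this step's forward mode.
        self.counter += self.weights.get(forward_mode, self.default_weight)
        if self.counter >= self.interval:
            self.counter = 0
            return True
        return False
```

With interval 4 and a decode weight of 1, polling happens every fourth decode step; an unmatched forward mode uses the large default weight (1000) and therefore triggers a poll immediately.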

DeepGEMM Configuration (Advanced Optimization)

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_ENABLE_JIT_DEEPGEMM | Enable just-in-time compilation of DeepGEMM kernels (enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) GPUs when the DeepGEMM package is installed; set to "0" to disable) | "true" |
| SGLANG_JIT_DEEPGEMM_PRECOMPILE | Enable precompilation of DeepGEMM kernels | "true" |
| SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS | Number of workers for parallel DeepGEMM kernel compilation | 4 |
| SGLANG_IN_DEEPGEMM_PRECOMPILE_STAGE | Indicator flag used during the DeepGEMM precompile script | "false" |
| SGLANG_DG_CACHE_DIR | Directory for caching compiled DeepGEMM kernels | ~/.cache/deep_gemm |
| SGLANG_DG_USE_NVRTC | Use NVRTC (instead of Triton) for JIT compilation (experimental) | "false" |
| SGLANG_USE_DEEPGEMM_BMM | Use DeepGEMM for batched matrix multiplication (BMM) operations | "false" |
| SGLANG_JIT_DEEPGEMM_FAST_WARMUP | Precompile fewer kernels during warmup, reducing warmup time from about 30 minutes to under 3 minutes. May cause performance degradation at runtime. | "false" |

DeepEP Configuration

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_DEEPEP_BF16_DISPATCH | Use bfloat16 for dispatch | "false" |
| SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK | Maximum number of dispatched tokens on each GPU | "128" |
| SGLANG_FLASHINFER_NUM_MAX_DISPATCH_TOKENS_PER_RANK | Maximum number of dispatched tokens on each GPU for --moe-a2a-backend=flashinfer | "1024" |
| SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS | Number of SMs used for DeepEP combine when single-batch overlap is enabled | "32" |
| SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO | Run shared experts on an alternate stream when single-batch overlap is enabled on GB200. When this flag is not set, shared experts and the down GEMM are overlapped with DeepEP combine. | "false" |

MORI Configuration

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_MORI_DISPATCH_DTYPE | Override the MoRI-EP dispatch quantization type. auto detects the type from the weight dtype; bf16/fp8/fp4 force the specified type for all layers. | "auto" |
| SGLANG_MORI_FP8_COMB | Use FP8 for combine | "false" |
| SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK | Maximum number of dispatch tokens per rank for MORI-EP buffer allocation | 4096 |
| SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD | Threshold for switching between the InterNodeV1 and InterNodeV1LL kernel types. InterNodeV1LL is used if SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK is less than or equal to this threshold; otherwise InterNodeV1 is used. | 256 |
| SGLANG_MORI_PREALLOC_MAX_RECV_TOKENS | Customizes the number of receive tokens preallocated per rank (otherwise derived from SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK). Valid range is 1 to world_size * SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK; the default of 0 means maximum. Smaller values reduce memory footprint, but a value that is too small can cause buffer overflow. | 0 |
| SGLANG_MORI_MOE_MAX_INPUT_TOKENS | Truncate the dispatch buffer to this many rows before MoE computation, reducing kernel overhead on padding tokens. The value must be >= the actual number of received tokens (totalRecvTokenNum); setting it too small causes incorrect results. 0 disables truncation (uses the full buffer). | 0 |
| SGLANG_MORI_QP_PER_TRANSFER | Number of RDMA queue pairs (QPs) used per transfer operation | 1 |
| SGLANG_MORI_POST_BATCH_SIZE | Number of RDMA work requests posted in a single batch to each QP | -1 |
| SGLANG_MORI_NUM_WORKERS | Number of worker threads in the RDMA executor thread pool | 1 |
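
The kernel-type switch described for SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD can be written out as a small sketch (select_mori_dispatch_kernel is a hypothetical helper, not SGLang's code):

```python
import os

def select_mori_dispatch_kernel() -> str:
    """Hypothetical helper: pick the low-latency InterNodeV1LL kernel when
    the per-rank dispatch token budget fits under the switch threshold."""
    max_tokens = int(os.environ.get(
        "SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK", "4096"))
    threshold = int(os.environ.get(
        "SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD", "256"))
    return "InterNodeV1LL" if max_tokens <= threshold else "InterNodeV1"
```

Under the documented defaults (4096 tokens per rank, threshold 256), the standard InterNodeV1 kernel is selected; lowering the token budget to 256 or below switches to InterNodeV1LL.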

NSA Backend Configuration (For DeepSeek V3.2)

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_NSA_FUSE_TOPK | Fuse the operations of picking top-k logits and picking top-k indices from the page table | true |
| SGLANG_NSA_ENABLE_MTP_PRECOMPUTE_METADATA | Precompute metadata that can be shared among different draft steps when MTP is enabled | true |
| SGLANG_USE_FUSED_METADATA_COPY | Control whether to use the fused metadata copy kernel for CUDA graph replay | true |
| SGLANG_NSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD | When the maximum KV length in the current prefill batch exceeds this value, the sparse MLA kernel is applied; otherwise it falls back to the dense MHA implementation. Defaults to the model's index top-k (2048 for DeepSeek V3.2). | 2048 |

Memory Management

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_DEBUG_MEMORY_POOL | Enable memory pool debugging | false |
| SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION | Clip the max-new-tokens estimation used for memory planning | 4096 |
| SGLANG_DETOKENIZER_MAX_STATES | Maximum number of states for the detokenizer | System-dependent |
| SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK | Enable checks for memory imbalance across tensor-parallel ranks | true |
| SGLANG_MOONCAKE_CUSTOM_MEM_POOL | Configure the custom memory pool type for Mooncake. Supports NVLINK, BAREX, and INTRA_NODE_NVLINK. If set to true, defaults to NVLINK. | None |

Model-Specific Options

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_USE_AITER | Use the AITER optimized implementation | false |
| SGLANG_MOE_PADDING | Enable MoE padding (sets the padding size to 128 if the value is 1; often set to 1 in Docker builds) | false |
| SGLANG_CUTLASS_MOE (deprecated) | Use the Cutlass FP8 MoE kernel on Blackwell GPUs (deprecated; use --moe-runner-backend=cutlass) | false |

Quantization

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_INT4_WEIGHT | Enable INT4 weight quantization | false |
| SGLANG_FORCE_FP8_MARLIN | Force FP8 MARLIN kernels even if other FP8 kernels are available | false |
| SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN | Quantize q_b_proj from BF16 to FP8 when launching a DeepSeek NVFP4 checkpoint | false |
| SGLANG_MOE_NVFP4_DISPATCH | Use NVFP4 for MoE dispatch (with the flashinfer_cutlass or flashinfer_cutedsl MoE runner backend) | "false" |
| SGLANG_NVFP4_CKPT_FP8_NEXTN_MOE | Quantize the MoE of the NextN layer from BF16 to FP8 when launching a DeepSeek NVFP4 checkpoint | false |
| SGLANG_QUANT_ALLOW_DOWNCASTING | Allow weight dtype downcasting during loading (e.g., fp32 → fp16). By default, SGLang rejects this kind of downcasting when using quantization. | false |
| SGLANG_FP8_IGNORED_LAYERS | Comma-separated list of layer names to ignore during FP8 quantization. For example: model.layers.0,model.layers.1.,qkv_proj | "" |
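
A small sketch of how the comma-separated SGLANG_FP8_IGNORED_LAYERS list might be applied (is_fp8_ignored is a hypothetical helper; whether SGLang matches by substring or strict prefix is an assumption of this sketch):

```python
import os

def is_fp8_ignored(layer_name: str) -> bool:
    """Hypothetical helper: skip FP8 quantization for any layer whose
    name contains one of the comma-separated fragments."""
    raw = os.environ.get("SGLANG_FP8_IGNORED_LAYERS", "")
    patterns = [p.strip() for p in raw.split(",") if p.strip()]
    return any(p in layer_name for p in patterns)

# With SGLANG_FP8_IGNORED_LAYERS="model.layers.0,qkv_proj", every
# sub-module of layer 0 and every qkv_proj stays unquantized.
os.environ["SGLANG_FP8_IGNORED_LAYERS"] = "model.layers.0,qkv_proj"
```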

Distributed Computing

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_BLOCK_NONZERO_RANK_CHILDREN | Control blocking of non-zero-rank child processes | 1 |
| SGLANG_IS_FIRST_RANK_ON_NODE | Indicates whether the current process is the first rank on its node | "true" |
| SGLANG_PP_LAYER_PARTITION | Pipeline-parallel layer partition specification | Not set |
| SGLANG_ONE_VISIBLE_DEVICE_PER_PROCESS | Expose one visible device per process for distributed computing | false |

PD Disaggregation — Staging Buffer (Heterogeneous TP)

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_DISAGG_STAGING_BUFFER | Enable the GPU staging buffer for heterogeneous-TP KV transfer. Required when prefill and decode use different TP/attention-TP sizes. Only for non-MLA models (e.g., GQA, MHA). | false |
| SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB | Prefill-side per-worker staging buffer size in MB. Used for gathering KV head slices before the bulk RDMA transfer. | 64 |
| SGLANG_DISAGG_STAGING_POOL_SIZE_MB | Decode-side ring-buffer pool total size in MB. A shared buffer receiving RDMA data from all prefill ranks; larger values support higher concurrency. | 4096 |
| SGLANG_STAGING_USE_TORCH | Force the PyTorch gather/scatter fallback instead of the Triton fused kernels for staging operations. Useful for debugging. | false |

Testing & Debugging (Internal/CI)

These variables are primarily used for internal testing, continuous integration, or debugging.
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_IS_IN_CI | Indicates running in a CI environment | false |
| SGLANG_IS_IN_CI_AMD | Indicates running in an AMD CI environment | false |
| SGLANG_TEST_RETRACT | Enable retract-decode testing | false |
| SGLANG_TEST_RETRACT_NO_PREFILL_BS | When SGLANG_TEST_RETRACT is enabled, no prefill is performed if the batch size exceeds this value. | 2 ** 31 |
| SGLANG_RECORD_STEP_TIME | Record step time for profiling | false |
| SGLANG_TEST_REQUEST_TIME_STATS | Test request time statistics | false |
| SGLANG_DEBUG_SYMM_MEM | Enable debug checks verifying that tensors passed to NCCL communication ops are allocated in the symmetric memory pool. Logs warnings (rank 0 only) with stack traces for any tensor not in the pool. | false |
| SGLANG_KERNEL_API_LOGLEVEL | Controls crash-debug kernel API logging: 0 disables logging, 1 logs API names, 3 logs tensor metadata, 5 adds tensor statistics, and 10 also writes pre-call dump snapshots. | 0 |
| SGLANG_KERNEL_API_LOGDEST | Destination for crash-debug kernel API logs. Use stdout, stderr, or a file path; %i is replaced with the process PID. | stdout |
| SGLANG_KERNEL_API_DUMP_DIR | Output directory for level-10 kernel API input/output dumps; %i is replaced with the process PID. | sglang_kernel_api_dumps |
| SGLANG_KERNEL_API_DUMP_INCLUDE | Comma-separated wildcard patterns for kernel API names to include in level-10 dumps | Not set |
| SGLANG_KERNEL_API_DUMP_EXCLUDE | Comma-separated wildcard patterns for kernel API names to exclude from level-10 dumps | Not set |
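
The %i substitution used by SGLANG_KERNEL_API_LOGDEST and SGLANG_KERNEL_API_DUMP_DIR can be sketched as follows (open_kernel_api_log is a hypothetical helper, not SGLang's implementation):

```python
import os
import sys

def open_kernel_api_log():
    """Hypothetical helper: resolve SGLANG_KERNEL_API_LOGDEST, substituting
    the process PID for %i so concurrent ranks write to separate files."""
    dest = os.environ.get("SGLANG_KERNEL_API_LOGDEST", "stdout")
    if dest == "stdout":
        return sys.stdout
    if dest == "stderr":
        return sys.stderr
    # File path: expand %i to the PID and append.
    return open(dest.replace("%i", str(os.getpid())), "a")
```

Setting the variable to, say, a path ending in kapi.%i.log yields one log file per process, while the default falls through to stdout.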

Profiling & Benchmarking

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_TORCH_PROFILER_DIR | Directory for PyTorch profiler output | /tmp |
| SGLANG_PROFILE_WITH_STACK | Set the with_stack option (bool) for the PyTorch profiler (capture stack traces) | true |
| SGLANG_PROFILE_RECORD_SHAPES | Set the record_shapes option (bool) for the PyTorch profiler (record tensor shapes) | true |
| SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS | Configure BatchSpanProcessor.schedule_delay_millis when tracing is enabled | 500 |
| SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE | Configure BatchSpanProcessor.max_export_batch_size when tracing is enabled | 64 |

Storage & Caching

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_WAIT_WEIGHTS_READY_TIMEOUT | Timeout period for waiting on weights | 120 |
| SGLANG_DISABLE_OUTLINES_DISK_CACHE | Disable the Outlines disk cache | false |
| SGLANG_USE_CUSTOM_TRITON_KERNEL_CACHE | Use SGLang's custom Triton kernel cache implementation for lower overhead (automatically enabled on CUDA) | false |
| SGLANG_HICACHE_DECODE_OFFLOAD_STRIDE | Decode-side incremental KV cache offload stride. Rounded down to a multiple of --page-size (minimum --page-size). If unset, invalid, or <= 0, falls back to --page-size. | Not set (uses --page-size) |
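
The rounding and fallback rule for SGLANG_HICACHE_DECODE_OFFLOAD_STRIDE can be written out as a small sketch (resolve_offload_stride is a hypothetical helper):

```python
def resolve_offload_stride(raw, page_size: int) -> int:
    """Hypothetical helper: round the stride down to a multiple of
    --page-size, never below --page-size; an unset, invalid, or <= 0
    value falls back to --page-size."""
    try:
        stride = int(raw) if raw is not None else 0
    except (TypeError, ValueError):
        stride = 0
    if stride <= 0:
        return page_size          # fallback: use --page-size
    return max(page_size, (stride // page_size) * page_size)
```

For example, with a page size of 32, a configured stride of 100 is rounded down to 96, while a stride of 10 is clamped up to the page size.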

Function Calling / Tool Use

| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_TOOL_STRICT_LEVEL | Controls the strictness of tool-call parsing and validation. <br>Level 0: Off - no strict validation. <br>Level 1: Function strict - enables structural tag constraints for all tools (even if none have strict=True set). <br>Level 2: Parameter strict - enforces strict parameter validation for all tools, treating them as if they all had strict=True set. | 0 |
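
The three levels map naturally onto an integer enum. A minimal sketch (the enum and member names are assumptions of this sketch; only the numeric values come from the documentation above):

```python
import os
from enum import IntEnum

class ToolStrictLevel(IntEnum):
    """Hypothetical names for the documented levels 0-2."""
    OFF = 0               # Level 0: no strict validation
    FUNCTION_STRICT = 1   # Level 1: structural tag constraints for all tools
    PARAMETER_STRICT = 2  # Level 2: strict parameter validation for all tools

def current_tool_strict_level() -> ToolStrictLevel:
    return ToolStrictLevel(int(os.environ.get("SGLANG_TOOL_STRICT_LEVEL", "0")))
```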