Environment Variables#
SGLang supports various environment variables that can be used to configure its runtime behavior. This document provides a comprehensive list and aims to stay updated over time.
Note: SGLang uses two prefixes for environment variables, SGL_ and SGLANG_, for historical reasons. Both are currently supported (different settings use different prefixes), but future versions may consolidate them.
General Configuration#
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Enable using models from ModelScope |  |
|  | Host IP address for the server |  |
|  | Port for the server | auto-detected |
|  | Custom logging configuration path | Not set |
|  | Disable request logging |  |
|  | Comma-separated list of additional HTTP headers to log when … | Not set |
|  | Timeout for health check in seconds |  |
|  | Interval (in passes) at which to collect the selected-count metric of physical experts on each layer and GPU rank; 0 disables collection |  |
|  | Forward unknown tool calls to clients instead of dropping them |  |
|  | Timeout (in seconds) for requests waiting in the queue before being scheduled |  |
|  | Timeout (in seconds) for requests running in the decode batch |  |
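Many of the entries above are simple on/off switches. A common convention is to treat `1`/`true` as enabled, as in the sketch below; the accepted value set and the variable name are assumptions for illustration, so check each variable's documentation:

```python
import os

TRUTHY = {"1", "true", "yes", "on"}

def env_flag(name: str, default: bool = False) -> bool:
    # Illustrative parser (not SGLang code): interpret common truthy
    # spellings, case-insensitively; unset falls back to the default.
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY

os.environ["SGLANG_EXAMPLE_DISABLE_REQUEST_LOG"] = "1"  # hypothetical name
print(env_flag("SGLANG_EXAMPLE_DISABLE_REQUEST_LOG"))   # True
```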
Performance Tuning#
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Control whether to use torch.inference_mode |  |
|  | Enable torch.compile |  |
|  | Enable CPU affinity setting (often set to …) |  |
|  | Allow the scheduler to overwrite longer context length requests (often set to …) |  |
|  | Control the FlashInfer availability check |  |
|  | Skip the P2P (peer-to-peer) access check |  |
|  | Set the threshold for enabling chunked prefix caching |  |
|  | Enable RoPE fusion in Fused Multi-Layer Attention |  |
|  | Disable the overlap schedule for consecutive prefill batches |  |
|  | Set the maximum number of requests per poll; a negative value means no limit |  |
|  | Disable Flash Attention 4 warmup passes (set to …) |  |
|  | Interval for DPBudget updates |  |
|  | Default weight value for the scheduler recv-skipper counter (used when the forward mode doesn't match specific modes). Only active when … |  |
|  | Weight increment for the decode forward mode in the scheduler recv skipper. Works with … |  |
|  | Weight increment for the target-verify forward mode in the scheduler recv skipper. Works with … |  |
|  | Weight increment when the forward mode is None in the scheduler recv skipper. Works with … |  |
|  | Size of the preallocated GPU buffer (in MB) for multimodal feature hashing. When set to a positive value, features are temporarily moved to the GPU for faster hash computation, then moved back to the CPU to save GPU memory; larger features benefit more from GPU hashing. Set to … |  |
|  | Enable precomputing of hash values for MultimodalDataItem |  |
|  | Use NCCL for gathering when preparing the MLP sync batch under the overlap scheduler (without this flag, Gloo is used for gathering) |  |
|  | Size of the preallocated GPU buffer (in GB) for the NCCL symmetric memory pool, to limit memory fragmentation. Only has an effect when the server argument … |  |
|  | The algorithm for custom all-reduce. Set to … | `` |
DeepGEMM Configuration (Advanced Optimization)#
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Enable just-in-time compilation of DeepGEMM kernels (enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) GPUs when the DeepGEMM package is installed; set to …) |  |
|  | Enable precompilation of DeepGEMM kernels |  |
|  | Number of workers for parallel DeepGEMM kernel compilation |  |
|  | Indicator flag used during the DeepGEMM precompile script |  |
|  | Directory for caching compiled DeepGEMM kernels |  |
|  | Use NVRTC (instead of Triton) for JIT compilation (experimental) |  |
|  | Use DeepGEMM for batched matrix multiplication (BMM) operations |  |
|  | Precompile fewer kernels during warmup, reducing warmup time from about 30 minutes to under 3 minutes; may cause performance degradation at runtime |  |
DeepEP Configuration#
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Use bfloat16 for dispatch |  |
|  | The maximum number of dispatched tokens on each GPU |  |
|  | The maximum number of dispatched tokens on each GPU for `--moe-a2a-backend=flashinfer` |  |
|  | Number of SMs used for DeepEP combine when single-batch overlap is enabled |  |
|  | Run shared experts on an alternate stream when single-batch overlap is enabled on GB200. When this flag is not set, the shared experts and the down GEMM are overlapped with DeepEP combine |  |
MORI Configuration#
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Use FP8 for dispatch |  |
|  | Maximum number of dispatch tokens per rank for MORI-EP buffer allocation |  |
|  | Threshold for switching between … |  |
|  | Number of RDMA queue pairs (QPs) used per transfer operation |  |
|  | Number of RDMA work requests posted in a single batch to each QP |  |
|  | Number of worker threads in the RDMA executor thread pool |  |
NSA Backend Configuration (For DeepSeek V3.2)#
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Fuse the operations of picking top-k logits and picking top-k indices from the page table |  |
|  | Precompute metadata that can be shared among different draft steps when MTP is enabled |  |
Memory Management#
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Enable memory pool debugging |  |
|  | Clip the max-new-tokens estimate used for memory planning |  |
|  | Maximum states for the detokenizer | Default value based on system |
|  | Enable checks for memory imbalance across tensor-parallel ranks |  |
|  | Configure the custom memory pool type for Mooncake. Supports … |  |
Model-Specific Options#
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Use the AITER optimized implementation |  |
|  | Enable MoE padding (sets the padding size to 128 if the value is …) |  |
|  | Use the Cutlass FP8 MoE kernel on Blackwell GPUs (deprecated; use `--moe-runner-backend=cutlass`) |  |
Quantization#
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Enable INT4 weight quantization |  |
|  | Apply the per-token-group quantization kernel with fused SiLU-and-mul and masked M |  |
|  | Force the use of FP8 MARLIN kernels even if other FP8 kernels are available |  |
|  | Select the backend for … | `` |
|  | Quantize q_b_proj from BF16 to FP8 when launching a DeepSeek NVFP4 checkpoint |  |
|  | Use NVFP4 for MoE dispatch (on the flashinfer_cutlass or flashinfer_cutedsl MoE runner backend) |  |
|  | Quantize the MoE of the NextN layer from BF16 to FP8 when launching a DeepSeek NVFP4 checkpoint |  |
|  | Use FlashInfer kernels when running blockwise FP8 GEMM on Blackwell GPUs. Deprecated: use … |  |
|  | Use Cutlass kernels when running blockwise FP8 GEMM on Hopper or Blackwell GPUs. Deprecated: use … |  |
Distributed Computing#
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Control blocking of non-zero-rank child processes |  |
|  | Indicates whether the current process is the first rank on its node |  |
|  | Pipeline-parallel layer partition specification | Not set |
|  | Set one visible device per process for distributed computing |  |
Testing & Debugging (Internal/CI)#
These variables are primarily used for internal testing, continuous integration, or debugging.
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Indicates whether running in a CI environment |  |
|  | Indicates running in the AMD CI environment |  |
|  | Enable retract decode testing |  |
| SGLANG_TEST_RETRACT_NO_PREFILL_BS | When SGLANG_TEST_RETRACT is enabled, no prefill is performed if the batch size exceeds SGLANG_TEST_RETRACT_NO_PREFILL_BS |  |
|  | Record step time for profiling |  |
|  | Test request time statistics |  |
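The interaction between SGLANG_TEST_RETRACT and SGLANG_TEST_RETRACT_NO_PREFILL_BS described above can be sketched as follows. The gating rule comes from the table; the value parsing and the default threshold of 0 are assumptions, and this is not SGLang's actual implementation:

```python
import os

def skip_prefill(batch_size: int) -> bool:
    # Sketch of the documented gating: when SGLANG_TEST_RETRACT is enabled,
    # prefill is skipped once the batch size exceeds
    # SGLANG_TEST_RETRACT_NO_PREFILL_BS. Parsing details are assumptions.
    if os.environ.get("SGLANG_TEST_RETRACT", "0").lower() not in ("1", "true"):
        return False
    threshold = int(os.environ.get("SGLANG_TEST_RETRACT_NO_PREFILL_BS", "0"))
    return batch_size > threshold

os.environ["SGLANG_TEST_RETRACT"] = "1"
os.environ["SGLANG_TEST_RETRACT_NO_PREFILL_BS"] = "64"
print(skip_prefill(128))  # True: 128 > 64, so prefill is skipped
print(skip_prefill(32))   # False: below the threshold, prefill proceeds
```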
Profiling & Benchmarking#
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Directory for PyTorch profiler output |  |
|  | Set … |  |
|  | Set … |  |
|  | Configure BatchSpanProcessor.schedule_delay_millis if tracing is enabled |  |
|  | Configure BatchSpanProcessor.max_export_batch_size if tracing is enabled |  |
Storage & Caching#
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Timeout period for waiting on weights |  |
|  | Disable the Outlines disk cache |  |
|  | Use SGLang's custom Triton kernel cache implementation for lower overhead (automatically enabled on CUDA) |  |
Function Calling / Tool Use#
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Controls the strictness level of tool call parsing and validation |  |