Environment Variables

SGLang reads a number of environment variables that configure its runtime behavior. This document lists them by area and is updated as variables are added or changed.

Note: SGLang uses two prefixes for environment variables: SGL_ and SGLANG_. This is likely due to historical reasons. While both are currently supported for different settings, future versions might consolidate them.
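
To check which of these variables are set in a given process, it can help to filter the environment by the two prefixes. The sketch below is illustrative only; the helper name `sglang_env_vars` is hypothetical and not part of SGLang.

```python
import os

# The two prefixes mentioned in the note above.
PREFIXES = ("SGL_", "SGLANG_")


def sglang_env_vars(environ=None):
    """Return the variables whose names start with SGL_ or SGLANG_, sorted by name."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in sorted(environ.items()) if k.startswith(PREFIXES)}


if __name__ == "__main__":
    # Print every matching variable set in the current process.
    for name, value in sglang_env_vars().items():
        print(f"{name}={value}")
```

Passing a plain dictionary instead of the real environment makes the helper easy to try without changing any process state.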

General Configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_USE_MODELSCOPE | Enable using models from ModelScope | false |
| SGLANG_HOST_IP | Host IP address for the server | 0.0.0.0 |
| SGLANG_PORT | Port for the server | auto-detected |
| SGLANG_LOGGING_CONFIG_PATH | Custom logging configuration path | Not set |
| SGLANG_DISABLE_REQUEST_LOGGING | Disable request logging | false |
| SGLANG_LOG_REQUEST_HEADERS | Comma-separated list of additional HTTP headers to log when --log-requests is enabled. Appends to the default x-smg-routing-key. | Not set |
| SGLANG_HEALTH_CHECK_TIMEOUT | Timeout for health check in seconds | 20 |
| SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL | Interval, in forward passes, at which to collect the selected-count metric of physical experts on each layer and GPU rank. 0 disables collection. | 0 |
| SGLANG_FORWARD_UNKNOWN_TOOLS | Forward unknown tool calls to clients instead of dropping them | false (drop unknown tools) |
| SGLANG_REQ_WAITING_TIMEOUT | Timeout (in seconds) for requests waiting in the queue before being scheduled | -1 |
| SGLANG_REQ_RUNNING_TIMEOUT | Timeout (in seconds) for requests running in the decode batch | -1 |
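
These variables are read from the process environment, so one way to apply them is to set them for the server process at launch. A minimal sketch, assuming the server is started with `python -m sglang.launch_server`; the helper name `build_launch` is hypothetical, the model path is a placeholder, and the timeout values are arbitrary examples, not recommendations.

```python
import os
import sys


def build_launch(model_path, overrides, base_env=None):
    """Return (command, env) for starting a server with extra environment variables.

    `overrides` maps variable names to values; values are converted to strings
    because environment variables are always strings.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update({name: str(value) for name, value in overrides.items()})
    command = [sys.executable, "-m", "sglang.launch_server", "--model-path", model_path]
    return command, env


command, env = build_launch(
    "path/to/model",  # placeholder, replace with a real model path
    {
        "SGLANG_HEALTH_CHECK_TIMEOUT": 60,  # seconds (example value)
        "SGLANG_REQ_WAITING_TIMEOUT": 300,  # seconds (example value; the default is -1)
    },
)
# To actually start the server: subprocess.run(command, env=env, check=True)
```

Building the environment as a copy keeps the overrides scoped to the child process, so the calling shell or script is left unchanged.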

Performance Tuning

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_ENABLE_TORCH_INFERENCE_MODE | Control whether to use torch.inference_mode | false |
| SGLANG_ENABLE_TORCH_COMPILE | Enable torch.compile | true |
| SGLANG_SET_CPU_AFFINITY | Enable CPU affinity setting (often set to 1 in Docker builds) | 0 |
| SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN | Allows the scheduler to overwrite longer context length requests (often set to 1 in Docker builds) | 0 |
| SGLANG_IS_FLASHINFER_AVAILABLE | Control FlashInfer availability check | true |
| SGLANG_SKIP_P2P_CHECK | Skip P2P (peer-to-peer) access check | false |
| SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD | Sets the threshold for enabling chunked prefix caching | 8192 |
| SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION | Enable RoPE fusion in fused Multi-head Latent Attention (MLA) | 1 |
| SGLANG_DISABLE_CONSECUTIVE_PREFILL_OVERLAP | Disable overlap schedule for consecutive prefill batches | false |
| SGLANG_SCHEDULER_MAX_RECV_PER_POLL | Maximum number of requests received per poll; a negative value means no limit | -1 |
| SGLANG_DISABLE_FA4_WARMUP | Disable Flash Attention 4 warmup passes (set to 1, true, yes, or on to disable) | false |
| SGLANG_DATA_PARALLEL_BUDGET_INTERVAL | Interval for DPBudget updates | 1 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DEFAULT | Default weight value for the scheduler recv skipper counter (used when the forward mode does not match one of the specific modes). Only active when --scheduler-recv-interval > 1. The counter accumulates weights and triggers request polling when it reaches the interval threshold. | 1000 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DECODE | Weight increment for decode forward mode in the scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency during the decode phase. | 1 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_VERIFY | Weight increment for target verify forward mode in the scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency during the verification phase. | 1 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_NONE | Weight increment when the forward mode is None in the scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency when no specific forward mode is active. | 1 |
| SGLANG_MM_BUFFER_SIZE_MB | Size of the preallocated GPU buffer (in MB) for multi-modal feature hashing optimization. When set to a positive value, temporarily moves features to GPU for faster hash computation, then moves them back to CPU to save GPU memory. Larger features benefit more from GPU hashing. Set to 0 to disable. | 0 |
| SGLANG_MM_PRECOMPUTE_HASH | Enable precomputing of hash values for MultimodalDataItem | false |
| SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH | Use NCCL for gathering when preparing the MLP sync batch under the overlap scheduler (without this flag, Gloo is used for gathering) | false |
| SGLANG_SYMM_MEM_PREALLOC_GB_SIZE | Size of the preallocated GPU buffer (in GB) for the NCCL symmetric memory pool, used to limit memory fragmentation. Only takes effect when the server argument --enable-symm-mem is set. | 4 |
| SGLANG_CUSTOM_ALLREDUCE_ALGO | Algorithm for custom all-reduce. Set to oneshot or 1stage to force one-shot; set to twoshot or 2stage to force two-shot. | (empty) |
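
The recv skipper rows above describe a weighted counter: each forward pass adds a mode-specific weight, and the scheduler polls for new requests once the counter reaches the --scheduler-recv-interval threshold. The sketch below illustrates that description only. The class name, the mode labels, and the reset-to-zero behavior after a poll are assumptions, not SGLang's implementation.

```python
import os


class RecvSkipperSketch:
    """Illustrative weighted counter that decides when to poll for requests."""

    def __init__(self, recv_interval, environ=None):
        environ = os.environ if environ is None else environ
        self.recv_interval = recv_interval
        # Fallback values mirror the defaults documented above.
        self.default_weight = int(
            environ.get("SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DEFAULT", "1000")
        )
        # The dictionary keys are hypothetical labels for forward modes.
        self.weights = {
            "decode": int(environ.get("SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DECODE", "1")),
            "target_verify": int(environ.get("SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_VERIFY", "1")),
            None: int(environ.get("SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_NONE", "1")),
        }
        self.counter = 0

    def should_poll(self, forward_mode):
        """Add the weight for this pass; return True once the threshold is reached."""
        self.counter += self.weights.get(forward_mode, self.default_weight)
        if self.counter >= self.recv_interval:
            self.counter = 0  # assumption: the counter restarts after each poll
            return True
        return False
```

Under these assumptions, an interval of 4 with the default weights polls on every fourth decode pass, while any mode that falls back to the default weight of 1000 triggers a poll immediately.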

DeepGEMM Configuration (Advanced Optimization)

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_ENABLE_JIT_DEEPGEMM | Enable Just-In-Time compilation of DeepGEMM kernels (enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) GPUs when the DeepGEMM package is installed; set to "0" to disable) | "true" |
| SGLANG_JIT_DEEPGEMM_PRECOMPILE | Enable precompilation of DeepGEMM kernels | "true" |
| SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS | Number of workers for parallel DeepGEMM kernel compilation | 4 |
| SGLANG_IN_DEEPGEMM_PRECOMPILE_STAGE | Indicator flag used during the DeepGEMM precompile script | "false" |
| SGLANG_DG_CACHE_DIR | Directory for caching compiled DeepGEMM kernels | ~/.cache/deep_gemm |
| SGLANG_DG_USE_NVRTC | Use NVRTC (instead of Triton) for JIT compilation (Experimental) | "0" |
| SGLANG_USE_DEEPGEMM_BMM | Use DeepGEMM for Batched Matrix Multiplication (BMM) operations | "false" |
| SGLANG_JIT_DEEPGEMM_FAST_WARMUP | Precompile fewer kernels during warmup, which reduces the warmup time from 30 min to less than 3 min. May degrade runtime performance. | "false" |

DeepEP Configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_DEEPEP_BF16_DISPATCH | Use Bfloat16 for dispatch | "false" |
| SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK | The maximum number of dispatched tokens on each GPU | "128" |
| SGLANG_FLASHINFER_NUM_MAX_DISPATCH_TOKENS_PER_RANK | The maximum number of dispatched tokens on each GPU for --moe-a2a-backend=flashinfer | "1024" |
| SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS | Number of SMs used for DeepEP combine when single batch overlap is enabled | "32" |
| SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO | Run shared experts on an alternate stream when single batch overlap is enabled on GB200. When this flag is not set, shared experts and the down GEMM are both overlapped with DeepEP combine. | "false" |

MORI Configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_MORI_FP8_DISP | Use FP8 for dispatch | "false" |
| SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK | Maximum number of dispatch tokens per rank for MORI-EP buffer allocation | 4096 |
| SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD | Threshold for switching between InterNodeV1 and InterNodeV1LL kernel types. InterNodeV1LL is used if SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK is less than or equal to this threshold; otherwise, InterNodeV1 is used. | 256 |
| SGLANG_MORI_QP_PER_TRANSFER | Number of RDMA Queue Pairs (QPs) used per transfer operation | 1 |
| SGLANG_MORI_POST_BATCH_SIZE | Number of RDMA work requests posted in a single batch to each QP | -1 |
| SGLANG_MORI_NUM_WORKERS | Number of worker threads in the RDMA executor thread pool | 1 |
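
The kernel-switch rule described for SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD is a single comparison between two variables. The sketch below restates that rule with the documented defaults; the function name is hypothetical and the real selection happens inside SGLang.

```python
import os


def select_mori_inter_node_kernel(environ=None):
    """Return the kernel type name implied by the two MORI variables."""
    environ = os.environ if environ is None else environ
    max_tokens = int(environ.get("SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK", "4096"))
    threshold = int(environ.get("SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD", "256"))
    # At or below the threshold, InterNodeV1LL is used; above it, InterNodeV1.
    return "InterNodeV1LL" if max_tokens <= threshold else "InterNodeV1"
```

With both variables left at their defaults (4096 and 256), the rule selects InterNodeV1, so InterNodeV1LL only applies after lowering the token limit or raising the threshold.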

NSA Backend Configuration (For DeepSeek V3.2)

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_NSA_FUSE_TOPK | Fuse selecting the top-k logits with selecting the top-k indices from the page table | true |
| SGLANG_NSA_ENABLE_MTP_PRECOMPUTE_METADATA | Precompute metadata that can be shared among different draft steps when MTP is enabled | true |

Memory Management

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_DEBUG_MEMORY_POOL | Enable memory pool debugging | false |
| SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION | Clip max new tokens estimation for memory planning | 4096 |
| SGLANG_DETOKENIZER_MAX_STATES | Maximum states for detokenizer | Default value based on system |
| SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK | Enable checks for memory imbalance across Tensor Parallel ranks | true |
| SGLANG_MOONCAKE_CUSTOM_MEM_POOL | Configure the custom memory pool type for Mooncake. Supports NVLINK, BAREX, INTRA_NODE_NVLINK. If set to true, it defaults to NVLINK. | None |

Model-Specific Options

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_USE_AITER | Use the AITER optimized implementation | false |
| SGLANG_MOE_PADDING | Enable MoE padding (sets padding size to 128 if value is 1, often set to 1 in Docker builds) | 0 |
| SGLANG_CUTLASS_MOE (deprecated) | Use Cutlass FP8 MoE kernel on Blackwell GPUs (deprecated, use --moe-runner-backend=cutlass) | false |

Quantization

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_INT4_WEIGHT | Enable INT4 weight quantization | false |
| SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2 | Use the per-token-group quantization kernel with fused SiLU-and-mul and masked m | false |
| SGLANG_FORCE_FP8_MARLIN | Force using FP8 MARLIN kernels even if other FP8 kernels are available | false |
| SGLANG_FLASHINFER_FP4_GEMM_BACKEND (deprecated) | Select backend for mm_fp4 on Blackwell GPUs. DEPRECATED: Please use --fp4-gemm-backend instead. | (empty) |
| SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN | Quantize q_b_proj from BF16 to FP8 when launching a DeepSeek NVFP4 checkpoint | false |
| SGLANG_MOE_NVFP4_DISPATCH | Use NVFP4 for MoE dispatch (with the flashinfer_cutlass or flashinfer_cutedsl MoE runner backend) | "false" |
| SGLANG_NVFP4_CKPT_FP8_NEXTN_MOE | Quantize the MoE of the nextn layer from BF16 to FP8 when launching a DeepSeek NVFP4 checkpoint | false |
| SGLANG_ENABLE_FLASHINFER_FP8_GEMM (deprecated) | Use FlashInfer kernels when running blockwise FP8 GEMM on Blackwell GPUs. DEPRECATED: Please use --fp8-gemm-backend=flashinfer_trtllm instead. | false |
| SGLANG_SUPPORT_CUTLASS_BLOCK_FP8 (deprecated) | Use Cutlass kernels when running blockwise FP8 GEMM on Hopper or Blackwell GPUs. DEPRECATED: Please use --fp8-gemm-backend=cutlass instead. | false |

Distributed Computing

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_BLOCK_NONZERO_RANK_CHILDREN | Control blocking of non-zero rank children processes | 1 |
| SGLANG_IS_FIRST_RANK_ON_NODE | Indicates if the current process is the first rank on its node | "true" |
| SGLANG_PP_LAYER_PARTITION | Pipeline parallel layer partition specification | Not set |
| SGLANG_ONE_VISIBLE_DEVICE_PER_PROCESS | Set one visible device per process for distributed computing | false |
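
Boolean defaults in these tables appear in several spellings (1, "true", false). A minimal sketch of lenient flag parsing, assuming the truthy spellings documented for SGLANG_DISABLE_FA4_WARMUP (1, true, yes, on) apply more widely; the helper `env_flag` is hypothetical, and SGLang's own parsing may differ from variable to variable.

```python
import os

# Truthy spellings taken from the SGLANG_DISABLE_FA4_WARMUP description.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Interpret an environment variable as a boolean.

    Unset variables fall back to `default`; any value outside TRUTHY
    (case-insensitive, surrounding whitespace ignored) counts as False.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY
```

Treating every unrecognized value as False is a deliberate simplification here; a stricter variant could raise an error on unknown spellings instead.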

Testing & Debugging (Internal/CI)

These variables are primarily used for internal testing, continuous integration, or debugging.

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_IS_IN_CI | Indicates if running in a CI environment | false |
| SGLANG_IS_IN_CI_AMD | Indicates if running in the AMD CI environment | 0 |
| SGLANG_TEST_RETRACT | Enable retract decode testing | false |
| SGLANG_TEST_RETRACT_NO_PREFILL_BS | When SGLANG_TEST_RETRACT is enabled, no prefill is performed if the batch size exceeds SGLANG_TEST_RETRACT_NO_PREFILL_BS. | 2 ** 31 |
| SGLANG_RECORD_STEP_TIME | Record step time for profiling | false |
| SGLANG_TEST_REQUEST_TIME_STATS | Test request time statistics | false |

Profiling & Benchmarking

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_TORCH_PROFILER_DIR | Directory for PyTorch profiler output | /tmp |
| SGLANG_PROFILE_WITH_STACK | Set the with_stack option (bool) for the PyTorch profiler (capture stack traces) | true |
| SGLANG_PROFILE_RECORD_SHAPES | Set the record_shapes option (bool) for the PyTorch profiler (record shapes) | true |
| SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS | Sets BatchSpanProcessor.schedule_delay_millis when tracing is enabled | 500 |
| SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE | Sets BatchSpanProcessor.max_export_batch_size when tracing is enabled | 64 |

Storage & Caching

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_WAIT_WEIGHTS_READY_TIMEOUT | Timeout period for waiting on weights | 120 |
| SGLANG_DISABLE_OUTLINES_DISK_CACHE | Disable Outlines disk cache | true |
| SGLANG_USE_CUSTOM_TRITON_KERNEL_CACHE | Use SGLang's custom Triton kernel cache implementation for lower overhead (automatically enabled on CUDA) | false |

Function Calling / Tool Use

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_TOOL_STRICT_LEVEL | Controls the strictness level of tool call parsing and validation. Level 0: Off, no strict validation. Level 1: Function strict, enables structural tag constraints for all tools (even if none have strict=True set). Level 2: Parameter strict, enforces strict parameter validation for all tools, treating them as if they all have strict=True set. | 0 |
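
The three strictness levels map naturally onto an ordered enumeration. The sketch below is one way to model them when reasoning about a configuration; the enum, the helper names, and the assumption that level 2 also includes the level 1 behavior are illustrative and not taken from SGLang's code.

```python
import os
from enum import IntEnum


class ToolStrictLevel(IntEnum):
    OFF = 0               # no strict validation
    FUNCTION_STRICT = 1   # structural tag constraints for all tools
    PARAMETER_STRICT = 2  # all tools treated as if strict=True were set


def read_tool_strict_level(environ=None):
    """Parse SGLANG_TOOL_STRICT_LEVEL; values outside 0-2 raise ValueError."""
    environ = os.environ if environ is None else environ
    return ToolStrictLevel(int(environ.get("SGLANG_TOOL_STRICT_LEVEL", "0")))


def structural_tags_for_all_tools(level):
    # Assumption: higher levels include the behavior of lower ones.
    return level >= ToolStrictLevel.FUNCTION_STRICT


def all_tools_treated_as_strict(level):
    return level >= ToolStrictLevel.PARAMETER_STRICT
```

Using an `IntEnum` keeps the levels comparable with `>=` and rejects unsupported values at parse time rather than later in request handling.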