SGLANG_. The legacy SGL_ prefix is deprecated: any SGL_* variable set in the environment is automatically rewritten to its SGLANG_* equivalent at import time with a deprecation warning, and the alias will be removed in a future release. A few variables keep an upstream/vendor prefix (e.g. MOONCAKE_*, ASCEND_*) because that is their canonical name.
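The alias handling described above can be pictured with a small sketch. This is not SGLang's actual implementation: the helper name `rewrite_legacy_prefix` is hypothetical, and two behaviours are assumptions made for illustration (an explicitly set `SGLANG_*` value takes precedence over the alias, and the legacy key is removed after the rewrite).

```python
import warnings


def rewrite_legacy_prefix(env, old="SGL_", new="SGLANG_"):
    """Move every SGL_* entry of `env` to its SGLANG_* name, warning once per key."""
    for key in list(env):  # copy the keys because the mapping is mutated below
        if not key.startswith(old):
            continue
        target = new + key[len(old):]
        warnings.warn(
            f"{key} is deprecated; use {target} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: a value already set under the new name wins over the alias.
        env.setdefault(target, env[key])
        # Assumption: the legacy name is dropped once it has been rewritten.
        del env[key]
    return env
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to try without touching the real process environment.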
General Configuration
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_USE_MODELSCOPE | Enable using models from ModelScope | false |
SGLANG_HOST_IP | Host IP address for the server | 0.0.0.0 |
SGLANG_PORT | Port for the server | auto-detected |
SGLANG_LOGGING_CONFIG_PATH | Custom logging configuration path | Not set |
SGLANG_LOG_REQUEST_HEADERS | Comma-separated list of additional HTTP headers to log when --log-requests is enabled. Appends to the default x-smg-routing-key. | Not set |
SGLANG_HEALTH_CHECK_TIMEOUT | Timeout for health check in seconds | 20 |
SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL | The interval of passes to collect the metric of selected count of physical experts on each layer and GPU rank. 0 means disabled. | 0 |
SGLANG_EPLB_ROCM_P2P_BATCH_CHUNK_SIZE | Number of logical expert IDs per batch when submitting P2P ops during EPLB rebalance on ROCm. Smaller values prevent RCCL GPU-side accumulation hangs but increase overhead. | 32 |
SGLANG_FORWARD_UNKNOWN_TOOLS | Forward unknown tool calls to clients instead of dropping them | false (drop unknown tools) |
SGLANG_REQ_WAITING_TIMEOUT | Timeout (in seconds) for requests waiting in the queue before being scheduled | -1 |
SGLANG_REQ_RUNNING_TIMEOUT | Timeout (in seconds) for requests running in the decode batch | -1 |
SGLANG_CACHE_DIR | Cache directory for model weights and other data | ~/.cache/sglang |
SGLANG_PREFETCH_BLOCK_SIZE_MB | Block size (in MB) for sequential checkpoint prefetch reads that warm the OS page cache before workers load weights via mmap | 16 |
Performance Tuning
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_ENABLE_TORCH_INFERENCE_MODE | Control whether to use torch.inference_mode | false |
SGLANG_ENABLE_TORCH_COMPILE | Enable torch.compile | false |
SGLANG_SET_CPU_AFFINITY | Enable CPU affinity setting (often set to 1 in Docker builds) | false |
SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN | Allows the scheduler to overwrite longer context length requests (often set to 1 in Docker builds) | false |
SGLANG_IS_FLASHINFER_AVAILABLE | Control FlashInfer availability check | true |
SGLANG_FLASHINFER_AUTOTUNE_CACHE | Reuse persisted FlashInfer autotune results from SGLANG_CACHE_DIR across runs. Set to 0 to force re-autotuning on every startup; the fresh result is written to a runs/<rank>.<timestamp>.json sibling file (the canonical cache is left untouched). | true |
SGLANG_SKIP_P2P_CHECK | Skip P2P (peer-to-peer) access check | false |
SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD | Sets the threshold for enabling chunked prefix caching | 8192 |
SGLANG_MAX_KV_CHUNK_CAPACITY | Maximum number of tokens in each KV chunk for DeepSeek MHA chunked prefix cache | 131072 |
SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION | Enable RoPE fusion in Fused Multi-Layer Attention | 1 |
SGLANG_DISABLE_CONSECUTIVE_PREFILL_OVERLAP | Disable overlap schedule for consecutive prefill batches | false |
SGLANG_SCHEDULER_MAX_RECV_PER_POLL | Set the maximum number of requests per poll, with a negative value indicating no limit | -1 |
SGLANG_DATA_PARALLEL_BUDGET_INTERVAL | Interval for DPBudget updates | 1 |
SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DEFAULT | Default weight value for scheduler recv skipper counter (used when forward mode doesn’t match specific modes). Only active when --scheduler-recv-interval > 1. The counter accumulates weights and triggers request polling when reaching the interval threshold. | 1000 |
SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DECODE | Weight increment for decode forward mode in scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency during decode phase. | 1 |
SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_TARGET_VERIFY | Weight increment for target verify forward mode in scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency during verification phase. | 1 |
SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_NONE | Weight increment when forward mode is None in scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency when no specific forward mode is active. | 1 |
SGLANG_MM_BUFFER_SIZE_MB | Size of preallocated GPU buffer (in MB) for multi-modal feature hashing optimization. When set to a positive value, temporarily moves features to GPU for faster hash computation, then moves them back to CPU to save GPU memory. Larger features benefit more from GPU hashing. Set to 0 to disable. | 0 |
SGLANG_MM_PRECOMPUTE_HASH | Enable precomputing of hash values for MultimodalDataItem | false |
SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH | Enable NCCL for gathering when preparing mlp sync batch under overlap scheduler (without this flag gloo is used for gathering) | false |
SGLANG_SYMM_MEM_PREALLOC_GB_SIZE | Size of preallocated GPU buffer (in GB) for NCCL symmetric memory pool to limit memory fragmentation. Only has an effect when the server arg --enable-symm-mem is set. | -1 |
SGLANG_SKIP_SOFTMAX_PREFILL_THRESHOLD_SCALE_FACTOR | Skip-softmax threshold scale factor for TRT-LLM prefill attention in flashinfer. None means standard attention. See https://arxiv.org/abs/2512.12087 | None |
SGLANG_SKIP_SOFTMAX_DECODE_THRESHOLD_SCALE_FACTOR | Skip-softmax threshold scale factor for TRT-LLM decode attention in flashinfer. None means standard attention. See https://arxiv.org/abs/2512.12087 | None |
SGLANG_USE_SGL_FA3_KERNEL | Use sgl-kernel implementation for FlashAttention v3 | true |
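The recv-skipper weights in the table above describe a counter that accumulates a per-mode weight on every scheduler pass and polls for new requests once it reaches the `--scheduler-recv-interval` threshold. A minimal sketch of that idea follows. The class name, the string mode keys, and the reset-to-zero behaviour after a poll are assumptions made for illustration, not SGLang's code.

```python
class RecvSkipper:
    """Decide whether the scheduler should poll for new requests on this pass."""

    def __init__(self, interval, weights, default_weight=1000):
        self.interval = interval              # value of --scheduler-recv-interval
        self.weights = weights                # forward mode -> weight increment
        self.default_weight = default_weight  # used for modes without an entry
        self.counter = 0

    def should_poll(self, forward_mode):
        # The skipper is only active when the interval is greater than 1.
        if self.interval <= 1:
            return True
        self.counter += self.weights.get(forward_mode, self.default_weight)
        if self.counter >= self.interval:
            self.counter = 0  # assumption: the counter restarts after a poll
            return True
        return False
```

With an interval of 3 and a decode weight of 1, polling happens on every third decode pass, while a mode that falls back to the large default weight triggers a poll immediately.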
DeepGEMM Configuration (Advanced Optimization)
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_ENABLE_JIT_DEEPGEMM | Enable Just-In-Time compilation of DeepGEMM kernels (enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) GPUs when the DeepGEMM package is installed; set to "0" to disable) | "true" |
SGLANG_JIT_DEEPGEMM_PRECOMPILE | Enable precompilation of DeepGEMM kernels | "true" |
SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS | Number of workers for parallel DeepGEMM kernel compilation | 4 |
SGLANG_IN_DEEPGEMM_PRECOMPILE_STAGE | Indicator flag used during the DeepGEMM precompile script | "false" |
SGLANG_DG_CACHE_DIR | Directory for caching compiled DeepGEMM kernels | ~/.cache/deep_gemm |
SGLANG_DG_USE_NVRTC | Use NVRTC (instead of Triton) for JIT compilation (Experimental) | "false" |
SGLANG_USE_DEEPGEMM_BMM | Use DeepGEMM for Batched Matrix Multiplication (BMM) operations | "false" |
SGLANG_JIT_DEEPGEMM_FAST_WARMUP | Precompile fewer kernels during warmup, which reduces the warmup time from 30 min to less than 3 min. Might cause performance degradation at runtime. | "false" |
DeepEP Configuration
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK | The maximum number of dispatched tokens on each GPU | "128" |
SGLANG_FLASHINFER_NUM_MAX_DISPATCH_TOKENS_PER_RANK | The maximum number of dispatched tokens on each GPU for --moe-a2a-backend=flashinfer | "1024" |
SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS | Number of SMs used for DeepEP combine when single batch overlap is enabled | "32" |
SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO | Run shared experts on an alternate stream when single batch overlap is enabled on GB200. When this flag is not set, shared experts and down GEMM are overlapped together with DeepEP combine. | "false" |
SGLANG_DISABLED_MODEL_ARCHS | Comma-separated list of model architectures to disable from auto-registration. | Not set |
SGLANG_SORT_WEIGHT_FILES | Controls weight-file ordering for load-time I/O optimization. -1 disables sorting/staggering (original order); 0 sorts files only; a value k greater than 0 sorts and staggers per-rank order with factor k for better multi-rank I/O concurrency. | 0 |
SGLANG_RETURN_ORIGINAL_LOGPROB | Return the original (pre-temperature) logprobs instead of the post-sampling values. | false |
SGLANG_ENABLE_COLOCATED_BATCH_GEN | Enable colocated batch generation. | false |
SGLANG_ENABLE_MOE_DEFERRED_FINALIZE | Defer the MoE finalize step to overlap it with other work. | false |
SGLANG_PATCH_TOKENIZER | Patch the tokenizer to cache all_special_tokens/all_special_ids (notably for Kimi tiktoken, where ITL can otherwise regress under high batch size). | true |
SGLANG_ENABLE_LOGITS_PROCESSER_CHUNK | Process logits in chunks to reduce peak memory. | false |
SGLANG_LOGITS_PROCESSER_CHUNK_SIZE | Chunk size (in tokens) used when logits-processor chunking is enabled. | 2048 |
SGLANG_FLASHINFER_USE_PAGED | Use the paged FlashInfer attention path. | false |
SGLANG_FLASHINFER_WORKSPACE_SIZE | FlashInfer workspace size in bytes (default ≈ 384 MiB). | 402653184 |
SGLANG_PREP_IN_CUDA_GRAPH | Capture input preparation inside the CUDA graph. | true |
SGLANG_EAGER_INPUT_NO_COPY | In eager forward, wrap the ForwardBatch’s own tensors instead of copying them into the CUDA graph buffer registry (skips a per-iter device-to-device copy). | false |
SGLANG_DEEPGEMM_SANITY_CHECK | Run extra sanity checks on DeepGEMM kernels. | false |
SGLANG_DEEPGEMM_PDL | Enable Programmatic Dependent Launch (PDL) for DeepGEMM kernels. | true |
SGLANG_PP_PARALLEL_DEEPGEMM_WARMUP | Run DeepGEMM warmup in parallel across pipeline-parallel ranks. | false |
SGLANG_DISABLE_STATIC_WATERFILL | Force dynamic DeepEP waterfill with runtime EP all-reduce instead of the default static local-batch path. | false |
SGLANG_NIXL_EP_BF16_DISPATCH | Use BF16 for NIXL-EP dispatch. | false |
SGLANG_NIXL_EP_NUM_MAX_DISPATCH_TOKENS_PER_RANK | Maximum number of dispatched tokens per GPU for NIXL-EP. | 128 |
MORI Configuration
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_MORI_DISPATCH_DTYPE | Override MoRI-EP dispatch quantization type. auto uses auto-detection from weight dtype; bf16/fp8/fp4 forces the specified type for all layers | "auto" |
SGLANG_MORI_FP8_COMB | Use FP8 for combine | "false" |
MORI_DISABLE_AUTO_XGMI | Set to 0 to allow Mori to automatically use XGMI for same-node PD disaggregation when no active RDMA device is available. | unset |
SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK | Maximum number of dispatch tokens per rank for MORI-EP buffer allocation | 4096 |
SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD | Threshold for switching between InterNodeV1 and InterNodeV1LL kernel types. InterNodeV1LL is used if SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK is less than or equal to this threshold; otherwise, InterNodeV1 is used. | 256 |
SGLANG_MORI_PREALLOC_MAX_RECV_TOKENS | Custom number of tokens preallocated for a rank, bounded by SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK: the valid range is 1 to world_size*SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK, and the default 0 means the maximum. A smaller value reduces the memory footprint, but a value that is too small can cause buffer overflow. | 0 |
SGLANG_MORI_MOE_MAX_INPUT_TOKENS | Truncate the dispatch buffer to this many rows before MoE computation, reducing kernel overhead on padding tokens. The value must be >= the actual number of received tokens (totalRecvTokenNum); setting it too small causes incorrect results. 0 disables truncation (use full buffer). | 0 |
SGLANG_MORI_QP_PER_TRANSFER | Number of RDMA Queue Pairs (QPs) used per transfer operation | 1 |
SGLANG_MORI_POST_BATCH_SIZE | Number of RDMA work requests posted in a single batch to each QP | -1 |
SGLANG_MORI_NUM_WORKERS | Number of worker threads in the RDMA executor thread pool | 1 |
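The kernel-type switch described for SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD is a single comparison, shown here as a sketch. The function name and the use of plain strings for the kernel types are illustrative assumptions; the defaults are the values listed in the table.

```python
def select_inter_node_kernel(num_max_dispatch_tokens_per_rank=4096, switch_threshold=256):
    """Return the kernel type implied by the two MORI settings."""
    # InterNodeV1LL is chosen when the per-rank token limit is at or below the threshold.
    if num_max_dispatch_tokens_per_rank <= switch_threshold:
        return "InterNodeV1LL"
    return "InterNodeV1"
```

With the documented defaults (4096 tokens against a threshold of 256), the comparison selects InterNodeV1; lowering the per-rank limit to 256 or less selects InterNodeV1LL.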
DSA Backend Configuration (For DeepSeek V3.2)
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_DSA_FUSE_TOPK | Fuse the operation of picking topk logits and picking topk indices from page table. SGLANG_NSA_FUSE_TOPK is a deprecated alias. | true |
SGLANG_DSA_TOPK_FLASHINFER_DETERMINISTIC | Use deterministic FlashInfer topk kernels when --dsa-topk-backend=flashinfer. | false |
SGLANG_DSA_TOPK_FLASHINFER_TIE_BREAK | Tie-break mode for FlashInfer DSA topk when --dsa-topk-backend=flashinfer: unset disables explicit tie-breaking, small prefers the smaller candidate index for equal scores, and large prefers the larger candidate index for equal scores. Setting this variable makes FlashInfer use deterministic topk. | unset |
SGLANG_DSA_ENABLE_MTP_PRECOMPUTE_METADATA | Precompute metadata that can be shared among different draft steps when MTP is enabled. SGLANG_NSA_ENABLE_MTP_PRECOMPUTE_METADATA is a deprecated alias. | true |
SGLANG_USE_FUSED_METADATA_COPY | Control whether to use fused metadata copy kernel for cuda graph replay | true |
SGLANG_DSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD | When the maximum KV length in the current prefill batch exceeds this value, the sparse MLA kernel is applied; otherwise it falls back to the dense MHA implementation. Defaults to the index topk of the model (2048 for DeepSeek V3.2). SGLANG_NSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD is a deprecated alias. | 2048 |
SGLANG_DSA_TOPK_BROADCAST | Experimental. When enabled, broadcast the finalized NSA/DSA indexer top-k result from attention TP rank 0 to the other attention TP ranks. This can mitigate top-k mismatches in TP attention runs at the cost of some speed. | false |
SGLANG_MORI_SEND_AUX_RDMA | Send CPU-resident AUX data via RDMA instead of ZMQ TCP. | false |
SGLANG_MORI_TRANSFER_SHARDS | Number of sharded synchronous worker threads draining KV transfers; also bounds outstanding transfers (primary RDMA send-queue throttle). | 8 |
SGLANG_MORI_WAIT_POLL_MS | Poll cadence (ms) at which a transfer worker wakes to check the SLA while waiting for completion. | 1000 |
SGLANG_MORI_TRANSFER_TIMEOUT_MS | Per-transfer SLA (ms) before a KV transfer is failed; 0 disables the SLA. | 0 |
SGLANG_DSA_HIP_DISABLE_PRESHUFFLE | Disable weight pre-shuffle on the HIP DSA path. SGLANG_NSA_HIP_DISABLE_PRESHUFFLE is a deprecated alias. | false |
SGLANG_DSA_MQA_LOGITS_FREE_MEM_FRACTION | Fraction of free memory the MQA-logits step may use on the DSA path. | 0.2 |
Memory Management
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_DEBUG_MEMORY_POOL | Enable memory pool debugging | false |
SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION | Clip max new tokens estimation for memory planning | 4096 |
SGLANG_DETOKENIZER_MAX_STATES | Maximum states for detokenizer | Default value based on system |
SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK | Enable checks for memory imbalance across Tensor Parallel ranks | true |
SGLANG_MOONCAKE_CUSTOM_MEM_POOL | Configure the custom memory pool type for Mooncake. Supports NVLINK, BAREX, INTRA_NODE_NVLINK. If set to true, it defaults to NVLINK. | None |
Model-Specific Options
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_USE_AITER | Use the AITER optimized implementation | false |
SGLANG_ROCM_USE_MULTI_STREAM | Allocate alt CUDA/HIP stream on ROCm/AITER to overlap shared and routed experts in DeepseekV2 MoE. Requires the HIP env GPU_MAX_HW_QUEUES>=5 (default 4, the cap on HSA/ROCr HW queues HIP creates) so the alt stream gets its own queue instead of serializing with the main stream. Best paired with --deepep-mode low_latency so Mori’s AsyncLL kernel offloads dispatch/combine to copy engines and frees CUs. | false |
SGLANG_MOE_PADDING | Enable MoE padding (sets padding size to 128 if value is 1, often set to 1 in Docker builds) | false |
SGLANG_CUTLASS_MOE (deprecated) | Use Cutlass FP8 MoE kernel on Blackwell GPUs (deprecated, use --moe-runner-backend=cutlass) | false |
SGLANG_USE_FUSED_PARALLEL_QKNORM | Use the fused parallel QK RMSNorm kernel for MiniMax-M2.x on CUDA when attention TP size > 1 | false |
SGLANG_ENABLE_STRICT_MEM_CHECK_DURING_BUSY | Enable strict memory checks while the scheduler is busy. | 0 |
SGLANG_ENABLE_STRICT_MEM_CHECK_DURING_IDLE | Enable strict memory checks while the scheduler is idle. | true |
SGLANG_NATIVE_MOVE_KV_CACHE | Use the native implementation to move KV cache entries. | false |
SGLANG_USE_BREAKABLE_CUDA_GRAPH | Use a breakable CUDA graph so it can be interrupted/rebuilt at runtime. | false |
SGLANG_MEMORY_SAVER_CUDA_GRAPH | Allow CUDA graphs under the release/resume memory saver. | false |
SGLANG_GEMMA_OUT_OF_PLACE_POSITION_MUTATION | Use out-of-place position mutation for Gemma models. | false |
SGLANG_MAMBA_CONV_DTYPE | dtype for the Mamba convolution state. | bfloat16 |
SGLANG_MAMBA_SSM_DTYPE | dtype for the Mamba SSM state (defaults to the model dtype when unset). | Not set |
SGLANG_EMBEDDINGS_SPARSE_HEAD | Name of the sparse-embeddings head to expose for embedding models. | Not set |
SGLANG_DSV4_FP4_EXPERTS | Whether DeepSeek V4 experts use FP4. Set to false when using an FP4-to-FP8 converted DeepSeek V4 checkpoint. | true |
SGLANG_DSV4_REASONING_EFFORT | Default reasoning_effort for the DeepSeek V4 chat encoder when a request does not set it (accepts max, high; empty means unset). | "" |
Quantization
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_INT4_WEIGHT | Enable INT4 weight quantization | false |
SGLANG_FORCE_FP8_MARLIN | Force using FP8 MARLIN kernels even if other FP8 kernels are available | false |
SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN | Quantize q_b_proj from BF16 to FP8 when launching DeepSeek NVFP4 checkpoint | false |
SGLANG_MOE_NVFP4_DISPATCH | Use nvfp4 for moe dispatch (on flashinfer_cutlass or flashinfer_cutedsl moe runner backend) | "false" |
SGLANG_FLASHINFER_NVFP4_PER_TOKEN_ACTIVATION | Enable FlashInfer TRTLLM NVFP4 per-token activation scaling; ignores checkpoint activation FP32 scale by treating it as 1 | false |
FLASHINFER_NVFP4_4OVER6 | Enable FlashInfer NVFP4 4over6 scaling for NVFP4 per-token activation and online NVFP4 MoE weight quantization paths | false |
FLASHINFER_NVFP4_4OVER6_E4M3_USE_256 | Use 256 as the E4M3 scale maximum for FlashInfer NVFP4 4over6 scaling; otherwise uses 448 | false |
SGLANG_NVFP4_CKPT_FP8_NEXTN_MOE | Quantize moe of nextn layer from BF16 to FP8 when launching DeepSeek NVFP4 checkpoint | false |
SGLANG_QUANT_ALLOW_DOWNCASTING | Allow weight dtype downcasting during loading (e.g., fp32 → fp16). By default, SGLang rejects this kind of downcasting when using quantization. | false |
SGLANG_FP8_IGNORED_LAYERS | A comma-separated list of layer names to ignore during FP8 quantization. For example: model.layers.0,model.layers.1.,qkv_proj. | "" |
SGLANG_FP4_IGNORED_LAYERS | A comma-separated list of layer names to keep out of FP4 online quantization, including nvfp4_online. For example: model.layers.40,model.layers.41. | "" |
Distributed Computing
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_BLOCK_NONZERO_RANK_CHILDREN | Control blocking of non-zero rank children processes | 1 |
SGLANG_IS_FIRST_RANK_ON_NODE | Indicates if the current process is the first rank on its node | "true" |
SGLANG_PP_LAYER_PARTITION | Pipeline parallel layer partition specification | Not set |
SGLANG_ONE_VISIBLE_DEVICE_PER_PROCESS | Set one visible device per process for distributed computing | false |
SGLANG_RAY_BUNDLE_INDICES | Comma-separated bundle indices for Ray actor placement (e.g., "0,1,2,3"). Must match world_size. Enables fine-grained GPU assignment in custom placement groups. | Not set |
SGLANG_CPU_QUANTIZATION | Enable CPU-side quantization. | false |
SGLANG_USE_DYNAMIC_MXFP4_LINEAR | Use dynamic MXFP4 quantization for linear layers. | false |
USE_TRITON_W8A8_FP8_KERNEL | Use the Triton W8A8 FP8 kernel. | false |
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER | Use the shared-memory message-queue broadcaster for inter-process tensor broadcast. | true |
SGLANG_DISTRIBUTED_INIT_METHOD_OVERRIDE | Override the init method used by torch.distributed.init_process_group. Set to env:// to use an externally-created TCPStore via MASTER_ADDR/MASTER_PORT. | Not set |
SGLANG_TCP_STORE_PORT | Port for the torch.distributed TCPStore. | 29600 |
SGLANG_SYNC_TOKEN_IDS_ACROSS_TP | Synchronize sampled token ids across tensor-parallel ranks. | false |
PD Disaggregation — Staging Buffer (Heterogeneous TP)
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_DISAGG_STAGING_BUFFER | Enable GPU staging buffer for heterogeneous TP KV transfer. Required when prefill and decode use different TP/attention-TP sizes. Only for non-MLA models (e.g. GQA, MHA). | false |
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB | Prefill-side per-worker staging buffer size in MB. Used for gathering KV head slices before bulk RDMA transfer. | 64 |
SGLANG_DISAGG_STAGING_POOL_SIZE_MB | Decode-side ring buffer pool total size in MB. Shared buffer receiving RDMA data from all prefill ranks. Larger values support higher concurrency. | 4096 |
SGLANG_STAGING_USE_TORCH | Force using PyTorch gather/scatter fallback instead of Triton fused kernels for staging operations. Useful for debugging. | false |
Testing & Debugging (Internal/CI)
These variables are primarily used for internal testing, continuous integration, or debugging.

| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_IS_IN_CI | Indicates if running in CI environment | false |
SGLANG_IS_IN_CI_AMD | Indicates running in AMD CI environment | false |
SGLANG_TEST_RETRACT | Enable retract decode testing | false |
SGLANG_TEST_RETRACT_NO_PREFILL_BS | When SGLANG_TEST_RETRACT is enabled, no prefill is performed if the batch size exceeds SGLANG_TEST_RETRACT_NO_PREFILL_BS. | 2 ** 31 |
SGLANG_RECORD_STEP_TIME | Record step time for profiling | false |
SGLANG_TEST_REQUEST_TIME_STATS | Test request time statistics | false |
SGLANG_DEBUG_SYMM_MEM | Enable debug checks that verify tensors passed to NCCL communication ops are allocated in the symmetric memory pool. Logs warnings (rank 0 only) with stack traces for any tensor not in the pool. | false |
SGLANG_KERNEL_API_LOGLEVEL | Controls crash-debug kernel API logging. 0 disables logging, 1 logs API names, 3 logs tensor metadata, 5 adds tensor statistics, and 10 also writes pre-call dump snapshots. | 0 |
SGLANG_KERNEL_API_LOGDEST | Destination for crash-debug kernel API logs. Use stdout, stderr, or a file path. %i is replaced with the process PID. | stdout |
SGLANG_KERNEL_API_DUMP_DIR | Output directory for level-10 kernel API input/output dumps. %i is replaced with the process PID. | sglang_kernel_api_dumps |
SGLANG_KERNEL_API_DUMP_INCLUDE | Comma-separated wildcard patterns for kernel API names to include in level-10 dumps. | Not set |
SGLANG_KERNEL_API_DUMP_EXCLUDE | Comma-separated wildcard patterns for kernel API names to exclude from level-10 dumps. | Not set |
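The SGLANG_KERNEL_API_LOGLEVEL values read as cumulative thresholds ("5 adds", "10 also writes"), and both the log destination and the dump directory substitute the process PID for `%i`. The sketch below illustrates that reading. Both helper names are hypothetical, and treating intermediate levels such as 2 or 7 like the next lower documented level is an assumption.

```python
import os


def kernel_api_log_features(level):
    """Map a log level to the features it turns on, assuming cumulative thresholds."""
    return {
        "api_names": level >= 1,
        "tensor_metadata": level >= 3,
        "tensor_statistics": level >= 5,
        "dump_snapshots": level >= 10,
    }


def expand_pid(template, pid=None):
    """Replace every %i in a destination or directory template with the PID."""
    return template.replace("%i", str(os.getpid() if pid is None else pid))
```

For example, a destination template such as `/tmp/kernel_api_%i.log` (an illustrative path) would yield one log file per process.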
Profiling & Benchmarking
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_TORCH_PROFILER_DIR | Directory for PyTorch profiler output | /tmp |
SGLANG_PROFILE_WITH_STACK | Set with_stack option (bool) for PyTorch profiler (capture stack trace) | true |
SGLANG_PROFILE_RECORD_SHAPES | Set record_shapes option (bool) for PyTorch profiler (record shapes) | true |
SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS | Sets BatchSpanProcessor.schedule_delay_millis when tracing is enabled | 500 |
SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE | Sets BatchSpanProcessor.max_export_batch_size when tracing is enabled | 64 |
SGLANG_PROFILE_V2 | Use the v2 profiler implementation. | false |
SGLANG_DETECT_SLOW_RANK | Detect and report ranks that fall behind during collective ops. | false |
SGLANG_FORCE_SHUTDOWN | Force an immediate process-group shutdown on exit. | false |
SGLANG_PYSPY_DUMP_BEFORE_CRASH | Capture a py-spy stack dump of all processes before crashing. | true |
SGLANG_CUDA_COREDUMP | Enable CUDA coredump generation (auto-injects the required CUDA_* env vars). | false |
SGLANG_CUDA_COREDUMP_DIR | Directory for CUDA coredumps. If unset, resolves to RUNNER_TEMP in CI, else /tmp. | Not set |
SGLANG_CUDA_COREDUMP_BEFORE_CRASH | Trigger a CUDA coredump before crashing. | true |
SGLANG_CUDA_COREDUMP_BEFORE_CRASH_WAIT_SECS | Seconds to wait for the CUDA coredump to finish before exiting. | 60.0 |
Storage & Caching
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_WAIT_WEIGHTS_READY_TIMEOUT | Timeout period for waiting on weights | 120 |
SGLANG_DISABLE_OUTLINES_DISK_CACHE | Disable Outlines disk cache | false |
SGLANG_USE_CUSTOM_TRITON_KERNEL_CACHE | Use SGLang’s custom Triton kernel cache implementation for lower overheads (automatically enabled on CUDA) | false |
SGLANG_HICACHE_DECODE_OFFLOAD_STRIDE | Decode-side incremental KV cache offload stride. Rounded down to a multiple of --page-size (min is --page-size). If unset/invalid/<=0, it falls back to --page-size. | Not set (uses --page-size) |
SGLANG_HICACHE_NIXL_USE_DIRECT_IO | Enable O_DIRECT for any file-based NIXL backend (POSIX, GDS, GDS_MT, 3FS) when opening cache files (bypasses the OS page cache, reducing memory pressure and improving throughput on NVMe). Can also be disabled via --hicache-storage-backend-extra-config. Falls back to buffered I/O with a warning when O_DIRECT is unavailable on the current OS. | true |
SGLANG_HUGEPAGE_SIZE | Use huge pages for host KV cache allocations (HiCache / disaggregation offload). Valid values: 2MB (2 MiB pages via MAP_HUGE_2MB) or 1GB (1 GiB pages via MAP_HUGE_1GB). Requires huge pages to be pre-allocated on the host OS (/proc/sys/vm/nr_hugepages or /sys/kernel/mm/hugepages). If the allocation fails, the allocator logs a warning and falls back to regular page-size mmap automatically. | Not set (uses OS default page size) |
SGLANG_HICACHE_HF3FS_CONFIG_PATH | Path to the HiCache HF3FS backend config file. | Not set |
SGLANG_HICACHE_FILE_BACKEND_STORAGE_DIR | Storage directory for the HiCache file backend. | Not set |
SGLANG_HICACHE_FILE_BACKEND_MAX_SIZE | Max size for HiCache file-backend LRU eviction (accepts SI/IEC suffixes; 0 disables eviction). | Not set (eviction off) |
SGLANG_HICACHE_FILE_BACKEND_EVICTION_RATIO | Target fraction to evict down to when the file-backend max size is reached. | 0.9 |
SGLANG_HICACHE_FILE_BACKEND_MIN_FREE_SPACE | Minimum free space to keep on the file-backend volume (accepts SI/IEC suffixes). | 0 |
SGLANG_HICACHE_NIXL_BACKEND_STORAGE_DIR | Storage directory for the HiCache NIXL backend. | Not set |
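The rounding rule given for SGLANG_HICACHE_DECODE_OFFLOAD_STRIDE can be written out as a short sketch. The function name is hypothetical and the parsing is an assumption (the raw value is treated as a decimal integer string); the rounding, minimum, and fallback follow the table entry.

```python
def resolve_offload_stride(raw, page_size):
    """Turn the raw environment value into an effective offload stride."""
    try:
        stride = int(raw)
    except (TypeError, ValueError):
        return page_size  # unset or invalid: fall back to the page size
    if stride <= 0:
        return page_size  # non-positive: fall back to the page size
    # Round down to a multiple of the page size, but never below one page.
    return max(page_size, (stride // page_size) * page_size)
```

With a page size of 64, a requested stride of 200 becomes 192, and a requested stride of 10 is raised to 64.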
Function Calling / Tool Use
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_TOOL_STRICT_LEVEL | Controls the strictness level of tool call parsing and validation. <br>Level 0: Off - No strict validation <br>Level 1: Function strict - Enables structural tag constraints for all tools (even if none have strict=True set) <br>Level 2: Parameter strict - Enforces strict parameter validation for all tools, treating them as if they all have strict=True set | 0 |
SGLANG_DEFAULT_THINKING | Enable model thinking/reasoning output by default. | false |
SGLANG_MAX_THINK_TOKENS | Cap on thinking tokens. Negative means unlimited; 0 or greater caps the count. | -1 |
Logging & Observability
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_LOG_GC | Log Python garbage-collection pauses. | false |
SGLANG_LOG_FORWARD_ITERS | Log each forward iteration. | false |
SGLANG_LOG_MS | Log per-step timing in milliseconds. | false |
SGLANG_LOG_REQUEST_EXCEEDED_MS | Log requests whose processing time exceeds this many milliseconds. -1 disables. | -1 |
SGLANG_LOG_SCHEDULER_STATUS_TARGET | Target (e.g. a file path) for periodic scheduler-status logging. | "" |
SGLANG_LOG_SCHEDULER_STATUS_INTERVAL | Interval (seconds) between scheduler-status log lines. | 60.0 |
Constrained Decoding (Grammar)
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_GRAMMAR_POLL_INTERVAL | Poll interval (seconds) for asynchronous grammar compilation. | 0.005 |
SGLANG_GRAMMAR_MAX_POLL_ITERATIONS | Maximum poll iterations before grammar compilation is treated as stuck. | 10000 |
Scheduler & Batching
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_INIT_NEW_TOKEN_RATIO | Initial new-token ratio used for memory planning. | 0.7 |
SGLANG_MIN_NEW_TOKEN_RATIO_FACTOR | Floor factor for the new-token ratio after decay. | 0.14 |
SGLANG_NEW_TOKEN_RATIO_DECAY_STEPS | Number of steps over which the new-token ratio decays. | 600 |
SGLANG_RETRACT_DECODE_STEPS | Number of decode steps to look ahead when deciding to retract. | 20 |
SGLANG_EMPTY_CACHE_INTERVAL | Interval (seconds) at which to empty the device cache; set this if memory accumulates over a long serving period. -1 disables. | -1 |
SGLANG_FORCE_STREAM_INTERVAL | For non-streaming requests, flush intermediate output batches to the tokenizer manager every N decoded tokens (lower to 1 for accurate TTFT benchmarking). | 50 |
SGLANG_DYNAMIC_CHUNKING_SMOOTH_FACTOR | Smoothing factor for dynamic prefill chunking. | 0.75 |
SGLANG_SWA_EVICTION_INTERVAL_MULTIPLIER | Multiplier applied to the sliding-window-attention eviction interval. | 1.0 |
SGLANG_ENABLE_UNIFIED_RADIX_TREE | Use the unified radix-tree cache implementation. | false |
SGLANG_EXPERIMENTAL_CPP_RADIX_TREE | Use the experimental C++ radix-tree implementation. | false |
SGLANG_RADIX_FORCE_MISS | Force radix-cache misses (debugging/benchmarking). | false |
SGLANG_SCHEDULER_SKIP_ALL_GATHER | Skip the scheduler all-gather step. | false |
SGLANG_ENABLE_WAR_BARRIER | Force-enable the write-after-read barrier for the overlap scheduler even when CUDA is not detected (e.g. AMD/ROCm). On CUDA the barrier is always enabled. | false |
SGLANG_PP_SKIP_PURE_CHUNKED_OUTPUT_COMM | In pipeline parallel, skip output send/recv when a batch is entirely non-final chunked-prefill requests. | false |
SGLANG_KILLPG_ON_SCHEDULER_EXCEPTION | Kill the whole process group when the scheduler raises an exception. | false |
SGLANG_REQUEST_STATE_WAIT_TIMEOUT | Tokenizer-manager request-state wait timeout (seconds). | 4 |
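The three new-token-ratio settings above (initial ratio, floor factor, decay steps) can be related in a small sketch. The table does not state the shape of the decay, so the linear schedule here is an assumption, and `new_token_ratio_schedule` is a hypothetical helper rather than an SGLang function.

```python
def new_token_ratio_schedule(init_ratio=0.7, min_factor=0.14, decay_steps=600):
    """Return a function giving the new-token ratio after n decay steps."""
    # The floor is expressed as a fraction of the initial ratio.
    min_ratio = init_ratio * min_factor
    # Assumption: the ratio falls by a constant amount per step (linear decay).
    per_step = (init_ratio - min_ratio) / decay_steps

    def ratio_at(n):
        return max(init_ratio - per_step * n, min_ratio)

    return ratio_at
```

Under these assumptions the ratio starts at 0.7 and settles at the floor of 0.7 × 0.14 = 0.098 once 600 steps have passed.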
PD Disaggregation (Runtime)
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_DISAGGREGATION_THREAD_POOL_SIZE | Thread-pool size for KV transfers. Defaults to a value computed from the CPU count at runtime. | Not set (computed at runtime) |
SGLANG_DISAGGREGATION_QUEUE_SIZE | Disaggregation transfer queue size. | 4 |
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT | Timeout (seconds) for the disaggregation bootstrap handshake. | 300 |
SGLANG_DISAGGREGATION_WAITING_TIMEOUT | Timeout (seconds) for a request waiting on KV transfer. | 300 |
SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL | Interval (seconds) between disaggregation heartbeats. | 5.0 |
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE | Consecutive heartbeat failures tolerated before a peer is considered dead. | 2 |
SGLANG_DISAGGREGATION_BOOTSTRAP_ENTRY_CLEANUP_INTERVAL | Interval (seconds) for cleaning up stale bootstrap entries. | 120 |
SGLANG_DISAGGREGATION_NIXL_BACKEND | NIXL transport backend for disaggregation. | UCX |
SGLANG_DISAGGREGATION_NIXL_BACKEND_PARAMS | JSON parameters passed to the NIXL backend. | |
SGLANG_DISAGGREGATION_ALL_CP_RANKS_TRANSFER | Have all context-parallel ranks participate in KV transfer. | false |
SGLANG_DISAGGREGATION_FORCE_QUERY_PREFILL_DP_RANK | Force querying the prefill DP rank for routing. | false |
SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS | Extra slots in req_to_token_pool for decode workers (effective when max_num_reqs greater than 32), letting more KV transfers overlap decode. | 0 |
Mooncake KV Store & Transfer
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_HICACHE_MOONCAKE_CONFIG_PATH | Path to the HiCache Mooncake store config file. | Not set |
SGLANG_HICACHE_MOONCAKE_REUSE_TE | Reuse the Mooncake transfer engine across HiCache operations. | true |
SGLANG_MOONCAKE_SEND_AUX_TCP | Send Mooncake AUX data over TCP. | false |
SGLANG_ENABLE_FAILED_SESSION_PROBE | Probe failed Mooncake sessions for recovery. | false |
SGLANG_FAILED_SESSION_PROBE_INTERVAL_S | Interval (seconds) between failed-session probes. | 30.0 |
MOONCAKE_MASTER | Address of the Mooncake master. | Not set |
MOONCAKE_CLIENT | Mooncake client identifier. | Not set |
MOONCAKE_LOCAL_HOSTNAME | Local hostname advertised to Mooncake. | localhost |
MOONCAKE_TE_META_DATA_SERVER | Mooncake transfer-engine metadata server. | P2PHANDSHAKE |
MOONCAKE_GLOBAL_SEGMENT_SIZE | Mooncake global segment size. | 4gb |
MOONCAKE_PROTOCOL | Mooncake transport protocol. | rdma |
MOONCAKE_DEVICE | Mooncake RDMA device(s). | "" |
MOONCAKE_MASTER_METRICS_PORT | Port for Mooncake master metrics. | 9003 |
MOONCAKE_CHECK_SERVER | Check connectivity to the Mooncake server on startup. | false |
MOONCAKE_STANDALONE_STORAGE | Run Mooncake in standalone storage mode. | false |
MOONCAKE_ENABLE_SSD_OFFLOAD | Enable SSD offload in Mooncake. | false |
MOONCAKE_OFFLOAD_FILE_STORAGE_PATH | File storage path for Mooncake SSD offload. | Not set |
ENABLE_ASCEND_TRANSFER_WITH_MOONCAKE | Enable Ascend NPU transfers via Mooncake. | false |
ASCEND_NPU_PHY_ID | Physical Ascend NPU id used for Mooncake transfers. -1 auto-detects. | -1 |
Attention & Kernels
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_TRITON_DECODE_ATTN_STATIC_KV_SPLITS | Use static KV splits in the Triton decode-attention kernel. | false |
SGLANG_MUSA_FA3_FORCE_UPDATE_METADATA | Force FA3 metadata updates on MThreads MUSA. | false |
SGLANG_SKIP_SGL_KERNEL_VERSION_CHECK | Skip the sgl-kernel version compatibility check. | false |
Deterministic Inference
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_ENABLE_DETERMINISTIC_INFERENCE | Enable deterministic inference (fixed reduction/accumulation order). | false |
SGLANG_USE_1STAGE_ALLREDUCE | Use the 1-stage all-reduce kernel on AMD (deterministic, fixed accumulation order). If unset, it is auto-enabled when deterministic inference is on. | false |
SGLANG_FLASHINFER_PREFILL_SPLIT_TILE_SIZE | FlashInfer prefill split-tile size for deterministic attention. | 4096 |
SGLANG_FLASHINFER_DECODE_SPLIT_TILE_SIZE | FlashInfer decode split-tile size for deterministic attention. | 2048 |
SGLANG_TRITON_PREFILL_TRUNCATION_ALIGN_SIZE | Triton prefill truncation alignment size for deterministic attention. | 4096 |
SGLANG_TRITON_DECODE_SPLIT_TILE_SIZE | Triton decode split-tile size for deterministic attention. | 256 |
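
SGLANG_USE_1STAGE_ALLREDUCE behaves as a three-state setting: explicitly on, explicitly off, or unset (in which case it follows deterministic inference). A minimal sketch of that resolution order is shown below. The helper names and the set of accepted truthy spellings are assumptions made for illustration; SGLang's own parsing may differ.

```python
import os

# Assumption: these spellings count as "true"; anything else counts as "false".
_TRUTHY = {"1", "true", "yes", "on"}


def _read_flag(env, name):
    """Return True/False if the variable is set, or None if it is unset."""
    raw = env.get(name)
    if raw is None:
        return None
    return raw.strip().lower() in _TRUTHY


def use_1stage_allreduce(env=None):
    """Resolve the effective value of SGLANG_USE_1STAGE_ALLREDUCE."""
    env = os.environ if env is None else env
    explicit = _read_flag(env, "SGLANG_USE_1STAGE_ALLREDUCE")
    if explicit is not None:
        # An explicit setting always wins, including an explicit "false".
        return explicit
    # Unset: follow the deterministic-inference switch (default false).
    return bool(_read_flag(env, "SGLANG_ENABLE_DETERMINISTIC_INFERENCE"))


if __name__ == "__main__":
    print(use_1stage_allreduce({"SGLANG_ENABLE_DETERMINISTIC_INFERENCE": "1"}))
```

The point of the sketch is that "unset" and "false" are not equivalent for this variable: only an unset value inherits from SGLANG_ENABLE_DETERMINISTIC_INFERENCE.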
Speculative Decoding & Overlap
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_ENABLE_OVERLAP_PLAN_STREAM | Plan the next step on a separate stream to overlap with the current step (Overlap Spec V2). | false |
SGLANG_SPEC_ENABLE_STRICT_FILTER_CHECK | Enable strict filter checks in speculative decoding. | true |
SGLANG_SPEC_SKIP_ZERO_STEP_DRAFT_EXTEND | Skip draft_extend while adaptive spec is at steps=0; saves a draft forward but the draft KV goes stale. | false |
SGLANG_NGRAM_FORCE_GREEDY_VERIFY | Force greedy verification for the n-gram speculative path. | false |
SGLANG_SANITIZE_NAN_LOGITS | Sanitize NaN logits before sampling kernels and emit a throttled warning. | true |
EPLB (Expert Parallel Load Balancing)
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR | Output directory for the expert-distribution recorder. | /tmp |
SGLANG_ENABLE_EPLB_BALANCEDNESS_METRIC | Emit an EPLB balancedness metric. | false |
SGLANG_LOG_EXPERT_LOCATION_METADATA | Log expert-location metadata. | false |
SGLANG_EXPERT_LOCATION_UPDATER_LOG_INPUT | Log inputs to the expert-location updater. | false |
SGLANG_EXPERT_LOCATION_UPDATER_LOG_METRICS | Log metrics from the expert-location updater. | false |
AMD & ROCm
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_USE_AITER_AG | Use the AITER all-gather implementation. | true |
SGLANG_USE_AITER_UNIFIED_ATTN | Use the AITER unified attention kernel. | false |
SGLANG_USE_AITER_FP8_PER_TOKEN | Use AITER FP8 per-token quantization. | false |
SGLANG_USE_AITER_MOE_GU_ITLV | Select the AITER MoE gate/up tile layout: true interleaves, false uses the separated layout. | true |
SGLANG_AITER_FUSE_RMSNORM_PAD | Fuse the residual-add + RMSNorm + zero-pad triplet before the MoE block via the AITER Triton kernel (TP=1, post-attention layernorm path only). | false |
SGLANG_AITER_KV_CACHE_LAYOUT | Physical layout for the MHA KV cache on AITER: nhd or vectorized_5d (SHUFFLE layout enabling pa_decode_gluon). | nhd |
SGLANG_ROCM_FUSED_DECODE_MLA | Use the fused decode MLA kernel on ROCm. | false |
SGLANG_ROCM_DISABLE_LINEARQUANT | Disable linear-layer quantization on ROCm. | false |
NPU (Ascend)
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_NPU_DISABLE_ACL_FORMAT_WEIGHT | Disable ACL-format weight conversion on NPU. | false |
SGLANG_NPU_USE_MULTI_STREAM | Use multiple streams on NPU. | false |
SGLANG_NPU_USE_MLAPO | Use the MLAPO path on NPU. | false |
SGLANG_NPU_FUSED_MOE_MODE | Fused MoE mode selector for NPU. | 1 |
SGLANG_NPU_FORWARD_NATIVE_GELUTANH | Use the native gelu-tanh activation forward (for Skywork-Reward-Gemma-2-27B-v0.2). | false |
SGLANG_NPU_FORWARD_NATIVE_GEMMA_RMS_NORM | Use the native Gemma RMSNorm forward (for Skywork-Reward-Gemma-2-27B-v0.2). | false |
SGLANG_USE_AG_AFTER_QLORA | Delay all-gather until after QLoRA for better DeepSeek V3.2 performance. | false |
SGLANG_EXPERIMENTAL_LORA_OPTI | Master switch for the experimental TRT-LLM LoRA fast path. When off, all fine-grained optimization switches read as false. | false |
SGLANG_ZBAL_LOCAL_MEM_SIZE | Local memory size for the ZBAL (zero-buffer accelerate library) path (NPU only). | 0 |
SGLANG_ZBAL_BOOTSTRAP_URL | Bootstrap URL for the ZBAL path (NPU only). | "" |
Apple Silicon (MLX / MPS)
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_USE_MLX | Use the MLX backend on Apple Silicon. | false |
SGLANG_MLX_USE_CUSTOM_ROPE | Use the custom RoPE kernel on MLX. | false |
SGLANG_MLX_FUSE_SWIGLU | Fuse the SwiGLU activation on MLX. | false |
SGLANG_MLX_CLEAR_CACHE_STEPS | Number of decode steps between mx.clear_cache() calls. 0 disables cache clearing. | 256 |
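
The periodic behaviour of SGLANG_MLX_CLEAR_CACHE_STEPS can be pictured as a simple step counter. The sketch below only decides when a clear would be due; it does not call MLX. It assumes decode steps are counted from 1, which is not stated in the table, and `should_clear_cache` is a hypothetical name.

```python
def should_clear_cache(step, clear_every=256):
    """Return True when a cache clear would be due at this decode step.

    Assumption: `step` counts completed decode steps starting at 1.
    A value of 0 for `clear_every` disables clearing, as in the table.
    """
    if clear_every <= 0:
        return False
    return step > 0 and step % clear_every == 0


if __name__ == "__main__":
    due = [s for s in range(1, 1025) if should_clear_cache(s)]
    print(due)
```

A smaller value releases cached memory more often but adds the cost of the clear call to more decode steps.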
Multimodal (VLM)
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_VLM_CACHE_SIZE_MB | Size (MB) of the VLM feature cache. | 100 |
SGLANG_IMAGE_MAX_PIXELS | Maximum number of pixels per image before resizing. | 12845056 |
SGLANG_RESIZE_RESAMPLE | Resampling filter used when resizing images (e.g. bilinear, bicubic). | "" |
SGLANG_MM_SKIP_COMPUTE_HASH | Skip computing multimodal-item hashes. | false |
SGLANG_MM_AVOID_RETOKENIZE | For pre-tokenized (list[int]) multimodal prompts, preserve the user’s original tokens to avoid retokenization drift. | true |
SGLANG_VIT_ENABLE_CUDA_GRAPH | Capture the vision encoder (ViT) in a CUDA graph. | false |
SGLANG_USE_CUDA_IPC_TRANSPORT | Use CUDA IPC transport for multimodal-item tensors. | false |
SGLANG_USE_IPC_POOL_HANDLE_CACHE | Cache CUDA IPC pool handles. | false |
SGLANG_MM_FEATURE_CACHE_MB | Size (MB) of the multimodal feature cache. | 1024 |
SGLANG_MM_ITEM_MEM_POOL_RECYCLE_INTERVAL_SEC | Interval (seconds) for recycling the multimodal-item memory pool. | 0.05 |
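
To get a feel for SGLANG_IMAGE_MAX_PIXELS, the sketch below computes the dimensions an oversized image would need in order to fit under the limit while keeping its aspect ratio. This is an illustration of the pixel budget only: the actual resize path may also round to patch-size multiples or use a different rounding rule, and `fit_to_max_pixels` is a hypothetical helper.

```python
import math


def fit_to_max_pixels(width, height, max_pixels=12845056):
    """Scale (width, height) down so that width * height <= max_pixels.

    Assumption: a uniform scale factor with floor rounding; the real
    resize logic is not described in this table.
    """
    if width * height <= max_pixels:
        return width, height  # Already within budget; leave untouched.
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


if __name__ == "__main__":
    print(fit_to_max_pixels(8192, 6272))
```

The default budget is numerically equal to 4096 x 3136 pixels, so an image of that size or smaller passes through unchanged in this sketch.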
Encoder / EPD (Multimodal Disaggregation)
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_ENCODER_MM_RECEIVER_MODE | Encoder receiver selection used by EPD paths: http or grpc. | http |
SGLANG_ENCODER_GRPC_TIMEOUT_SECS | gRPC timeout (seconds) for encoder communication. | 60 |
SGLANG_ENCODER_RECV_TIMEOUT | Encoder receive timeout (seconds). | 180.0 |
SGLANG_ENCODER_SEND_TIMEOUT | Encoder send timeout (seconds). | 180.0 |
SGLANG_ENCODER_HTTP_TIMEOUT | Encoder HTTP timeout (seconds). | 1800.0 |
SGLANG_ENCODER_REQ_TIMEOUT | Encoder per-request timeout (seconds). | 180.0 |
SGLANG_ENCODER_DISPATCH_MIN_ITEMS | Minimum items before the encoder dispatches a batch. | 2 |
SGLANG_ENCODER_MAX_BATCH_SIZE | Maximum encoder batch size. | 8 |
SGLANG_ENCODER_PREPROC_WORKERS | Number of encoder preprocessing workers. | 8 |
SGLANG_ENCODER_IMAGE_PROCESSOR_USE_GPU | Run the image processor on the GPU. | false |
SGLANG_ENCODER_BOOTSTRAP_HEALTH_CHECK_INTERVAL | EncoderBootstrapServer health-check interval (seconds). 0 disables it. | 10.0 |
SGLANG_ENCODER_BOOTSTRAP_HEALTH_CHECK_TIMEOUT | EncoderBootstrapServer health-check timeout (seconds). | 2.0 |
SGLANG_EMBEDDING_POOL_SIZE_MB | Persistent receiver-side GPU embedding pool size (MB) for Mooncake EPD transport. 0 disables (per-request register/deregister). | 4096 |
SGLANG_ENCODER_DP_WORKER_MAX_INFLIGHT | Maximum in-flight requests per encoder DP worker. | 64 |
SGLANG_BACKUP_PORT_BASE | Base port for elastic-EP backup ports. | 10000 |
HTTP & gRPC Server
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_TIMEOUT_KEEP_ALIVE | HTTP keep-alive timeout (seconds). | 5 |
SGLANG_UVICORN_WORKER_HEALTHCHECK_TIMEOUT | Uvicorn multiprocess supervisor per-worker health-check interval (seconds). | 10 |
SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION | Have the /health endpoint run a generation as part of the check. | true |
SGLANG_WARMUP_TIMEOUT | If a warmup forward batch takes longer than this many seconds, the server crashes to avoid hanging. -1 disables; increase (e.g. to 1800) to accommodate kernel JIT precompile. | -1 |
SGLANG_ENABLE_GRPC | Enable the native gRPC server (internal, not yet user-facing). | false |
SGLANG_GRPC_PORT | Port for the native gRPC server. | Not set |
SGLANG_GRANIAN_PARENT_PID | Parent PID for the Granian HTTP/2 worker supervisor. | Not set |
NUMA & CPU
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_NUMA_BIND_V2 | Use the v2 NUMA binding implementation. | true |
SGLANG_AUTO_NUMA_BIND | Automatically bind processes to NUMA nodes. | false |
SGLANG_CRASH_ON_NUMA_BIND_FAILURE | Crash if NUMA binding fails instead of warning. | false |
Metrics
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_ENABLE_METRICS_DEVICE_TIMER | Enable device-timer-based metrics. | false |
SGLANG_ENABLE_METRICS_DP_ATTENTION | Enable data-parallel attention metrics. | false |
External Models
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_EXTERNAL_MODEL_PACKAGE | Python package providing external model implementations. | "" |
SGLANG_EXTERNAL_MM_MODEL_ARCH | External multimodal model architecture name. | "" |
SGLANG_EXTERNAL_MM_PROCESSOR_PACKAGE | Python package providing the external multimodal processor. | "" |
Plugin System
| Environment Variable | Description | Default Value |
|---|---|---|
SGLANG_PLATFORM | Platform plugin name to load. | "" |
SGLANG_PLUGINS | Comma-separated list of plugins to load. | "" |
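
SGLANG_PLUGINS takes a comma-separated list, and its default is the empty string. A minimal sketch of splitting such a value into names follows. Trimming whitespace and dropping empty entries are assumptions for illustration, as is the helper name `parse_plugin_list`; the plugin names in the usage example are made up.

```python
import os


def parse_plugin_list(raw):
    """Split a comma-separated value into a list of non-empty names.

    Assumption: surrounding whitespace is ignored and empty entries
    (for example from a trailing comma) are dropped.
    """
    return [name.strip() for name in raw.split(",") if name.strip()]


if __name__ == "__main__":
    # Falls back to the documented default of an empty string.
    print(parse_plugin_list(os.environ.get("SGLANG_PLUGINS", "")))
```

With this parsing, an unset or empty variable yields an empty list, and a value such as `my_plugin, other_plugin,` yields two names.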
