Environment Variables - SGLang Documentation

SGLang supports several environment variables that configure its runtime behavior on Ascend NPUs. This document lists the commonly used ones and is intended to be kept up to date.

Directly Used in SGLang

Environment Variable	Description	Default Value
`SGLANG_NPU_USE_MLAPO`	Uses the `MLAPO` fusion operator in the attention preprocessing stage of MLA models.	`false`
`SGLANG_USE_FIA_NZ`	Reshapes the KV cache to the FIA NZ format. Requires `SGLANG_NPU_USE_MLAPO` to be enabled as well.	`false`
`SGLANG_NPU_USE_MULTI_STREAM`	Enables dual-stream computation of shared experts and routed experts in DeepSeek models, and dual-stream computation in the DeepSeek DSA Indexer.	`false`
`SGLANG_NPU_DISABLE_ACL_FORMAT_WEIGHT`	Disables casting model weight tensors to a specific NPU ACL format.	`false`
`SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK`	The maximum number of dispatched tokens on each rank.	`128`
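
The dependency between `SGLANG_USE_FIA_NZ` and `SGLANG_NPU_USE_MLAPO` is easy to miss when assembling a launch environment. The following is a minimal sketch of a pre-launch check, not part of SGLang itself: `build_sglang_npu_env` is a hypothetical helper, and the `"true"`/`"false"` string form is an assumption based on how the defaults are written in the table above.

```python
import os

# Names and defaults copied from the table above. The "true"/"false" string
# form is an assumption based on how the defaults are written there.
SGLANG_NPU_DEFAULTS = {
    "SGLANG_NPU_USE_MLAPO": "false",
    "SGLANG_USE_FIA_NZ": "false",
    "SGLANG_NPU_USE_MULTI_STREAM": "false",
    "SGLANG_NPU_DISABLE_ACL_FORMAT_WEIGHT": "false",
    "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK": "128",
}


def build_sglang_npu_env(overrides):
    """Hypothetical helper: merge overrides into the defaults and validate them."""
    unknown = set(overrides) - set(SGLANG_NPU_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown variables: {sorted(unknown)}")
    env = {**SGLANG_NPU_DEFAULTS, **overrides}
    # The table states that SGLANG_USE_FIA_NZ must be enabled together with
    # SGLANG_NPU_USE_MLAPO, so reject FIA NZ on its own.
    if env["SGLANG_USE_FIA_NZ"] == "true" and env["SGLANG_NPU_USE_MLAPO"] != "true":
        raise ValueError("SGLANG_USE_FIA_NZ requires SGLANG_NPU_USE_MLAPO")
    return env


env = build_sglang_npu_env(
    {"SGLANG_NPU_USE_MLAPO": "true", "SGLANG_USE_FIA_NZ": "true"}
)
# Environment for a child process: the current environment plus the overrides.
launch_env = {**os.environ, **env}
```

`launch_env` can then be passed as the `env` argument of `subprocess.run` when starting the server, so the settings apply only to that process.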

Used in DeepEP Ascend

Environment Variable	Description	Default Value
`DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS`	Enables long-sequence token pipelining in the dispatch stage. Specifies the number of tokens transmitted per round on each rank.	`8192`
`DEEPEP_NORMAL_LONG_SEQ_ROUND`	Enables long-sequence token pipelining in the dispatch stage. Specifies the number of rounds transmitted on each rank.	`1`
`DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ`	Enables long-sequence token pipelining in the combine stage. The value `0` means disabled.	`0`
`MOE_ENABLE_TOPK_NEG_ONE`	Must be enabled when the expert IDs processed by DeepEP contain -1.	`0`
`DEEP_NORMAL_MODE_USE_INT8_QUANT`	When set to `1`, quantizes intermediate activations to INT8 in the DeepEP dispatch operator during normal mode, reducing communication volume for W8A8-quantized MoE models. This variable will become a no-op; the quantization behavior will be inferred automatically.	`0`
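
The two dispatch-stage variables describe a per-round token count and a round count. As a rough sketch, one way to pick a round count is to make the rounds jointly cover the largest number of tokens a rank is expected to dispatch. That coverage rule is an assumption for illustration and is not stated in the table; `long_seq_dispatch_env` is a hypothetical helper.

```python
def long_seq_dispatch_env(max_tokens_per_rank, per_round_tokens=8192):
    """Hypothetical helper: derive the dispatch-stage long-sequence settings.

    Assumes (not confirmed by the table above) that
    rounds * per_round_tokens should be at least max_tokens_per_rank.
    """
    if max_tokens_per_rank < 1 or per_round_tokens < 1:
        raise ValueError("token counts must be positive integers")
    # Integer ceiling division: the smallest round count that covers all tokens.
    rounds = (max_tokens_per_rank + per_round_tokens - 1) // per_round_tokens
    return {
        "DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS": str(per_round_tokens),
        "DEEPEP_NORMAL_LONG_SEQ_ROUND": str(rounds),
    }


settings = long_seq_dispatch_env(max_tokens_per_rank=32768)
```

With the default of 8192 tokens per round, 32768 tokens per rank gives a round count of 4 under this assumption.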

Others

Environment Variable	Description	Default Value
`TASK_QUEUE_ENABLE`	Controls the optimization level of the task_queue operator dispatch queue. Detail	`1`
`INF_NAN_MODE_ENABLE`	Controls whether the chip uses saturation mode or INF_NAN mode. Detail	`1`
`STREAMS_PER_DEVICE`	Configures the maximum number of streams for the stream pool. Detail	`32`
`PYTORCH_NPU_ALLOC_CONF`	Controls the behavior of the cache allocator. This variable changes memory usage and may cause performance fluctuations. Detail
`ASCEND_MF_STORE_URL`	The address of the MemFabric config store used during PD (prefill/decode) separation. It is generally set to the IP address of the primary P (prefill) node with an arbitrary port number.
`ASCEND_LAUNCH_BLOCKING`	Controls whether synchronous mode is enabled during operator execution. Detail	`0`
`HCCL_OP_EXPANSION_MODE`	Configures the expansion position for communication algorithm scheduling. Detail
`HCCL_BUFFSIZE`	Controls the size of the buffer area for shared data between two NPUs. The unit is MB, and the value must be greater than or equal to 1. Detail	`200`
`HCCL_SOCKET_IFNAME`	Configures the name of the network card used by the Host during HCCL initialization. Detail
`GLOO_SOCKET_IFNAME`	Configures the network interface name for GLOO communication.
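
Several of the variables above carry constraints, such as `HCCL_BUFFSIZE` being at least 1 MB, that are worth checking before launch. The sketch below is illustrative only: `build_ascend_runtime_env` is a hypothetical helper, and the `tcp://` scheme for `ASCEND_MF_STORE_URL`, the integer-only buffer size, and the reuse of one interface name for both HCCL and GLOO are assumptions that the table does not state.

```python
def build_ascend_runtime_env(hccl_buffsize_mb=200, socket_ifname=None,
                             mf_store_host=None, mf_store_port=None):
    """Hypothetical helper: assemble and validate communication settings."""
    # The table requires HCCL_BUFFSIZE (in MB) to be >= 1. Restricting it to
    # integers is an assumption made here.
    if not isinstance(hccl_buffsize_mb, int) or hccl_buffsize_mb < 1:
        raise ValueError("HCCL_BUFFSIZE must be an integer number of MB >= 1")
    env = {"HCCL_BUFFSIZE": str(hccl_buffsize_mb)}
    if socket_ifname:
        # Assumption: HCCL and GLOO use the same host network interface.
        env["HCCL_SOCKET_IFNAME"] = socket_ifname
        env["GLOO_SOCKET_IFNAME"] = socket_ifname
    if mf_store_host and mf_store_port:
        # Assumption: the config store address takes the form tcp://<ip>:<port>.
        env["ASCEND_MF_STORE_URL"] = f"tcp://{mf_store_host}:{mf_store_port}"
    return env


# 192.0.2.10 is a documentation-only address; replace it with the primary P node.
runtime_env = build_ascend_runtime_env(
    hccl_buffsize_mb=400,
    socket_ifname="eth0",
    mf_store_host="192.0.2.10",
    mf_store_port=24666,
)
```

The interface name `eth0` and port `24666` are placeholders. Use the interface reported on the host and any free port on the primary P node.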
