--model-path selects the checkpoint to serve; --load-format and the weight-loading flags below control how those weights are read into memory. To stream weights from cloud object storage (S3/GCS/Azure), see Loading Models from Object Storage.

How loading works

SGLang picks a loader from --load-format, falling back to auto-detection from the checkpoint or model path. The default auto loader reads safetensors and falls back to PyTorch .bin.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --load-format auto
Some formats are auto-detected and override auto:
  • A Mistral native checkpoint is detected and loaded with mistral.
  • A .gguf model path is detected and loaded with gguf.
  • An object storage URI (s3://, gs://, az://) is loaded with runai_streamer.
  • A remote URI is loaded with remote.
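The override rules above can be approximated with a small path classifier. This is a sketch only, not SGLang's actual detection code: the helper name `guess_load_format`, the order in which the checks run, and the treatment of any other `scheme://` path as a remote URI are assumptions made for illustration. Mistral native checkpoints are left out because detecting them requires inspecting the checkpoint contents, which this page does not describe.

```python
# Hypothetical helper: mirrors the path-based override rules listed above.
# The precedence of the checks is an assumption, not documented behavior.
OBJECT_STORAGE_SCHEMES = ("s3://", "gs://", "az://")


def guess_load_format(model_path: str) -> str:
    """Return the load format a model path would plausibly resolve to."""
    lowered = model_path.lower()
    if lowered.startswith(OBJECT_STORAGE_SCHEMES):
        # Object storage URIs are streamed with runai_streamer.
        return "runai_streamer"
    if lowered.endswith(".gguf"):
        # A .gguf model path selects the GGUF loader.
        return "gguf"
    if "://" in lowered:
        # Assumption: any other URI scheme counts as a "remote URI".
        return "remote"
    # Everything else stays on the default loader.
    return "auto"


for path in ("Qwen/Qwen3.6-35B-A3B", "s3://bucket/model", "/models/model.gguf"):
    print(path, "->", guess_load_format(path))
```

Passing --load-format explicitly avoids relying on detection when a path is ambiguous.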

Load formats

Set with --load-format:
| Format | Description |
| --- | --- |
| auto | Default. Load safetensors if available, otherwise fall back to the PyTorch .bin format. |
| safetensors | Load weights in the safetensors format. |
| pt | Load weights in the PyTorch .bin format. |
| npcache | Load PyTorch-format weights and store a numpy cache to speed up subsequent loads. Only supports .bin checkpoints. |
| dummy | Initialize weights with random values, for profiling. |
| sharded_state | Each tensor-parallel worker reads only its own pre-sharded shard rather than the full checkpoint, giving a fast load path for large TP models. See examples/runtime/engine/save_sharded_state.py for creating a sharded checkpoint. |
| fastsafetensors | Load safetensors using the fastsafetensors iterator. |
| layered | Load weights layer by layer, so a layer can be quantized before the next is loaded, lowering the peak memory envelope. |
| gguf | Load weights in the GGUF format. Auto-detected from a .gguf model path. |
| bitsandbytes | Load weights using bitsandbytes quantization. |
| mistral | Load a Mistral native-format checkpoint. Auto-detected for such checkpoints. |
| flash_rl | Load a BF16/FP16 checkpoint with native SGLang FP8 quantization for RL training. Requires --rl-quant-profile. |
| runai_streamer | Stream weights from SSDs, shared filesystems, or object storage. See Loading Models from Object Storage. |
| remote | Load tensors from a remote KV/filesystem connector. Auto-detected for remote URIs. |
| remote_instance | Pull weights over the network from another running SGLang instance (the “seed”) rather than from disk. Configured with the --remote-instance-weight-loader-* flags. |

Model loader extra config

--model-loader-extra-config takes a JSON string passed to the loader selected by --load-format.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 16}'
| Load format | Key | Description | Default |
| --- | --- | --- | --- |
| auto / safetensors / pt / npcache | enable_multithread_load (bool) | Read weight shards with a thread pool instead of sequentially. | true |
| auto / safetensors / pt / npcache | num_threads (int) | Number of worker threads when multithreaded loading is enabled. | 8 |
| sharded_state | pattern (str) | Filename pattern for per-rank shards. | model-rank-{rank}-part-{part}.safetensors |
| bitsandbytes | qlora_adapter_name_or_path (str) | QLoRA adapter to apply on top of the bitsandbytes-quantized base weights. | |
| runai_streamer | distributed, concurrency, memory_limit | Streaming controls. See Loading Models from Object Storage. | See linked page |
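Because --model-loader-extra-config expects a JSON string, hand-quoting it in a shell is easy to get wrong (for example, writing Python's `True` instead of JSON's `true`). The sketch below builds the argument list from a Python dict instead. The helper name `build_launch_command` is hypothetical; only the flags and keys already shown on this page are used.

```python
import json
import shlex


def build_launch_command(model_path: str, load_format: str, extra_config: dict) -> list[str]:
    """Assemble an argv list for sglang.launch_server (hypothetical helper)."""
    # json.dumps turns Python booleans into JSON booleans (True -> true).
    extra_json = json.dumps(extra_config)
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--load-format", load_format,
        "--model-loader-extra-config", extra_json,
    ]


argv = build_launch_command(
    "Qwen/Qwen3.6-35B-A3B",
    "safetensors",
    {"enable_multithread_load": True, "num_threads": 16},
)

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(argv))
```

The list form can be handed directly to `subprocess.run(argv)`, which bypasses shell quoting altogether.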

Weight-loading performance flags

Top-level arguments that tune how safetensors weights are read, independent of --load-format.
| Flag | Description | Default |
| --- | --- | --- |
| --download-dir | Directory used to download and cache Hugging Face model files. | HF default |
| --weight-loader-disable-mmap | Disable mmap while loading safetensors. Can help on filesystems where mmap is slow. | off |
| --weight-loader-prefetch-checkpoints | Prefetch checkpoint files into the OS page cache before loading. Each rank prefetches a fraction of the shards, cutting total network I/O on shared filesystems (NFS/Lustre) from N×checkpoint to 1×checkpoint. Recommended for models on network storage. | off |
| --weight-loader-prefetch-num-threads | Threads per rank for checkpoint prefetching. | 4 |
| --weight-loader-drop-cache-after-load | Call posix_fadvise(DONTNEED) on each safetensors shard after loading it, freeing page cache. | off |
| --custom-weight-loader | Import path(s) of a custom weight-loading function, e.g. my_package.weight_load_func. | |
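The N×checkpoint to 1×checkpoint claim for --weight-loader-prefetch-checkpoints can be checked with simple arithmetic. The sketch below is illustrative only: it assumes that without prefetching each of the N ranks pulls every shard over the network, and it uses a round-robin split of shards across ranks, which is an assumption rather than SGLang's documented assignment strategy. The function names and shard sizes are hypothetical.

```python
def assign_prefetch_shards(shards: list[str], world_size: int) -> dict[int, list[str]]:
    """Give each rank a disjoint slice of the shard list (round-robin, assumed)."""
    return {rank: shards[rank::world_size] for rank in range(world_size)}


def network_bytes_read(shard_sizes: dict[str, int], world_size: int, prefetch: bool) -> int:
    """Total bytes pulled from shared storage across all ranks."""
    checkpoint_bytes = sum(shard_sizes.values())
    if prefetch:
        # Each shard is fetched once by its owning rank; other ranks
        # then read it from the OS page cache.
        return checkpoint_bytes
    # Assumed baseline: every rank reads every shard over the network.
    return world_size * checkpoint_bytes


GIB = 1024**3
shard_sizes = {f"model-{i:05d}.safetensors": 5 * GIB for i in range(8)}  # made-up sizes
plan = assign_prefetch_shards(sorted(shard_sizes), world_size=4)

print(network_bytes_read(shard_sizes, 4, prefetch=False) // GIB, "GiB without prefetch")
print(network_bytes_read(shard_sizes, 4, prefetch=True) // GIB, "GiB with prefetch")
```

The saving applies only when the ranks share a page cache, that is, when they run on the same node.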

See also