--model-path selects the checkpoint to serve; --load-format and the weight-loading flags below control how those weights are read into memory. To stream weights from cloud object storage (S3/GCS/Azure), see Loading Models from Object Storage.
How loading works
SGLang picks a loader from --load-format, falling back to auto-detection from the checkpoint or model path. The default auto loader reads safetensors and falls back to PyTorch .bin.
When --load-format is auto:
- A Mistral native checkpoint is detected and loaded with mistral.
- A .gguf model path is detected and loaded with gguf.
- An object storage URI (s3://, gs://, az://) is loaded with runai_streamer.
- A remote URI is loaded with remote.
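
The detection rules above can be sketched as a small function. This is only an illustration, not SGLang's implementation: the helper name `guess_load_format`, the `is_mistral_native` flag, the order of the checks, and the rule that any other `scheme://` path counts as a remote URI are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Schemes the text lists as object storage URIs.
OBJECT_STORAGE_SCHEMES = {"s3", "gs", "az"}


def guess_load_format(model_path: str, is_mistral_native: bool = False) -> str:
    """Hypothetical helper mirroring the auto-detection rules listed above."""
    # Assumption: the caller already knows whether the checkpoint uses the
    # Mistral native layout; how SGLang detects this is not covered here.
    if is_mistral_native:
        return "mistral"
    if model_path.lower().endswith(".gguf"):
        return "gguf"
    scheme = urlparse(model_path).scheme.lower()
    if scheme in OBJECT_STORAGE_SCHEMES:
        return "runai_streamer"
    if "://" in model_path:
        # Assumption: any other scheme://... path is treated as a remote URI.
        return "remote"
    # Nothing special detected: stay on the default loader.
    return "auto"


for path in ("/models/model.gguf", "s3://bucket/model", "org/model-name"):
    print(path, "->", guess_load_format(path))
```

The order of the checks matters for paths that match more than one rule (for example a `.gguf` file in object storage); the order used here is a guess.
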
Load formats
Set with --load-format:
| Format | Description |
|---|---|
| auto | Default. Load safetensors if available, otherwise fall back to the PyTorch .bin format. |
| safetensors | Load weights in the safetensors format. |
| pt | Load weights in the PyTorch .bin format. |
| npcache | Load PyTorch-format weights and store a numpy cache to speed up subsequent loads. Only supports .bin checkpoints. |
| dummy | Initialize weights with random values, for profiling. |
| sharded_state | Each tensor-parallel worker reads only its own pre-sharded shard rather than the full checkpoint, giving a fast load path for large TP models. See examples/runtime/engine/save_sharded_state.py for creating a sharded checkpoint. |
| fastsafetensors | Load safetensors using the fastsafetensors iterator. |
| layered | Load weights layer by layer, so a layer can be quantized before the next is loaded, lowering the peak memory envelope. |
| gguf | Load weights in the GGUF format. Auto-detected from a .gguf model path. |
| bitsandbytes | Load weights using bitsandbytes quantization. |
| mistral | Load a Mistral native-format checkpoint. Auto-detected for such checkpoints. |
| flash_rl | Load a BF16/FP16 checkpoint with native SGLang FP8 quantization for RL training. Requires --rl-quant-profile. |
| runai_streamer | Stream weights from SSDs, shared filesystems, or object storage. See Loading Models from Object Storage. |
| remote | Load tensors from a remote KV/filesystem connector. Auto-detected for remote URIs. |
| remote_instance | Pull weights over the network from another running SGLang instance (the "seed") rather than from disk. Configured with the --remote-instance-weight-loader-* flags. |
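
As a rough illustration of selecting a format, the sketch below builds a launch command and rejects names that are not in the table. The helper `build_launch_args` and the model path are hypothetical, and the `python -m sglang.launch_server` entry point is assumed rather than taken from this page.

```python
import shlex

# Format names copied from the table above.
LOAD_FORMATS = {
    "auto", "safetensors", "pt", "npcache", "dummy", "sharded_state",
    "fastsafetensors", "layered", "gguf", "bitsandbytes", "mistral",
    "flash_rl", "runai_streamer", "remote", "remote_instance",
}


def build_launch_args(model_path: str, load_format: str = "auto") -> list[str]:
    """Hypothetical helper: build an argv list for the server launcher."""
    if load_format not in LOAD_FORMATS:
        raise ValueError(f"unknown load format: {load_format!r}")
    return [
        # Assumption: the server is started through this module entry point.
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--load-format", load_format,
    ]


# Example: random weights for profiling, so no real checkpoint is read.
args = build_launch_args("/models/my-model", "dummy")
print(shlex.join(args))
```

Validating the name up front catches typos before the server starts; some formats need further flags (for example flash_rl requires --rl-quant-profile), which this sketch does not add.
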
Model loader extra config
--model-loader-extra-config takes a JSON string passed to the loader selected by --load-format.
| Load format | Key | Description | Default |
|---|---|---|---|
| auto / safetensors / pt / npcache | enable_multithread_load (bool) | Read weight shards with a thread pool instead of sequentially. | true |
| auto / safetensors / pt / npcache | num_threads (int) | Number of worker threads when multithreaded loading is enabled. | 8 |
| sharded_state | pattern (str) | Filename pattern for per-rank shards. | model-rank-{rank}-part-{part}.safetensors |
| bitsandbytes | qlora_adapter_name_or_path (str) | QLoRA adapter to apply on top of the bitsandbytes-quantized base weights. | - |
| runai_streamer | distributed, concurrency, memory_limit | Streaming controls. See Loading Models from Object Storage. | See linked page |
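
Because the flag takes a JSON string, it is easy to get the quoting wrong on the command line. The sketch below generates the string with `json.dumps` and quotes it for a POSIX shell. The helper name `loader_extra_config` and the value 16 are illustrative; the two keys come from the table above.

```python
import json
import shlex


def loader_extra_config(enable_multithread_load: bool = True,
                        num_threads: int = 8) -> str:
    """Serialize the multithreaded-load keys from the table as JSON."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    return json.dumps({
        "enable_multithread_load": enable_multithread_load,
        "num_threads": num_threads,
    })


config = loader_extra_config(num_threads=16)
# shlex.quote keeps the JSON intact when the flag is pasted into a shell.
print("--model-loader-extra-config " + shlex.quote(config))
```

Generating the string this way guarantees lowercase JSON booleans (`true`, not Python's `True`) and balanced quotes.
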
Weight-loading performance flags
Top-level arguments that tune how safetensors weights are read, independent of --load-format.
| Flag | Description | Default |
|---|---|---|
| --download-dir | Directory used to download and cache Hugging Face model files. | HF default |
| --weight-loader-disable-mmap | Disable mmap while loading safetensors. Can help on filesystems where mmap is slow. | off |
| --weight-loader-prefetch-checkpoints | Prefetch checkpoint files into the OS page cache before loading. Each rank prefetches a fraction of the shards, cutting total network I/O on shared filesystems (NFS/Lustre) from N×checkpoint to 1×checkpoint. Recommended for models on network storage. | off |
| --weight-loader-prefetch-num-threads | Threads per rank for checkpoint prefetching. | 4 |
| --weight-loader-drop-cache-after-load | Call posix_fadvise(DONTNEED) on each safetensors shard after loading it, freeing page cache. | off |
| --custom-weight-loader | Import path(s) of a custom weight-loading function, e.g. my_package.weight_load_func. | - |
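
To make the "N×checkpoint to 1×checkpoint" claim concrete, here is a back-of-the-envelope sketch. It reads N as the number of ranks loading the same checkpoint and assumes they share one page cache. The function names and the 140 GB / 8-rank figures are made up for illustration, and the flags are assumed to be simple on/off switches, since the table lists their default as off.

```python
def shared_fs_read_gb(checkpoint_gb: float, num_ranks: int,
                      prefetch: bool) -> float:
    """Estimate total data pulled from network storage during load."""
    # Without prefetching, the worst case is every rank reading every shard.
    # With prefetching, each rank pulls about 1/num_ranks of the shards into
    # the shared page cache, so the checkpoint crosses the network once.
    return checkpoint_gb if prefetch else checkpoint_gb * num_ranks


def prefetch_flags(num_threads: int = 4) -> list[str]:
    """Hypothetical helper: flags from the table that enable prefetching."""
    return [
        "--weight-loader-prefetch-checkpoints",
        "--weight-loader-prefetch-num-threads", str(num_threads),
    ]


# Illustrative numbers only: a 140 GB checkpoint loaded by 8 ranks.
print(shared_fs_read_gb(140.0, 8, prefetch=False))
print(shared_fs_read_gb(140.0, 8, prefetch=True))
print(" ".join(prefetch_flags()))
```

Real traffic depends on the filesystem's own caching and on how the ranks are spread across nodes, so treat the estimate as an upper and lower bound rather than a measurement.
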
