Model Loading - SGLang Documentation

--model-path selects the checkpoint to serve; --load-format and the weight-loading flags below control how those weights are read into memory. To stream weights from cloud object storage (S3/GCS/Azure), see Loading Models from Object Storage.

How loading works

SGLang picks a loader from --load-format, falling back to auto-detection from the checkpoint or model path. The default auto loader reads safetensors and falls back to PyTorch .bin.

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --load-format auto

When --load-format is left at auto, some inputs are detected and loaded with a more specific format instead:

A Mistral native checkpoint is detected and loaded with mistral.
A .gguf model path is detected and loaded with gguf.
An object storage URI (s3://, gs://, az://) is loaded with runai_streamer.
A remote URI is loaded with remote.
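
The detection rules above can be sketched as a small resolver. This is an illustrative reconstruction, not SGLang's actual code. The function name `resolve_load_format`, its two boolean parameters, the order in which the rules are checked, and the choice to honor an explicit format unchanged are all assumptions. Only the mapping from input to format comes from the list above.

```python
from urllib.parse import urlparse

# Schemes this section lists as object storage URIs.
OBJECT_STORAGE_SCHEMES = {"s3", "gs", "az"}


def resolve_load_format(
    model_path: str,
    load_format: str = "auto",
    is_mistral_native: bool = False,  # hypothetical: outcome of inspecting the checkpoint
    is_remote_uri: bool = False,      # hypothetical: outcome of a remote-connector check
) -> str:
    """Return the load format that would be used (illustrative sketch only)."""
    # Assumption: an explicit --load-format is used unchanged.
    if load_format != "auto":
        return load_format
    if is_mistral_native:
        return "mistral"
    if model_path.endswith(".gguf"):
        return "gguf"
    if urlparse(model_path).scheme in OBJECT_STORAGE_SCHEMES:
        return "runai_streamer"
    if is_remote_uri:
        return "remote"
    # Nothing more specific matched: safetensors first, then PyTorch .bin.
    return "auto"


for path in ("Qwen/Qwen3.6-35B-A3B", "/models/model.gguf", "s3://bucket/model"):
    print(path, "->", resolve_load_format(path))
```

How a Mistral native checkpoint or a remote URI is recognized is not described here, so the sketch takes those results as inputs rather than guessing at the checks.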

Load formats

Set with --load-format:

Format	Description
`auto`	Default. Load `safetensors` if available, otherwise fall back to the PyTorch `.bin` format.
`safetensors`	Load weights in the safetensors format.
`pt`	Load weights in the PyTorch `.bin` format.
`npcache`	Load PyTorch-format weights and store a numpy cache to speed up subsequent loads. Only supports `.bin` checkpoints.
`dummy`	Initialize weights with random values, for profiling.
`sharded_state`	Each tensor-parallel worker reads only its own pre-sharded shard rather than the full checkpoint, giving a fast load path for large TP models. See `examples/runtime/engine/save_sharded_state.py` for creating a sharded checkpoint.
`fastsafetensors`	Load safetensors using the `fastsafetensors` iterator.
`layered`	Load weights layer by layer, so a layer can be quantized before the next is loaded, lowering the peak memory envelope.
`gguf`	Load weights in the GGUF format. Auto-detected from a `.gguf` model path.
`bitsandbytes`	Load weights using bitsandbytes quantization.
`mistral`	Load a Mistral native-format checkpoint. Auto-detected for such checkpoints.
`flash_rl`	Load a BF16/FP16 checkpoint with native SGLang FP8 quantization for RL training. Requires `--rl-quant-profile`.
`runai_streamer`	Stream weights from SSDs, shared filesystems, or object storage. See Loading Models from Object Storage.
`remote`	Load tensors from a remote KV/filesystem connector. Auto-detected for remote URIs.
`remote_instance`	Pull weights over the network from another running SGLang instance (the “seed”) rather than from disk. Configured with the `--remote-instance-weight-loader-*` flags.

Model loader extra config

--model-loader-extra-config takes a JSON string passed to the loader selected by --load-format.

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 16}'
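
Hand-writing JSON inside shell quotes is easy to get wrong. As a minimal sketch, the same command can be assembled in Python so that `json.dumps` produces the JSON and `shlex.join` handles the quoting. The variable names are illustrative; the flags, keys, and values are the ones from the command above.

```python
import json
import shlex

# Same keys and values as the command above.
extra_config = {"enable_multithread_load": True, "num_threads": 16}

argv = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "Qwen/Qwen3.6-35B-A3B",
    # json.dumps emits lowercase true/false, as JSON requires.
    "--model-loader-extra-config", json.dumps(extra_config),
]

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
command = shlex.join(argv)
print(command)
```

When launching from Python, the `argv` list can be passed directly to `subprocess.run`, which avoids shell quoting altogether.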

Load format	Key	Description	Default
`auto` / `safetensors` / `pt` / `npcache`	`enable_multithread_load` (bool)	Read weight shards with a thread pool instead of sequentially.	`true`
`auto` / `safetensors` / `pt` / `npcache`	`num_threads` (int)	Number of worker threads when multithreaded loading is enabled.	8
`sharded_state`	`pattern` (str)	Filename pattern for per-rank shards.	`model-rank-{rank}-part-{part}.safetensors`
`bitsandbytes`	`qlora_adapter_name_or_path` (str)	QLoRA adapter to apply on top of the bitsandbytes-quantized base weights.	—
`runai_streamer`	`distributed`, `concurrency`, `memory_limit`	Streaming controls. See Loading Models from Object Storage.	See linked page
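
To make the `sharded_state` `pattern` key concrete, the sketch below expands the default pattern into filenames. It assumes the `{rank}` and `{part}` placeholders are filled in `str.format` style and that part numbers start at 0. Neither is stated on this page, and `shard_filenames` is a hypothetical helper, not an SGLang function.

```python
# Default per-rank shard pattern from the table above.
DEFAULT_PATTERN = "model-rank-{rank}-part-{part}.safetensors"


def shard_filenames(pattern: str, rank: int, num_parts: int) -> list[str]:
    """Expand the pattern for one tensor-parallel rank (hypothetical helper)."""
    # Assumption: parts are numbered from 0.
    return [pattern.format(rank=rank, part=part) for part in range(num_parts)]


for rank in range(2):
    print(rank, shard_filenames(DEFAULT_PATTERN, rank, num_parts=2))
```

A custom pattern would be passed as `{"pattern": "..."}` through --model-loader-extra-config, keeping both placeholders so each rank can locate its own files.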

Weight-loading performance flags

Top-level arguments that tune how safetensors weights are read, independent of --load-format.

Flag	Description	Default
`--download-dir`	Directory used to download and cache Hugging Face model files.	HF default
`--weight-loader-disable-mmap`	Disable mmap while loading safetensors. Can help on filesystems where mmap is slow.	off
`--weight-loader-prefetch-checkpoints`	Prefetch checkpoint files into the OS page cache before loading. Each rank prefetches a fraction of the shards, cutting total network I/O on shared filesystems (NFS/Lustre) from N×checkpoint to 1×checkpoint. Recommended for models on network storage.	off
`--weight-loader-prefetch-num-threads`	Threads per rank for checkpoint prefetching.	4
`--weight-loader-drop-cache-after-load`	Call `posix_fadvise(DONTNEED)` on each safetensors shard after loading it, freeing page cache.	off
`--custom-weight-loader`	Import path(s) of a custom weight-loading function, e.g. `my_package.weight_load_func`.	—
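
The N×checkpoint to 1×checkpoint claim for checkpoint prefetching can be checked with simple arithmetic. The sketch below is a model of the idea, not SGLang's implementation. It assumes N is the number of ranks, that shards are split round-robin, and that all ranks share one page cache. The shard names and sizes are made up.

```python
def assign_shards(shards: list[str], num_ranks: int) -> dict[int, list[str]]:
    """Round-robin split of shard files across ranks (assumed scheme)."""
    return {rank: shards[rank::num_ranks] for rank in range(num_ranks)}


def network_bytes(shard_sizes: dict[str, int], num_ranks: int, prefetch: bool) -> int:
    """Total bytes pulled from shared storage under each strategy."""
    checkpoint_bytes = sum(shard_sizes.values())
    if not prefetch:
        # Without prefetching, every rank reads the whole checkpoint itself.
        return num_ranks * checkpoint_bytes
    # With prefetching, each shard is fetched once by the rank it is assigned to.
    plan = assign_shards(sorted(shard_sizes), num_ranks)
    return sum(shard_sizes[name] for names in plan.values() for name in names)


# Hypothetical checkpoint: 16 shards of 5 GiB each, loaded by 8 ranks.
sizes = {f"model-{i:05d}.safetensors": 5 * 1024**3 for i in range(16)}
without = network_bytes(sizes, num_ranks=8, prefetch=False)
with_prefetch = network_bytes(sizes, num_ranks=8, prefetch=True)
print(f"{without / 1024**3:.0f} GiB vs {with_prefetch / 1024**3:.0f} GiB")
```

Under these assumptions the saving scales with the number of ranks, which fits the recommendation to enable prefetching for models on network storage.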
