Loading Models from Object Storage#
SGLang supports direct loading of models from object storage (S3 and Google Cloud Storage) without requiring a full local download. This feature uses the runai_streamer load format to stream model weights directly from cloud storage, significantly reducing startup time and local storage requirements.
Overview#
When loading models from object storage, SGLang uses a two-phase approach:
Metadata Download (once, before process launch): Configuration files and tokenizer files are downloaded to a local cache
Weight Streaming (lazy, during model loading): Model weights are streamed directly from object storage as needed
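The two-phase flow can be sketched as follows. This is an illustrative model only, with hypothetical helper names (`download`, `stream`) — it is not SGLang's actual loader internals:

```python
# Illustrative sketch of the two-phase loading flow (hypothetical names,
# not SGLang's real implementation).

METADATA_FILES = ("config.json", "tokenizer.json", "tokenizer_config.json")

def load_from_object_storage(uri, download, stream):
    # Phase 1: small metadata files are downloaded once, up front,
    # into a local cache before the server processes launch.
    local_cache = {name: download(uri + name) for name in METADATA_FILES}

    # Phase 2: weights are streamed lazily while the model is built;
    # each file is fetched only when the loader iterates over it.
    def weight_iterator(weight_files):
        for f in weight_files:
            yield from stream(uri + f)

    return local_cache, weight_iterator
```

The key point is that only the lightweight metadata ever touches local disk eagerly; the bulk of the bytes (the weights) are pulled on demand.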
Supported Storage Backends#
Amazon S3:
s3://bucket-name/path/to/model/
Google Cloud Storage:
gs://bucket-name/path/to/model/
Azure Blob:
az://some-azure-container/path/
S3-compatible:
s3://bucket-name/path/to/model/
Quick Start#
Basic Usage#
Simply provide an object storage URI as the model path:
# S3
python -m sglang.launch_server \
--model-path s3://my-bucket/models/llama-3-8b/ \
--load-format runai_streamer
# Google Cloud Storage
python -m sglang.launch_server \
--model-path gs://my-bucket/models/llama-3-8b/ \
--load-format runai_streamer
Note: The --load-format runai_streamer flag is automatically detected when using object storage URIs, so you can omit it:
python -m sglang.launch_server \
--model-path s3://my-bucket/models/llama-3-8b/
With Tensor Parallelism#
python -m sglang.launch_server \
--model-path gs://my-bucket/models/llama-70b/ \
--tp 4 \
--model-loader-extra-config '{"distributed": true}'
Configuration#
Load Format#
The runai_streamer load format is designed specifically for object storage, SSDs, and shared file systems:
python -m sglang.launch_server \
--model-path s3://bucket/model/ \
--load-format runai_streamer
Extended Configuration Parameters#
Use --model-loader-extra-config to pass additional configuration as a JSON string:
python -m sglang.launch_server \
--model-path s3://bucket/model/ \
--model-loader-extra-config '{
"distributed": true,
"concurrency": 8,
"memory_limit": 2147483648
}'
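When the configuration is produced by a script, it can be safer to build the JSON programmatically than to hand-write it inside shell quotes. A small sketch (the keys are the ones documented below; `shlex.quote` handles shell escaping):

```python
import json
import shlex

# Build the --model-loader-extra-config value programmatically.
# Keys mirror the parameters documented in this section.
extra_config = {
    "distributed": True,
    "concurrency": 8,
    "memory_limit": 2 * 1024**3,  # 2 GiB, expressed in bytes
}

arg = json.dumps(extra_config)
cmd = (
    "python -m sglang.launch_server "
    "--model-path s3://bucket/model/ "
    f"--model-loader-extra-config {shlex.quote(arg)}"
)
```

This avoids subtle quoting bugs (unescaped quotes, trailing commas) that are easy to introduce when pasting JSON into a shell command.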
Available Parameters#
| Parameter | Type | Description | Default |
|---|---|---|---|
| distributed | bool | Enable distributed streaming for multi-GPU setups. | Auto-detected |
| concurrency | int | Number of concurrent download streams. Higher values can improve throughput for large models. | 4 |
| memory_limit | int | Memory limit (in bytes) for the streaming buffer. | System-dependent |
Performance Considerations#
Distributed Streaming#
For multi-GPU setups, enable distributed streaming to parallelize weight loading across processes:
python -m sglang.launch_server \
--model-path s3://bucket/model/ \
--tp 8 \
--model-loader-extra-config '{"distributed": true}'
Limitations#
Supported Formats: Currently only the .safetensors weight format is supported (this is also the recommended format).
Supported Devices: Distributed streaming is supported on CUDA-like devices; on other devices the loader falls back to non-distributed streaming.
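Given the .safetensors-only restriction, a quick pre-flight check on the file list can fail fast before launching the server. This is a hypothetical helper (`check_safetensors` is not part of SGLang):

```python
from pathlib import PurePosixPath

# Hypothetical pre-flight check: verify every weight file uses the
# supported .safetensors format before launching the server.
def check_safetensors(weight_files):
    bad = [f for f in weight_files if PurePosixPath(f).suffix != ".safetensors"]
    if bad:
        raise ValueError(f"unsupported weight files (only .safetensors is supported): {bad}")
```

Running such a check against the bucket listing catches, for example, a repository that only ships `pytorch_model.bin` shards before any streaming begins.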