Loading Models from Object Storage#
SGLang supports direct loading of models from object storage (S3 and Google Cloud Storage) without requiring a full local download. This feature uses the runai_streamer load format to stream model weights directly from cloud storage, significantly reducing startup time and local storage requirements.
Overview#
When loading models from object storage, SGLang uses a two-phase approach:
Metadata Download (once, before process launch): Configuration files and tokenizer files are downloaded to a local cache
Weight Streaming (lazy, during model loading): Model weights are streamed directly from object storage as needed
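The two-phase flow can be sketched as follows. This is an illustrative model only, with hypothetical helper names (`download`, `stream`) — it is not SGLang's actual loader internals:

```python
# Illustrative sketch of the two-phase loading flow (hypothetical names,
# not SGLang's real implementation).

METADATA_FILES = ("config.json", "tokenizer.json", "tokenizer_config.json")

def load_from_object_storage(uri, download, stream):
    # Phase 1: small metadata files are downloaded once, up front,
    # into a local cache before the server processes launch.
    local_cache = {name: download(uri + name) for name in METADATA_FILES}

    # Phase 2: weights are streamed lazily while the model is built;
    # each file is fetched only when the loader iterates over it.
    def weight_iterator(weight_files):
        for f in weight_files:
            yield from stream(uri + f)

    return local_cache, weight_iterator
```

The key point is that only the lightweight metadata ever touches local disk eagerly; the bulk of the bytes (the weights) are pulled on demand.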
Supported Storage Backends#
Amazon S3:
s3://bucket-name/path/to/model/
Google Cloud Storage:
gs://bucket-name/path/to/model/
Azure Blob:
az://some-azure-container/path/
S3-compatible:
s3://bucket-name/path/to/model/
Quick Start#
Basic Usage#
Simply provide an object storage URI as the model path:
# S3
python -m sglang.launch_server \
--model-path s3://my-bucket/models/llama-3-8b/ \
--load-format runai_streamer
# Google Cloud Storage
python -m sglang.launch_server \
--model-path gs://my-bucket/models/llama-3-8b/ \
--load-format runai_streamer
Note: The --load-format runai_streamer flag is automatically detected when using object storage URIs, so you can omit it:
python -m sglang.launch_server \
--model-path s3://my-bucket/models/llama-3-8b/
With Tensor Parallelism#
python -m sglang.launch_server \
--model-path gs://my-bucket/models/llama-70b/ \
--tp 4 \
--model-loader-extra-config '{"distributed": true}'
Configuration#
Load Format#
The runai_streamer load format is designed specifically for object storage, SSDs, and shared file systems:
python -m sglang.launch_server \
--model-path s3://bucket/model/ \
--load-format runai_streamer
Extended Configuration Parameters#
Use --model-loader-extra-config to pass additional configuration as a JSON string:
python -m sglang.launch_server \
--model-path s3://bucket/model/ \
--model-loader-extra-config '{
"distributed": true,
"concurrency": 8,
"memory_limit": 2147483648
}'
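When the configuration is produced by a script, it can be safer to build the JSON programmatically than to hand-write it inside shell quotes. A small sketch (the keys are the ones documented below; `shlex.quote` handles shell escaping):

```python
import json
import shlex

# Build the --model-loader-extra-config value programmatically.
# Keys mirror the parameters documented in this section.
extra_config = {
    "distributed": True,
    "concurrency": 8,
    "memory_limit": 2 * 1024**3,  # 2 GiB, expressed in bytes
}

arg = json.dumps(extra_config)
cmd = (
    "python -m sglang.launch_server "
    "--model-path s3://bucket/model/ "
    f"--model-loader-extra-config {shlex.quote(arg)}"
)
```

This avoids subtle quoting bugs (unescaped quotes, trailing commas) that are easy to introduce when pasting JSON into a shell command.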
Available Parameters#
| Parameter | Type | Description | Default |
|---|---|---|---|
| distributed | bool | Enable distributed streaming for multi-GPU setups. | Auto-detected |
| concurrency | int | Number of concurrent download streams. Higher values can improve throughput for large models. | 4 |
| memory_limit | int | Memory limit (in bytes) for the streaming buffer. | System-dependent |
Performance Considerations#
Distributed Streaming#
For multi-GPU setups, enable distributed streaming to parallelize weight loading across processes:
python -m sglang.launch_server \
--model-path s3://bucket/model/ \
--tp 8 \
--model-loader-extra-config '{"distributed": true}'
Limitations#
Supported Formats: Currently only the .safetensors weight format is supported (this is also the recommended format).
Supported Devices: Distributed streaming is supported on CUDA-like devices; on other devices the loader falls back to non-distributed streaming.
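Given the .safetensors-only restriction, a quick pre-flight check on the file list can fail fast before launching the server. This is a hypothetical helper (`check_safetensors` is not part of SGLang):

```python
from pathlib import PurePosixPath

# Hypothetical pre-flight check: verify every weight file uses the
# supported .safetensors format before launching the server.
def check_safetensors(weight_files):
    bad = [f for f in weight_files if PurePosixPath(f).suffix != ".safetensors"]
    if bad:
        raise ValueError(f"unsupported weight files (only .safetensors is supported): {bad}")
```

Running such a check against the bucket listing catches, for example, a repository that only ships `pytorch_model.bin` shards before any streaming begins.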