SGLang supports direct loading of models from object storage (S3 and Google Cloud Storage) without requiring a full local download. This feature uses the runai_streamer load format to stream model weights directly from cloud storage, significantly reducing startup time and local storage requirements.

Overview

When loading models from object storage, SGLang uses a two-phase approach:
  1. Metadata Download (once, before process launch): Configuration files and tokenizer files are downloaded to a local cache
  2. Weight Streaming (lazy, during model loading): Model weights are streamed directly from object storage as needed
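The two-phase flow can be sketched in Python. This is an illustration only, not SGLang's actual internals; the helper names (`download_metadata`, `stream_weights`) and the in-memory dict standing in for a bucket are assumptions:

```python
import tempfile
from pathlib import Path

# Phase 1: metadata download (hypothetical helper; SGLang's real internals
# differ). Small files such as config.json and tokenizer files are copied
# to a local cache before the server process launches.
def download_metadata(bucket_files: dict, cache_dir: Path) -> Path:
    """Copy config/tokenizer files (an in-memory dict stands in for
    object storage here) into a local cache directory."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for name, content in bucket_files.items():
        if name.endswith((".json", ".model", ".txt")):  # metadata only
            (cache_dir / name).write_text(content)
    return cache_dir

# Phase 2: lazy weight streaming. Weight shards are yielded on demand
# instead of being fully downloaded up front.
def stream_weights(weight_shards: list):
    for shard in weight_shards:  # e.g. model-00001-of-00004.safetensors
        yield shard              # fetched from object storage as needed

if __name__ == "__main__":
    cache = download_metadata(
        {"config.json": '{"hidden_size": 4096}', "model.safetensors": ""},
        Path(tempfile.mkdtemp()) / "llama-3-8b",
    )
    print(sorted(p.name for p in cache.iterdir()))  # only metadata is cached
```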

Supported Storage Backends

  1. Amazon S3: s3://bucket-name/path/to/model/
  2. Google Cloud Storage: gs://bucket-name/path/to/model/
  3. Azure Blob Storage: az://some-azure-container/path/
  4. S3-compatible storage: s3://bucket-name/path/to/model/

Quick Start

Basic Usage

Simply provide an object storage URI as the model path:
# S3
python -m sglang.launch_server \
  --model-path s3://my-bucket/models/llama-3-8b/ \
  --load-format runai_streamer

# Google Cloud Storage
python -m sglang.launch_server \
  --model-path gs://my-bucket/models/llama-3-8b/ \
  --load-format runai_streamer
Note: The runai_streamer load format is automatically detected for object storage URIs, so the --load-format flag can be omitted:
python -m sglang.launch_server \
  --model-path s3://my-bucket/models/llama-3-8b/

With Tensor Parallelism

python -m sglang.launch_server \
  --model-path gs://my-bucket/models/llama-70b/ \
  --tp 4 \
  --model-loader-extra-config '{"distributed": true}'

Configuration

Load Format

The runai_streamer load format is designed for object storage, SSDs, and shared file systems:
python -m sglang.launch_server \
  --model-path s3://bucket/model/ \
  --load-format runai_streamer

Extended Configuration Parameters

Use --model-loader-extra-config to pass additional configuration as a JSON string:
python -m sglang.launch_server \
  --model-path s3://bucket/model/ \
  --model-loader-extra-config '{
    "distributed": true,
    "concurrency": 8,
    "memory_limit": 2147483648
  }'
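If you assemble the launch command programmatically, the same configuration can be built as a dict and serialized, since the flag expects a JSON string. A sketch (the dict values mirror the example above; 2147483648 bytes is 2 GiB):

```python
import json

# Build the extra config as a dict, then serialize it for the
# --model-loader-extra-config flag, which takes a JSON string.
extra_config = {
    "distributed": True,
    "concurrency": 8,
    "memory_limit": 2 * 1024**3,  # 2 GiB = 2147483648 bytes
}

flag_value = json.dumps(extra_config)
print(flag_value)
```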

Available Parameters

  • distributed (bool, default: auto-detected): Enable distributed streaming for multi-GPU setups. Automatically set to true for object storage paths on CUDA-like devices.
  • concurrency (int, default: 4): Number of concurrent download streams. Higher values can improve throughput for large models.
  • memory_limit (int, default: system-dependent): Memory limit, in bytes, for the streaming buffer.

Performance Considerations

Distributed Streaming

For multi-GPU setups, enable distributed streaming to parallelize weight loading across processes:
python -m sglang.launch_server \
  --model-path s3://bucket/model/ \
  --tp 8 \
  --model-loader-extra-config '{"distributed": true}'
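Conceptually, distributed streaming lets each tensor-parallel rank stream only a portion of the weight files instead of every rank downloading everything. The round-robin assignment below is purely illustrative, not SGLang's actual sharding scheme:

```python
def assign_shards(shard_files: list[str], rank: int, world_size: int) -> list[str]:
    """Illustrative round-robin assignment of weight files to TP ranks."""
    return [f for i, f in enumerate(shard_files) if i % world_size == rank]

# Eight safetensors shards spread over 4 ranks (--tp 4): each rank
# streams two files rather than all eight.
shards = [f"model-{i:05d}-of-00008.safetensors" for i in range(1, 9)]
for rank in range(4):
    print(rank, assign_shards(shards, rank, 4))
```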

Limitations

  • Supported Formats: Only the .safetensors weight format is currently supported (it is also the recommended format)
  • Supported Devices: Distributed streaming is supported only on CUDA-like devices; other devices fall back to non-distributed streaming
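Because only .safetensors is supported, it can be worth checking a model directory before uploading it to object storage. A small sketch (the helper name is hypothetical):

```python
import tempfile
from pathlib import Path

def list_streamable_weights(model_dir: Path) -> list[str]:
    """Return weight files in the only supported format (.safetensors)."""
    return sorted(p.name for p in model_dir.glob("*.safetensors"))

# Example: a directory with mixed formats; only the .safetensors shards
# would be streamable, the .bin file would not.
d = Path(tempfile.mkdtemp())
for name in ["model-00001-of-00002.safetensors",
             "model-00002-of-00002.safetensors",
             "pytorch_model.bin"]:
    (d / name).touch()
print(list_streamable_weights(d))
```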

See Also