SGLang supports the `runai_streamer` load format to stream model weights directly from cloud storage, significantly reducing startup time and local storage requirements.
## Overview
When loading models from object storage, SGLang uses a two-phase approach:

- **Metadata Download** (once, before process launch): configuration files and tokenizer files are downloaded to a local cache
- **Weight Streaming** (lazy, during model loading): model weights are streamed directly from object storage as needed
## Supported Storage Backends

- Amazon S3: `s3://bucket-name/path/to/model/`
- Google Cloud Storage: `gs://bucket-name/path/to/model/`
- Azure Blob Storage: `az://some-azure-container/path/`
- S3-compatible storage: `s3://bucket-name/path/to/model/`
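For S3-compatible stores, the streamer is typically pointed at the custom endpoint through the standard AWS environment variables. This is a sketch: the endpoint and credentials below are placeholders, and the exact variables honored may vary by streamer version.

```shell
# Credentials and endpoint for an S3-compatible store (e.g. MinIO);
# all values below are placeholders
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_ENDPOINT_URL=https://storage.example.com

python3 -m sglang.launch_server \
  --model-path s3://bucket-name/path/to/model/
```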
## Quick Start

### Basic Usage
Simply provide an object storage URI as the model path. `--load-format runai_streamer` is detected automatically when an object storage URI is used, so you can omit it.
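For example, assuming a model stored under a placeholder bucket, a launch might look like this:

```shell
# Stream weights directly from S3 (the bucket/path is a placeholder)
python3 -m sglang.launch_server \
  --model-path s3://my-bucket/my-model/ \
  --load-format runai_streamer

# Equivalent: runai_streamer is auto-detected for object storage URIs
python3 -m sglang.launch_server \
  --model-path s3://my-bucket/my-model/
```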
### With Tensor Parallelism
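Weight streaming composes with tensor parallelism like any other load format. A sketch, assuming SGLang's `--tp-size` flag and a placeholder bucket:

```shell
# Stream weights from S3 across 4 tensor-parallel GPUs
python3 -m sglang.launch_server \
  --model-path s3://my-bucket/my-model/ \
  --load-format runai_streamer \
  --tp-size 4
```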
## Configuration

### Load Format
The `runai_streamer` load format is specifically designed for object storage, SSDs, and shared file systems.
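For local or shared file system paths the format is not auto-detected, so it has to be requested explicitly. A sketch with a placeholder path:

```shell
# Stream weights from a shared file system (path is a placeholder);
# --load-format must be given explicitly for non-object-storage paths
python3 -m sglang.launch_server \
  --model-path /mnt/shared/models/my-model \
  --load-format runai_streamer
```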
### Extended Configuration Parameters
Use `--model-loader-extra-config` to pass additional configuration as a JSON string.
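For instance, the available parameters can be combined into a single JSON string. The values here are illustrative, not tuned recommendations, and the bucket path is a placeholder:

```shell
# Raise download concurrency, cap the streaming buffer at 10 GiB,
# and force distributed streaming (illustrative values)
python3 -m sglang.launch_server \
  --model-path s3://my-bucket/my-model/ \
  --load-format runai_streamer \
  --model-loader-extra-config '{"concurrency": 16, "memory_limit": 10737418240, "distributed": true}'
```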
### Available Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| `distributed` | bool | Enable distributed streaming for multi-GPU setups. Automatically set to `true` for object storage paths on CUDA-like devices. | Auto-detected |
| `concurrency` | int | Number of concurrent download streams. Higher values can improve throughput for large models. | 4 |
| `memory_limit` | int | Memory limit (in bytes) for the streaming buffer. | System-dependent |
## Performance Considerations

### Distributed Streaming
For multi-GPU setups, enable distributed streaming to parallelize weight loading across the processes.

## Limitations

- **Supported formats**: Currently only the `.safetensors` weight format is supported (this is also the recommended format)
- **Supported devices**: Distributed streaming is supported on CUDA-like devices; otherwise, it falls back to non-distributed streaming
