SGLang Diffusion CLI#
Use the CLI for one-off generation with sglang generate or to start a persistent HTTP server with sglang serve.
Overlay repos for non-diffusers models#
If --model-path points to a supported non-diffusers source repo, SGLang can resolve it
through a self-hosted overlay repo.
SGLang first checks a built-in overlay registry. Concrete built-in mappings can be added over time without changing the CLI surface.
Override example:
export SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY='{
  "Wan-AI/Wan2.2-S2V-14B": {
    "overlay_repo_id": "your-org/Wan2.2-S2V-14B-overlay",
    "overlay_revision": "main"
  }
}'
sglang generate \
--model-path Wan-AI/Wan2.2-S2V-14B \
--config configs/wan_s2v.yaml
The overlay repo should be a complete diffusers-style, componentized repo. You can also pass the overlay repo itself as `--model-path` if it contains `_overlay/overlay_manifest.json`.
Notes:

- `SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY` is only an optional override for development and debugging. It accepts either a JSON object or a path to a JSON file, and can extend or replace built-in entries for the current process.
- On the first load, SGLang will:
  - download overlay metadata from the overlay repo
  - download the required files from the original source repo
  - materialize a local standard component repo under `~/.cache/sgl_diffusion/materialized_models/`
- Later loads reuse the materialized local repo; the materialized repo is what the runtime loads as a normal componentized model directory.
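Since the registry override accepts a file path as well as inline JSON, it can also be kept in a file. A minimal sketch (the file name and location are illustrative; repo IDs are the same example values as above):

```shell
# Write the overlay registry to a file and point the override at it.
cat > overlay_registry.json <<'EOF'
{
  "Wan-AI/Wan2.2-S2V-14B": {
    "overlay_repo_id": "your-org/Wan2.2-S2V-14B-overlay",
    "overlay_revision": "main"
  }
}
EOF
export SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY=$PWD/overlay_registry.json
```

The file form is convenient when the same registry is shared across shells or checked into a development setup.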
Quick Start#
Generate#
sglang generate \
--model-path Qwen/Qwen-Image \
--prompt "A beautiful sunset over the mountains" \
--save-output
Serve#
sglang serve \
--model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
--num-gpus 4 \
--ulysses-degree 2 \
--ring-degree 2 \
--port 30010
For request and response examples, see OpenAI-Compatible API.
Tip
Use sglang generate --help and sglang serve --help for the full argument list. The CLI help output is the source of truth for exhaustive flags.
Common Options#
Model and runtime#
- `--model-path {MODEL}`: model path or Hugging Face model ID
- `--lora-path {PATH}` and `--lora-nickname {NAME}`: load a LoRA adapter
- `--num-gpus {N}`: number of GPUs to use
- `--tp-size {N}`: tensor parallelism size, mainly for encoders
- `--sp-degree {N}`: sequence parallelism size
- `--ulysses-degree {N}` and `--ring-degree {N}`: USP parallelism controls
- `--attention-backend {BACKEND}`: attention backend for native SGLang pipelines
- `--attention-backend-config {CONFIG}`: attention backend configuration
Sampling and output#
- `--prompt {PROMPT}` and `--negative-prompt {PROMPT}`
- `--num-inference-steps {STEPS}` and `--seed {SEED}`
- `--height {HEIGHT}`, `--width {WIDTH}`, `--num-frames {N}`, `--fps {FPS}`
- `--output-path {PATH}`, `--output-file-name {NAME}`, `--save-output`, `--return-frames`
For frame interpolation and upscaling, see Post-Processing.
Quantized transformers#
For quantized transformer checkpoints, prefer:
- `--model-path` for the base pipeline
- `--transformer-path` for a quantized `transformer` component folder
- `--transformer-weights-path` for a quantized safetensors file, directory, or repo
See Quantization for supported quantization families and examples.
Configuration Files#
Use --config to load JSON or YAML configuration. Command-line flags override values from the config file.
sglang generate --config config.yaml
Example:
model_path: FastVideo/FastHunyuan-diffusers
prompt: A beautiful woman in a red dress walking down a street
output_path: outputs/
num_gpus: 2
sp_size: 2
tp_size: 1
num_frames: 45
height: 720
width: 1280
num_inference_steps: 6
seed: 1024
fps: 24
precision: bf16
vae_precision: fp16
vae_tiling: true
vae_sp: true
enable_torch_compile: false
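Because `--config` also accepts JSON, the same settings can be written as a JSON file. An abridged equivalent of a few of the keys above:

```json
{
  "model_path": "FastVideo/FastHunyuan-diffusers",
  "prompt": "A beautiful woman in a red dress walking down a street",
  "num_gpus": 2,
  "num_inference_steps": 6,
  "seed": 1024,
  "precision": "bf16"
}
```

Command-line flags still take precedence: for example, adding `--seed 42` on the command line overrides `"seed": 1024` from this file.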
Generate#
sglang generate runs a single generation job and exits when the job finishes.
sglang generate \
--model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
--text-encoder-cpu-offload \
--pin-cpu-memory \
--num-gpus 4 \
--ulysses-degree 2 \
--ring-degree 2 \
--prompt "A curious raccoon" \
--save-output \
--output-path outputs \
--output-file-name "a-curious-raccoon.mp4"
Note
HTTP server-only arguments are ignored by sglang generate.
For diffusers pipelines, Cache-DiT can be enabled with SGLANG_CACHE_DIT_ENABLED=true or --cache-dit-config. See Cache-DiT.
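A minimal way to try Cache-DiT is the environment variable named above; this only toggles the feature on, while model-specific tuning goes through `--cache-dit-config`:

```shell
# Enable Cache-DiT for diffusers pipelines in this shell session.
export SGLANG_CACHE_DIT_ENABLED=true
```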
Serve#
sglang serve starts the HTTP server and keeps the model loaded for repeated requests.
sglang serve \
--model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
--text-encoder-cpu-offload \
--pin-cpu-memory \
--num-gpus 4 \
--ulysses-degree 2 \
--ring-degree 2 \
--port 30010
Cloud Storage#
SGLang Diffusion can upload generated images and videos to S3-compatible object storage after generation.
export SGLANG_CLOUD_STORAGE_TYPE=s3
export SGLANG_S3_BUCKET_NAME=my-bucket
export SGLANG_S3_ACCESS_KEY_ID=your-access-key
export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key
export SGLANG_S3_ENDPOINT_URL=https://minio.example.com
See Environment Variables for the full set of storage options.
Component Path Overrides#
Override individual pipeline components such as vae, transformer, or text_encoder with --<component>-path.
sglang serve \
--model-path black-forest-labs/FLUX.2-dev \
--vae-path fal/FLUX.2-Tiny-AutoEncoder
The component key must match the key in the model’s model_index.json, and the path must be either a Hugging Face repo ID or a complete component directory.
Diffusers Backend#
Use --backend diffusers to force vanilla diffusers pipelines when no native SGLang implementation exists or when a model requires a custom pipeline class.
Key Options#
| Argument | Values | Description |
|---|---|---|
| `--backend` | `auto`, `sglang`, `diffusers` | Choose native SGLang, force native, or force diffusers |
| `--diffusers-attention-backend` | backend name | Attention backend for diffusers pipelines |
| `--trust-remote-code` | flag | Required for models with custom pipeline classes |
| `--vae-tiling` | flag | Lower memory usage for VAE decode |
| `--precision`, `--vae-precision` | e.g. `bf16`, `fp16` | Precision controls |
| `--enable-torch-compile` | flag | Enable `torch.compile` |
| `--cache-dit-config` | path | Cache-DiT config for diffusers pipelines |
Example#
sglang generate \
--model-path AIDC-AI/Ovis-Image-7B \
--backend diffusers \
--trust-remote-code \
--diffusers-attention-backend flash \
--prompt "A serene Japanese garden with cherry blossoms" \
--height 1024 \
--width 1024 \
--num-inference-steps 30 \
--save-output \
--output-path outputs \
--output-file-name ovis_garden.png
For pipeline-specific arguments not exposed in the CLI, pass diffusers_kwargs in a config file.
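For example, extra pipeline arguments can be nested under `diffusers_kwargs` in a YAML config; the keys shown here (`guidance_scale`, `max_sequence_length`) are common diffusers pipeline parameters and are illustrative, not required:

```yaml
model_path: AIDC-AI/Ovis-Image-7B
backend: diffusers
prompt: A serene Japanese garden with cherry blossoms
diffusers_kwargs:
  guidance_scale: 5.0
  max_sequence_length: 512
```

Keys under `diffusers_kwargs` are forwarded to the pipeline call, so consult the specific pipeline's documentation for which arguments it accepts.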