sglang generate or to start a persistent HTTP server with sglang serve.
Overlay repos for non-diffusers models
If --model-path points to a supported non-diffusers source repo, SGLang can resolve it
through a self-hosted overlay repo.
SGLang first checks a built-in overlay registry. Concrete built-in mappings can be added over time without changing the CLI surface.
Override example:
--model-path if it contains _overlay/overlay_manifest.json.
Notes:
- SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY is only an optional override for development and debugging. It accepts either a JSON object or a path to a JSON file, and can extend or replace built-in entries for the current process.
- On the first load, SGLang will:
  - download overlay metadata from the overlay repo
  - download the required files from the original source repo
  - materialize a local standard component repo under ~/.cache/sgl_diffusion/materialized_models/
- Later loads reuse the materialized local repo. The materialized repo is what the runtime loads as a normal componentized model directory.
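The two accepted forms of SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY (inline JSON object or path to a JSON file) can be illustrated with a small Python sketch. This is not SGLang's loader: the helper name load_overlay_registry, the leading-brace rule for telling JSON from a path, and the shape of the registry entry are all assumptions made for illustration.

```python
import json
import os
from pathlib import Path


def load_overlay_registry(value: str) -> dict:
    """Parse an override given as inline JSON or as a path to a JSON file.

    Hypothetical helper: SGLang's actual parsing rules may differ.
    """
    text = value.strip()
    if text.startswith("{"):
        # Inline form: the variable holds the JSON object itself.
        registry = json.loads(text)
    else:
        # Path form: the variable names a JSON file on disk.
        registry = json.loads(Path(text).expanduser().read_text(encoding="utf-8"))
    if not isinstance(registry, dict):
        raise ValueError("overlay registry must be a JSON object")
    return registry


# Made-up mapping; the real entry shape is not documented here.
os.environ["SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY"] = json.dumps(
    {"example-org/source-model": "example-org/source-model-overlay"}
)
registry = load_overlay_registry(os.environ["SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY"])
print(registry)
```

Because the variable applies only to the current process, setting it inside a launcher script like this keeps the override out of the shell environment.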
Quick Start
Generate
Serve
Common Options
Model and runtime
- --model-path {MODEL}: model path or Hugging Face model ID
- --lora-path {PATH} and --lora-nickname {NAME}: load a LoRA adapter
- --lora-merge-mode {auto|merge|dynamic}: choose how LoRA is applied. auto statically merges regular weights and uses dynamic LoRA for FSDP-sharded weights to avoid full-gather peaks.
- --num-gpus {N}: number of GPUs to use
- --performance-mode {manual|auto|speed|memory} / --mode: preset for latency/throughput and memory defaults. auto is the default and keeps safe offload defaults, using FSDP only for validated DiT-offload replacement paths; use manual to keep performance-related server args under explicit user control. Explicit offload, FSDP, and parallelism flags take precedence in all modes.
- --tp-size {N}: tensor parallelism size, mainly for encoders
- --sp-degree {N}: sequence parallelism size
- --ulysses-degree {N} and --ring-degree {N}: USP parallelism controls
- --enable-cfg-parallel {true|false}: enable or explicitly disable CFG parallelism
- --attention-backend {BACKEND}: attention backend for native SGLang pipelines
- --component-attention-backends {MAP}: per-component attention backend overrides, for example text_encoder=torch_sdpa,transformer=fa
- --attention-backend-config {CONFIG}: attention backend configuration
Sampling and output
- --prompt {PROMPT} and --negative-prompt {PROMPT}
- --image-path {PATH} [{PATH} ...]: input image(s) for image-to-video or image-to-image generation
- --num-inference-steps {STEPS} and --seed {SEED}
- --height {HEIGHT}, --width {WIDTH}, --num-frames {N}, --fps {FPS}
- --output-path {PATH}, --output-file-name {NAME}, --save-output, --return-frames
Quantized transformers
For quantized transformer checkpoints, prefer:
- --model-path for the base pipeline
- --transformer-path for a quantized transformers transformer component folder
- --transformer-weights-path for a quantized safetensors file, directory, or repo
- --quantization for online quantization (quantization is applied to unquantized models at load time; activations are quantized dynamically)
- --quantization-ignored-layers for layer name patterns to keep unquantized (e.g. attention.to_)
Configuration Files
Use --config to load JSON or YAML configuration. Command-line flags override values from the config file.
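The precedence rule can be sketched in Python: values from the config file form the base, and any flag given on the command line replaces the matching key. The helper name merge_config, the underscore-style key names, and the convention that None marks an unset flag are assumptions for illustration, not SGLang's code.

```python
import json


def merge_config(file_values: dict, cli_values: dict) -> dict:
    """Return file values with explicitly set CLI values layered on top."""
    merged = dict(file_values)
    for key, value in cli_values.items():
        # Assumption: None marks a flag the user did not pass.
        if value is not None:
            merged[key] = value
    return merged


# Contents a JSON config file might hold (illustrative values).
file_values = json.loads('{"num_inference_steps": 50, "seed": 42, "height": 720}')
# Flags parsed from the command line; height was not given.
cli_values = {"num_inference_steps": 30, "height": None}

merged = merge_config(file_values, cli_values)
print(merged)
```

Here the step count comes from the command line, while seed and height keep their config-file values.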
Generate
sglang generate runs a single generation job and exits when the job finishes.
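A one-off job can also be launched from a script. The sketch below only assembles and prints the command line, using flags listed under Common Options. The model ID, prompt, and output path are placeholders, and the subprocess call is left commented out so that nothing runs without a real model.

```python
import shlex

# Placeholder values; replace with a real model ID and paths.
args = [
    "sglang", "generate",
    "--model-path", "your-org/your-model",
    "--prompt", "a lighthouse at dusk",
    "--num-inference-steps", "30",
    "--seed", "0",
    "--output-path", "outputs",
]

# shlex.join quotes arguments so the line can be pasted into a shell.
command = shlex.join(args)
print(command)

# To actually run the job (blocks until generation finishes):
# import subprocess
# subprocess.run(args, check=True)
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems with prompts that contain spaces or punctuation.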
HTTP server-only arguments are ignored by sglang generate.
SGLANG_CACHE_DIT_ENABLED=true or --cache-dit-config. See Cache-DiT.
Layerwise Offload
Use layerwise offload when a large component does not fit comfortably in GPU memory. By default, --dit-layerwise-offload only applies to legacy DiT components. Use --layerwise-offload-components to select pipeline component names explicitly (--layerwise-offload-modules is accepted as an alias):
pipeline.modules, such as transformer, text_encoder, image_encoder, vae, condition_image_encoder, spatial_upsampler, or vocoder. Use all to select every layerwise-offloadable component. Prefer the smallest component set that solves the memory issue because layerwise offload can increase latency.
Serve
sglang serve starts the HTTP server and keeps the model loaded for repeated requests.
Cloud Storage
SGLang Diffusion can upload generated images and videos to S3-compatible object storage after generation.
Component Path Overrides
Override individual pipeline components such as vae, transformer, or text_encoder with --<component>-path.
model_index.json, and the path must be either a Hugging Face repo ID or a complete component directory.
Component Attention Backend Overrides
Use --component-attention-backends when one pipeline component needs a different native attention backend from the global --attention-backend.
text_encoder, text_encoder_2, transformer, transformer_2, or connectors. Component overrides take precedence over the global --attention-backend only while that component is being constructed.
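The override format and its precedence over the global backend can be sketched as follows. This is an illustration of the component=backend map shown under Common Options, not SGLang's parser; both helper names are hypothetical.

```python
def parse_component_backends(spec: str) -> dict:
    """Parse 'name=backend,name=backend' into a dict (hypothetical helper)."""
    overrides = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, sep, backend = entry.partition("=")
        if not sep or not name.strip() or not backend.strip():
            raise ValueError(f"expected component=backend, got {entry!r}")
        overrides[name.strip()] = backend.strip()
    return overrides


def backend_for(component: str, overrides: dict, global_backend: str) -> str:
    """Use the component override if present, else the global backend."""
    return overrides.get(component, global_backend)


overrides = parse_component_backends("text_encoder=torch_sdpa,transformer=fa")
# text_encoder has an override; text_encoder_2 falls back to the global value.
print(backend_for("text_encoder", overrides, "fa"))
print(backend_for("text_encoder_2", overrides, "fa"))
```

Components without an entry in the map keep whatever --attention-backend selects.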
You can also pass dotted CLI entries:
Diffusers Backend
Use --backend diffusers to force vanilla diffusers pipelines when no native SGLang implementation exists or when a model requires a custom pipeline class.
Key Options
| Argument | Values | Description |
|---|---|---|
| --backend | auto, sglang, diffusers | Choose native SGLang, force native, or force diffusers |
| --diffusers-attention-backend | flash, _flash_3_hub, sage, xformers, native | Attention backend for diffusers pipelines |
| --trust-remote-code | flag | Required for models with custom pipeline classes |
| --vae-tiling and --vae-slicing | flag | Lower memory usage for VAE decode |
| --dit-precision and --vae-precision | fp16, bf16, fp32 | Precision controls |
| --enable-torch-compile | flag | Enable torch.compile |
| --cache-dit-config | {PATH} | Cache-DiT config for diffusers pipelines |
Example
diffusers_kwargs in a config file.