CLI reference - SGLang Documentation

Use the CLI for one-off generation with sglang generate or to start a persistent HTTP server with sglang serve.

Overlay repos for non-diffusers models

If --model-path points to a supported non-diffusers source repo, SGLang can resolve it through a self-hosted overlay repo. SGLang first checks a built-in overlay registry. Concrete built-in mappings can be added over time without changing the CLI surface. Override example:

Command

export SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY='{
  "Wan-AI/Wan2.2-S2V-14B": {
    "overlay_repo_id": "your-org/Wan2.2-S2V-14B-overlay",
    "overlay_revision": "main"
  }
}'

sglang generate \
  --model-path Wan-AI/Wan2.2-S2V-14B \
  --config configs/wan_s2v.yaml

The overlay repo should be a complete diffusers-style/componentized repo. You can also pass the overlay repo itself as --model-path if it contains _overlay/overlay_manifest.json.

Notes:

- SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY is an optional override intended for development and debugging. It accepts either a JSON object or a path to a JSON file, and can extend or replace built-in entries for the current process.
- On the first load, SGLang will:
  - download overlay metadata from the overlay repo
  - download the required files from the original source repo
  - materialize a local standard component repo under ~/.cache/sgl_diffusion/materialized_models/
- Later loads reuse the materialized local repo. The materialized repo is what the runtime loads as a normal componentized model directory.
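
Because the variable also accepts a path to a JSON file, the override can be kept in a file rather than inline in the shell. The following is a minimal sketch that reuses the entry from the inline example; the file location is an illustrative choice, not an SGLang default, and your-org/Wan2.2-S2V-14B-overlay remains a placeholder repo ID.

```shell
# Keep the overlay override in a file (illustrative location, not an SGLang default).
registry_file="${TMPDIR:-/tmp}/sglang_overlay_registry.json"

cat > "$registry_file" <<'EOF'
{
  "Wan-AI/Wan2.2-S2V-14B": {
    "overlay_repo_id": "your-org/Wan2.2-S2V-14B-overlay",
    "overlay_revision": "main"
  }
}
EOF

# Point the variable at the file instead of embedding the JSON object.
export SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY="$registry_file"

# Exported variables are only seen by processes started from this shell.
echo "overlay registry: $SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY"
```

A file is easier to review and share than a multi-line shell string, and unsetting the variable returns SGLang to the built-in registry.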

Quick Start

Generate

Command

sglang generate \
  --model-path Qwen/Qwen-Image \
  --prompt "A beautiful sunset over the mountains" \
  --save-output

Serve

Command

sglang serve \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --num-gpus 4 \
  --ulysses-degree 2 \
  --ring-degree 2 \
  --port 30010

For request and response examples, see OpenAI-Compatible API.

Use sglang generate --help and sglang serve --help for the full argument list. The CLI help output is the source of truth for exhaustive flags.

Common Options

Model and runtime

--model-path {MODEL}: model path or Hugging Face model ID
--lora-path {PATH} and --lora-nickname {NAME}: load a LoRA adapter
--lora-merge-mode {auto|merge|dynamic}: choose how LoRA is applied. auto statically merges regular weights and uses dynamic LoRA for FSDP-sharded weights to avoid full-gather peaks.
--num-gpus {N}: number of GPUs to use
--performance-mode {manual|auto|speed|memory} / --mode: preset for latency/throughput and memory defaults. auto is the default and keeps safe offload defaults, using FSDP only for validated DiT-offload replacement paths; speed also enables --enable-torch-compile by default unless you explicitly disable it. Use manual to keep performance-related server args under explicit user control. Explicit offload, FSDP, and parallelism flags take precedence in all modes.
--tp-size {N}: tensor parallelism size, mainly for encoders
--sp-degree {N}: sequence parallelism size
--ulysses-degree {N} and --ring-degree {N}: USP parallelism controls
--enable-cfg-parallel {true|false}: enable or explicitly disable CFG parallelism
--warmup-mode {off|request|server}: control startup warmup for sglang serve; off skips warmup, request primes the request path, and server runs a full synthetic server warmup before serving traffic
--enable-torch-compile {true|false}: compile native diffusion hot paths. When no warmup mode is configured, this also enables server warmup so first real requests do not pay compile latency.
--offload-during-compile {true|false}: when compile warmup is active, temporarily layerwise-offload DiT weights and move resident non-DiT components off-device so max-autotune fits on tighter-memory GPUs; the configured serving residency is restored before real traffic. Skipped under existing layerwise offload, Cache-DiT, or FSDP.
--enable-breakable-cuda-graph {true|false}: capture supported DiT forwards as breakable CUDA graph segments to reduce launch overhead. Requires --warmup-resolutions for every served resolution because each resolution is captured separately.
--bcg-text-buckets {N...}: prompt-length padding buckets for breakable CUDA graph capture/replay reuse.
--attention-backend {BACKEND}: attention backend for native SGLang and diffusers pipelines
--component-attention-backends {MAP}: per-component attention backend overrides, for example text_encoder=torch_sdpa,transformer=fa
--attention-backend-config {CONFIG}: attention backend configuration
--srt-encoder-url {HTTPADDRESS}: address of the SGLang SRT server that hosts the AR model, for GLM-Image-like models
--srt-encoder-timeout {SECONDS}: timeout in seconds for HTTP requests to the SGLang encoder server
--srt-encoder-connection-timeout {SECONDS}: TCP connection timeout in seconds for the SGLang encoder server
--pe-server-url {HTTPADDRESS}: URL of the SGLang server hosting the PE model (e.g., for ERNIE-Image)

Sampling and output

--prompt {PROMPT} and --negative-prompt {PROMPT}
--image-path {PATH} [{PATH} ...]: input image(s) for image-to-video or image-to-image generation
--num-inference-steps {STEPS} and --seed {SEED}
--height {HEIGHT}, --width {WIDTH}, --num-frames {N}, --fps {FPS}
--output-path {PATH}, --output-file-name {NAME}, --save-output, --return-frames

For frame interpolation and upscaling, see Post-Processing.

Quantized transformers

For quantized transformer checkpoints, prefer:

--model-path for the base pipeline
--transformer-path for a quantized transformers transformer component folder
--transformer-weights-path for a quantized safetensors file, directory, or repo
--quantization for online quantization (quantizes an unquantized model at load time; activations are quantized dynamically)
--quantization-ignored-layers for layer-name patterns to keep unquantized (e.g. attention.to_)

See Quantization for supported quantization families and examples.

Request logging

--log-requests: Log user-facing fields of all requests (default: False). The verbosity is decided by --log-requests-level.
--log-requests-level {0|1|2|3}: Verbosity level for request logging (default: 2). 0: Log metadata (request id). 1: Log metadata and sampling config (seed, steps, guidance, resolution, frames, fps, …). 2: Log metadata, sampling config and prompt (truncated to 2 KiB). 3: Log metadata, sampling config and full prompt.
--log-requests-format {text|json}: Format for request logging (default: text). text is human-readable; json outputs structured JSON lines.
--log-requests-target {TARGET...}: Target(s) for request logging. Use stdout for console output and/or directory path(s) for file output. Multiple targets can be given, e.g., --log-requests-target stdout /my/log/dir.
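
The four logging flags compose. As a sketch, the following starts the server with structured request logs written to both the console and a directory, at level 1 so that prompts are not recorded. The model and the ./request-logs directory are illustrative choices, and creating the directory up front is a precaution of this sketch rather than a documented requirement.

```shell
# Illustrative log directory; any writable path works for this sketch.
log_dir="./request-logs"
mkdir -p "$log_dir"

# Level 1 logs metadata and sampling config, but not the prompt.
if command -v sglang >/dev/null 2>&1; then
  sglang serve \
    --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
    --log-requests \
    --log-requests-level 1 \
    --log-requests-format json \
    --log-requests-target stdout "$log_dir"
else
  echo "sglang not found on PATH; skipping server start"
fi
```

Raise the level to 2 or 3 only when prompt text is needed in the logs, since level 3 records prompts in full.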

Configuration Files

Use --config to load JSON or YAML configuration. Command-line flags override values from the config file.

Command

sglang generate --config config.yaml

Example:

Config

model_path: FastVideo/FastHunyuan-diffusers
prompt: A beautiful woman in a red dress walking down a street
output_path: outputs/
num_gpus: 2
sp_degree: 2
tp_size: 1
num_frames: 45
height: 720
width: 1280
num_inference_steps: 6
seed: 1024
fps: 24
precision: bf16
vae_precision: fp16
vae_tiling: true
vae_sp: true
enable_torch_compile: false
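
Since --config also accepts JSON and command-line flags take precedence, one file can act as a baseline that individual runs adjust. The sketch below assumes that a JSON config uses the same keys as the YAML example; the file name config.json and the overridden values are illustrative.

```shell
# Write a JSON config; keys are assumed to mirror the YAML example above.
cat > config.json <<'EOF'
{
  "model_path": "FastVideo/FastHunyuan-diffusers",
  "prompt": "A beautiful woman in a red dress walking down a street",
  "output_path": "outputs/",
  "num_inference_steps": 6,
  "seed": 1024
}
EOF

# Flags on the command line override the file, so this run would use
# seed 42 and 8 steps while taking everything else from config.json.
if command -v sglang >/dev/null 2>&1; then
  sglang generate --config config.json --seed 42 --num-inference-steps 8
else
  echo "sglang not found on PATH; wrote config.json only"
fi
```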

Generate

sglang generate runs a single generation job and exits when the job finishes.

Command

sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --text-encoder-cpu-offload \
  --pin-cpu-memory \
  --num-gpus 4 \
  --ulysses-degree 2 \
  --ring-degree 2 \
  --prompt "A curious raccoon" \
  --save-output \
  --output-path outputs \
  --output-file-name "a-curious-raccoon.mp4"

HTTP server-only arguments are ignored by sglang generate.

For diffusers pipelines, Cache-DiT can be enabled with SGLANG_CACHE_DIT_ENABLED=true or --cache-dit-config. See Cache-DiT.

For supported image pipelines, breakable CUDA graph can be enabled with --enable-breakable-cuda-graph, but you must declare every served resolution in --warmup-resolutions so warmup captures matching graph signatures.

Layerwise Offload

Use layerwise offload when a large component does not fit comfortably in GPU memory. By default, --dit-layerwise-offload only applies to legacy DiT components. Use --layerwise-offload-components to select pipeline component names explicitly (--layerwise-offload-modules is accepted as an alias):

Command

sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --dit-layerwise-offload \
  --layerwise-offload-components transformer text_encoder \
  --dit-offload-prefetch-size 0 \
  --prompt "A quiet city street after rain"

The values must match keys in the selected pipeline’s pipeline.modules, such as transformer, text_encoder, image_encoder, vae, condition_image_encoder, spatial_upsampler, or vocoder. Use all to select every layerwise-offloadable component. Prefer the smallest component set that solves the memory issue because layerwise offload can increase latency.

Serve

sglang serve starts the HTTP server and keeps the model loaded for repeated requests.

Command

sglang serve \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --text-encoder-cpu-offload \
  --pin-cpu-memory \
  --num-gpus 4 \
  --ulysses-degree 2 \
  --ring-degree 2 \
  --port 30010

Cloud Storage

SGLang Diffusion can upload generated images and videos to S3-compatible object storage after generation.

Command

export SGLANG_CLOUD_STORAGE_TYPE=s3
export SGLANG_S3_BUCKET_NAME=my-bucket
export SGLANG_S3_ACCESS_KEY_ID=your-access-key
export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key
export SGLANG_S3_ENDPOINT_URL=https://minio.example.com

See Environment Variables for the full set of storage options.

Component Path Overrides

Override individual pipeline components such as vae, transformer, or text_encoder with --<component>-path.

Command

sglang serve \
  --model-path black-forest-labs/FLUX.2-dev \
  --vae-path fal/FLUX.2-Tiny-AutoEncoder

The component key must match the key in the model’s model_index.json, and the path must be either a Hugging Face repo ID or a complete component directory.

Component Attention Backend Overrides

Use --component-attention-backends when one pipeline component needs a different native attention backend from the global --attention-backend.

Command

sglang generate \
  --model-path Lightricks/LTX-2.3 \
  --attention-backend fa \
  --component-attention-backends text_encoder=torch_sdpa

The component key must match a pipeline module key such as text_encoder, text_encoder_2, transformer, transformer_2, or connectors. Component overrides take precedence over the global --attention-backend only while that component is being constructed. You can also pass dotted CLI entries:

Command

sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --component-attention-backends.text_encoder torch_sdpa \
  --component-attention-backends.transformer fa

Diffusers Backend

Use --backend diffusers to force vanilla diffusers pipelines when no native SGLang implementation exists or when a model requires a custom pipeline class.

Key Options

Argument	Values	Description
`--backend`	`auto`, `sglang`, `diffusers`	Choose the backend automatically, force native SGLang, or force diffusers
`--attention-backend`	`flash`, `_flash_3_hub`, `sage`, `xformers`, `native`	Attention backend for diffusers pipelines
`--trust-remote-code`	flag	Required for models with custom pipeline classes
`--vae-tiling` and `--vae-slicing`	flag	Lower memory usage for VAE decode
`--dit-precision` and `--vae-precision`	`fp16`, `bf16`, `fp32`	Precision controls
`--enable-torch-compile`	flag	Enable `torch.compile`
`--cache-dit-config`		Cache-DiT config for diffusers pipelines

Example

sglang generate \
  --model-path AIDC-AI/Ovis-Image-7B \
  --backend diffusers \
  --trust-remote-code \
  --attention-backend flash \
  --prompt "A serene Japanese garden with cherry blossoms" \
  --height 1024 \
  --width 1024 \
  --num-inference-steps 30 \
  --save-output \
  --output-path outputs \
  --output-file-name ovis_garden.png

For pipeline-specific arguments not exposed in the CLI, pass diffusers_kwargs in a config file.
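
A config file carrying diffusers_kwargs might look like the sketch below. It assumes that config keys mirror the CLI flags in snake_case (as model_path and enable_torch_compile do in the Configuration Files example) and that diffusers_kwargs is a mapping handed to the diffusers pipeline. The file name and the nested example_pipeline_argument key are hypothetical placeholders; replace them with arguments that your pipeline class actually accepts.

```shell
# Write a JSON config for the diffusers backend (file name is illustrative).
cat > diffusers_config.json <<'EOF'
{
  "model_path": "<MODEL_PATH_OR_ID>",
  "backend": "diffusers",
  "prompt": "A serene Japanese garden with cherry blossoms",
  "diffusers_kwargs": {
    "example_pipeline_argument": "value"
  }
}
EOF

# After replacing the placeholders, the config can be passed with --config:
echo "sglang generate --config diffusers_config.json --save-output"
```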
