# CLI reference

> Run one-off generation tasks and launch the HTTP server from the command line.

Use the CLI for one-off generation with `sglang generate` or to start a persistent HTTP server with `sglang serve`.

### Overlay repos for non-diffusers models

If `--model-path` points to a supported non-diffusers source repo, SGLang can resolve it
through a self-hosted overlay repo.

SGLang first checks a built-in overlay registry. Concrete built-in mappings can be added over time without changing the CLI surface.

Override example:

```bash Command theme={null}
export SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY='{
  "Wan-AI/Wan2.2-S2V-14B": {
    "overlay_repo_id": "your-org/Wan2.2-S2V-14B-overlay",
    "overlay_revision": "main"
  }
}'

sglang generate \
  --model-path Wan-AI/Wan2.2-S2V-14B \
  --config configs/wan_s2v.yaml
```

The overlay repo should be a complete diffusers-style (componentized) repo.

You can also pass the overlay repo itself as `--model-path` if it contains `_overlay/overlay_manifest.json`.

Notes:

1. `SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY` is only an optional override for
   development and debugging. It accepts either a JSON object or a path to a JSON
   file, and can extend or replace built-in entries for the current process.
2. On the first load, SGLang will:
   * download overlay metadata from the overlay repo
   * download the required files from the original source repo
   * materialize a local standard component repo under `~/.cache/sgl_diffusion/materialized_models/`
3. Later loads reuse the materialized local repo. The materialized repo is what the runtime loads as a normal componentized model directory.
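
Because the variable also accepts a path to a JSON file, a larger registry can live in a file instead of an inline string. The following is a minimal sketch, assuming a scratch directory and the hypothetical file name `overlay_registry.json`; the overlay repo ID is the same placeholder used in the override example.

```bash
# Write the registry to a JSON file (directory and file name are illustrative).
registry_dir="$(mktemp -d)"
registry_file="$registry_dir/overlay_registry.json"

cat > "$registry_file" <<'EOF'
{
  "Wan-AI/Wan2.2-S2V-14B": {
    "overlay_repo_id": "your-org/Wan2.2-S2V-14B-overlay",
    "overlay_revision": "main"
  }
}
EOF

# Point the override at the file instead of an inline JSON object.
export SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY="$registry_file"

# After a first successful load, materialized repos are expected under this path.
materialized_root="$HOME/.cache/sgl_diffusion/materialized_models"
if [ -d "$materialized_root" ]; then
  ls "$materialized_root"
else
  echo "nothing materialized yet under $materialized_root"
fi
```

The override applies only to processes started from this shell, which matches its intended use for development and debugging.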

## Quick Start

### Generate

```bash Command theme={null}
sglang generate \
  --model-path Qwen/Qwen-Image \
  --prompt "A beautiful sunset over the mountains" \
  --save-output
```

### Serve

```bash Command theme={null}
sglang serve \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --num-gpus 4 \
  --ulysses-degree 2 \
  --ring-degree 2 \
  --port 30010
```

For request and response examples, see [OpenAI-Compatible API](./openai_api).

<Tip>
  Use `sglang generate --help` and `sglang serve --help` for the full argument list. The CLI help output is the source of truth for the complete set of flags.
</Tip>

## Common Options

### Model and runtime

* `--model-path {MODEL}`: model path or Hugging Face model ID
* `--lora-path {PATH}` and `--lora-nickname {NAME}`: load a LoRA adapter
* `--lora-merge-mode {auto|merge|dynamic}`: choose how LoRA is applied. `auto` statically merges regular weights and uses dynamic LoRA for FSDP-sharded weights to avoid full-gather peaks.
* `--num-gpus {N}`: number of GPUs to use
* `--performance-mode {manual|auto|speed|memory}` / `--mode`: preset for latency/throughput and memory defaults. `auto` is the default and keeps safe offload defaults, using FSDP only for validated DiT-offload replacement paths; `speed` also enables `--enable-torch-compile` by default unless you explicitly disable it. Use `manual` to keep performance-related server args under explicit user control. Explicit offload, FSDP, and parallelism flags take precedence in all modes.
* `--tp-size {N}`: tensor parallelism size, mainly for encoders
* `--sp-degree {N}`: sequence parallelism size
* `--ulysses-degree {N}` and `--ring-degree {N}`: USP parallelism controls
* `--enable-cfg-parallel {true|false}`: enable or explicitly disable CFG parallelism
* `--warmup-mode {off|request|server}`: control startup warmup for `sglang serve`; `off` skips warmup, `request` primes the request path, and `server` runs a full synthetic server warmup before serving traffic
* `--enable-torch-compile {true|false}`: compile native diffusion hot paths. When no warmup mode is configured, this also enables server warmup so first real requests do not pay compile latency.
* `--offload-during-compile {true|false}`: when compile warmup is active, temporarily layerwise-offload DiT weights and move resident non-DiT components off-device so `max-autotune` fits on tighter-memory GPUs; the configured serving residency is restored before real traffic. Skipped under existing layerwise offload, Cache-DiT, or FSDP.
* `--enable-breakable-cuda-graph {true|false}`: capture supported DiT forwards as breakable CUDA graph segments to reduce launch overhead. Requires `--warmup-resolutions` for every served resolution because each resolution is captured separately.
* `--bcg-text-buckets {N...}`: prompt-length padding buckets for breakable CUDA graph capture/replay reuse.
* `--attention-backend {BACKEND}`: attention backend for native SGLang and diffusers pipelines
* `--component-attention-backends {MAP}`: per-component attention backend overrides, for example `text_encoder=torch_sdpa,transformer=fa`
* `--attention-backend-config {CONFIG}`: attention backend configuration
* `--srt-encoder-url {HTTPADDRESS}`: address of the SGLang SRT server that hosts the AR model, for GLM-Image-like models
* `--srt-encoder-timeout {SECONDS}`: timeout in seconds for HTTP requests to the SGLang encoder server
* `--srt-encoder-connection-timeout {SECONDS}`: TCP connection timeout in seconds for the SGLang encoder server

### Sampling and output

* `--prompt {PROMPT}` and `--negative-prompt {PROMPT}`
* `--image-path {PATH} [{PATH} ...]`: input image(s) for image-to-video or image-to-image generation
* `--num-inference-steps {STEPS}` and `--seed {SEED}`
* `--height {HEIGHT}`, `--width {WIDTH}`, `--num-frames {N}`, `--fps {FPS}`
* `--output-path {PATH}`, `--output-file-name {NAME}`, `--save-output`, `--return-frames`

For frame interpolation and upscaling, see [Post-Processing](./post_processing).

### Quantized transformers

For quantized transformer checkpoints, prefer:

* `--model-path` for the base pipeline
* `--transformer-path` for a quantized `transformers` transformer component folder
* `--transformer-weights-path` for a quantized safetensors file, directory, or repo
* `--quantization` for online quantization (quantization is applied to unquantized models at load time; activations are quantized dynamically)
* `--quantization-ignored-layers` for layer name patterns to keep unquantized (e.g. `attention.to_`)

See [Quantization](../quantization) for supported quantization families and examples.

### Request logging

* `--log-requests`: Log user-facing fields of all requests (default: `False`). The verbosity is decided by `--log-requests-level`.
* `--log-requests-level {0|1|2|3}`: Verbosity level for request logging (default: `2`). 0: Log metadata (request id). 1: Log metadata and sampling config (seed, steps, guidance, resolution, frames, fps, ...). 2: Log metadata, sampling config and prompt (truncated to 2 KiB). 3: Log metadata, sampling config and full prompt.
* `--log-requests-format {text|json}`: Format for request logging (default: `text`). `text` is human-readable; `json` outputs structured JSON lines.
* `--log-requests-target {TARGET...}`: Target(s) for request logging. Use `stdout` for console output and/or directory path(s) for file output. Can specify multiple targets, e.g., `--log-requests-target stdout /my/log/dir`.
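
The logging flags combine on a single command line. The sketch below logs metadata and sampling config (level 1) as JSON lines to both the console and a directory. `./request-logs` is a placeholder path, and `run` is a hypothetical dry-run helper that is not part of SGLang: it prints the command unless `RUN=1` is set.

```bash
# Hypothetical dry-run wrapper: prints the command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Placeholder directory for file output.
log_dir="./request-logs"

# Level 1 records metadata and sampling config but not the prompt text.
run sglang serve \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --log-requests \
  --log-requests-level 1 \
  --log-requests-format json \
  --log-requests-target stdout "$log_dir"
```

Choosing level 1 keeps prompts out of the logs, which can matter when log files are shared more widely than the requests themselves.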

## Configuration Files

Use `--config` to load JSON or YAML configuration. Command-line flags override values from the config file.

```bash Command theme={null}
sglang generate --config config.yaml
```

Example:

```yaml Config theme={null}
model_path: FastVideo/FastHunyuan-diffusers
prompt: A beautiful woman in a red dress walking down a street
output_path: outputs/
num_gpus: 2
sp_degree: 2
tp_size: 1
num_frames: 45
height: 720
width: 1280
num_inference_steps: 6
seed: 1024
fps: 24
precision: bf16
vae_precision: fp16
vae_tiling: true
vae_sp: true
enable_torch_compile: false
```
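
Because command-line flags win over the file, one config can act as a baseline for several runs. The following sketch assumes a trimmed copy of the config above saved to a hypothetical `base.yaml` in a scratch directory. The `run` helper is a dry-run wrapper added for illustration only; it prints the command unless `RUN=1` is set.

```bash
# Hypothetical dry-run wrapper: prints the command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Baseline config (file name and location are illustrative).
config_file="$(mktemp -d)/base.yaml"
cat > "$config_file" <<'EOF'
model_path: FastVideo/FastHunyuan-diffusers
prompt: A beautiful woman in a red dress walking down a street
output_path: outputs/
num_inference_steps: 6
seed: 1024
EOF

# First run uses the file as-is; the second overrides only the seed.
run sglang generate --config "$config_file"
run sglang generate --config "$config_file" --seed 7
```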

## Generate

`sglang generate` runs a single generation job and exits when the job finishes.

```bash Command theme={null}
sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --text-encoder-cpu-offload \
  --pin-cpu-memory \
  --num-gpus 4 \
  --ulysses-degree 2 \
  --ring-degree 2 \
  --prompt "A curious raccoon" \
  --save-output \
  --output-path outputs \
  --output-file-name "a-curious-raccoon.mp4"
```

<Note>
  HTTP server-only arguments are ignored by `sglang generate`.
</Note>

For diffusers pipelines, Cache-DiT can be enabled with `SGLANG_CACHE_DIT_ENABLED=true` or `--cache-dit-config`. See [Cache-DiT](../cache_dit).

For supported image pipelines, breakable CUDA graph can be enabled with `--enable-breakable-cuda-graph`, but you must declare every served resolution in `--warmup-resolutions` so warmup captures matching graph signatures.

### Layerwise Offload

Use layerwise offload when a large component does not fit comfortably in GPU memory. By default, `--dit-layerwise-offload` only applies to legacy DiT components. Use `--layerwise-offload-components` to select pipeline component names explicitly (`--layerwise-offload-modules` is accepted as an alias):

```bash Command theme={null}
sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --dit-layerwise-offload \
  --layerwise-offload-components transformer text_encoder \
  --dit-offload-prefetch-size 0 \
  --prompt "A quiet city street after rain"
```

The values must match keys in the selected pipeline's `pipeline.modules`, such as `transformer`, `text_encoder`, `image_encoder`, `vae`, `condition_image_encoder`, `spatial_upsampler`, or `vocoder`. Use `all` to select every layerwise-offloadable component. Prefer the smallest component set that solves the memory issue because layerwise offload can increase latency.

## Serve

`sglang serve` starts the HTTP server and keeps the model loaded for repeated requests.

```bash Command theme={null}
sglang serve \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --text-encoder-cpu-offload \
  --pin-cpu-memory \
  --num-gpus 4 \
  --ulysses-degree 2 \
  --ring-degree 2 \
  --port 30010
```

### Cloud Storage

SGLang Diffusion can upload generated images and videos to S3-compatible object storage after generation.

```bash Command theme={null}
export SGLANG_CLOUD_STORAGE_TYPE=s3
export SGLANG_S3_BUCKET_NAME=my-bucket
export SGLANG_S3_ACCESS_KEY_ID=your-access-key
export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key
export SGLANG_S3_ENDPOINT_URL=https://minio.example.com
```

See [Environment Variables](../environment_variables) for the full set of storage options.
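
Before launching the server, it can help to confirm that the storage variables are actually exported. The helper below is a sketch, not part of SGLang: `missing_storage_vars` is a hypothetical name, and treating all five variables from the example as required is an assumption, so trim the list to match your setup.

```bash
# Print the name of every storage variable that is unset or empty.
# Assumption: all five variables from the example above are needed.
missing_storage_vars() {
  for name in SGLANG_CLOUD_STORAGE_TYPE SGLANG_S3_BUCKET_NAME \
              SGLANG_S3_ACCESS_KEY_ID SGLANG_S3_SECRET_ACCESS_KEY \
              SGLANG_S3_ENDPOINT_URL; do
    # Indirect lookup of the variable named by $name (names are fixed above).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}

missing="$(missing_storage_vars)"
if [ -n "$missing" ]; then
  echo "unset storage variables:" $missing
else
  echo "all storage variables are set"
fi
```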

## Component Path Overrides

Override individual pipeline components such as `vae`, `transformer`, or `text_encoder` with `--<component>-path`.

```bash Command theme={null}
sglang serve \
  --model-path black-forest-labs/FLUX.2-dev \
  --vae-path fal/FLUX.2-Tiny-AutoEncoder
```

The component key must match the key in the model's `model_index.json`, and the path must be either a Hugging Face repo ID or a complete component directory.

## Component Attention Backend Overrides

Use `--component-attention-backends` when one pipeline component needs a different native attention backend from the global `--attention-backend`.

```bash Command theme={null}
sglang generate \
  --model-path Lightricks/LTX-2.3 \
  --attention-backend fa \
  --component-attention-backends text_encoder=torch_sdpa
```

The component key must match a pipeline module key such as `text_encoder`, `text_encoder_2`, `transformer`, `transformer_2`, or `connectors`. Component overrides take precedence over the global `--attention-backend` only while that component is being constructed.

You can also pass dotted CLI entries:

```bash Command theme={null}
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --component-attention-backends.text_encoder torch_sdpa \
  --component-attention-backends.transformer fa
```
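
The comma-separated map form and the dotted form carry the same key/value pairs. Assuming the two are interchangeable, as the examples suggest, the sketch below converts a map string into dotted entries; `to_dotted_flags` is a hypothetical helper written for illustration, not an SGLang command.

```bash
# Expand "key=value,key=value" into dotted flag entries, one per line.
to_dotted_flags() {
  echo "$1" | tr ',' '\n' | while IFS='=' read -r component backend; do
    echo "--component-attention-backends.$component $backend"
  done
}

to_dotted_flags "text_encoder=torch_sdpa,transformer=fa"
```

For the input shown, this prints one `--component-attention-backends.<component> <backend>` entry each for `text_encoder` and `transformer`.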

## Diffusers Backend

Use `--backend diffusers` to force vanilla diffusers pipelines when no native SGLang implementation exists or when a model requires a custom pipeline class.

### Key Options

<table>
  <thead>
    <tr>
      <th>Argument</th>
      <th>Values</th>
      <th>Description</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td><code>--backend</code></td>
      <td><code>auto</code>, <code>sglang</code>, <code>diffusers</code></td>
      <td><code>auto</code> selects automatically, <code>sglang</code> forces native SGLang, <code>diffusers</code> forces diffusers</td>
    </tr>

    <tr>
      <td><code>--attention-backend</code></td>
      <td><code>flash</code>, <code>\_flash\_3\_hub</code>, <code>sage</code>, <code>xformers</code>, <code>native</code></td>
      <td>Attention backend for diffusers pipelines</td>
    </tr>

    <tr>
      <td><code>--trust-remote-code</code></td>
      <td>flag</td>
      <td>Required for models with custom pipeline classes</td>
    </tr>

    <tr>
      <td><code>--vae-tiling</code> and <code>--vae-slicing</code></td>
      <td>flag</td>
      <td>Lower memory usage for VAE decode</td>
    </tr>

    <tr>
      <td><code>--dit-precision</code> and <code>--vae-precision</code></td>
      <td><code>fp16</code>, <code>bf16</code>, <code>fp32</code></td>
      <td>Precision controls</td>
    </tr>

    <tr>
      <td><code>--enable-torch-compile</code></td>
      <td>flag</td>
      <td>Enable <code>torch.compile</code></td>
    </tr>

    <tr>
      <td><code>--cache-dit-config</code></td>

      <td><code>{PATH}</code></td>

      <td>Cache-DiT config for diffusers pipelines</td>
    </tr>
  </tbody>
</table>

### Example

```bash theme={null}
sglang generate \
  --model-path AIDC-AI/Ovis-Image-7B \
  --backend diffusers \
  --trust-remote-code \
  --attention-backend flash \
  --prompt "A serene Japanese garden with cherry blossoms" \
  --height 1024 \
  --width 1024 \
  --num-inference-steps 30 \
  --save-output \
  --output-path outputs \
  --output-file-name ovis_garden.png
```

For pipeline-specific arguments not exposed in the CLI, pass `diffusers_kwargs` in a config file.
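
As a rough illustration, the sketch below writes a config file that carries a `diffusers_kwargs` mapping. Several details are assumptions: that `backend` and `trust_remote_code` are valid config keys mirroring the CLI flags, that `diffusers_kwargs` takes a key/value mapping, and that the pipeline accepts `guidance_scale`. The file name is hypothetical. Check the arguments your pipeline actually accepts before relying on it.

```bash
# Config file location is illustrative.
config_file="$(mktemp -d)/ovis_diffusers.yaml"

cat > "$config_file" <<'EOF'
model_path: AIDC-AI/Ovis-Image-7B
backend: diffusers
trust_remote_code: true
prompt: A serene Japanese garden with cherry blossoms
diffusers_kwargs:
  # Hypothetical pipeline argument; replace with names your pipeline accepts.
  guidance_scale: 5.0
EOF

echo "wrote $config_file"
# To use it: sglang generate --config "$config_file" --save-output
```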