1. Model Introduction
FLUX is a family of rectified flow transformer models developed by Black Forest Labs for high-quality text-to-image generation. FLUX.1-dev is a 12-billion-parameter rectified flow transformer that generates images from text descriptions.
Key Features:
- Cutting-edge Output Quality: Second only to the state-of-the-art FLUX.1 [pro] model
- Competitive Prompt Following: Matches the performance of closed-source alternatives
- Guidance Distillation: Trained using guidance distillation for improved efficiency
- Open Weights: Available for personal, scientific, and commercial purposes under the FLUX [dev] Non-Commercial License
- State-of-the-art Performance: Leading open model in text-to-image generation, single-reference editing, and multi-reference editing
- No Finetuning Required: Character, object, and style reference without additional training in one model
2. SGLang-diffusion Installation
SGLang-diffusion offers multiple installation methods; choose the one that best fits your hardware platform and requirements. Please refer to the official SGLang-diffusion installation guide for instructions.
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
FLUX models are optimized for high-quality image generation, and the recommended launch configuration varies by hardware and model version. SGLang supports serving FLUX on NVIDIA B200, H200, and H100 GPUs, as well as AMD MI355X, MI325X, and MI300X GPUs.
3.2 Configuration Tips
All currently supported optimization flags are listed here.
- --vae-path: Path to a custom VAE model or HuggingFace model ID (e.g., fal/FLUX.2-Tiny-AutoEncoder). If not specified, the VAE is loaded from the main model path.
- --num-gpus: Number of GPUs to use.
- --tp-size: Tensor parallelism size (encoder only; should not exceed 1 if text encoder offload is enabled, since layer-wise offload plus prefetch is faster).
- --sp-degree: Sequence parallelism size (typically should match the number of GPUs).
- --ulysses-degree: The degree of DeepSpeed-Ulysses-style sequence parallelism in USP.
- --ring-degree: The degree of ring-attention-style sequence parallelism in USP.
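As an illustration of how these flags combine, here is a hedged launch sketch: the entrypoint (`python -m sglang.launch_server`) follows the usual SGLang convention but may differ in your SGLang-diffusion install, and the GPU count is an arbitrary example.

```shell
# Hedged sketch, not a verified command: entrypoint name is an assumption.
# Uses 4 GPUs with sequence parallelism matching the GPU count, as the
# configuration tips above recommend.
python -m sglang.launch_server \
  --model-path black-forest-labs/FLUX.1-dev \
  --num-gpus 4 \
  --tp-size 1 \
  --sp-degree 4
```

Adjust `--num-gpus` and `--sp-degree` together to match your hardware.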
4. API Usage
For complete API documentation, please refer to the official API usage guide.
4.1 Generate an Image
Example
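A hedged request sketch follows. It assumes the server exposes an OpenAI-compatible `/v1/images/generations` endpoint on `localhost:30000`; both the endpoint path and the port are assumptions, so check the official API usage guide for the actual interface.

```shell
# Hedged sketch: endpoint path and port are assumptions, not confirmed API.
curl -s http://localhost:30000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "black-forest-labs/FLUX.1-dev",
    "prompt": "A photo of a red fox in a snowy forest",
    "size": "1024x1024",
    "n": 1
  }'
```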
4.2 Advanced Usage
4.2.1 Cache-DiT Acceleration
SGLang integrates Cache-DiT, a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 7.4x inference speedup with minimal quality loss. You can set SGLANG_CACHE_DIT_ENABLED=True to enable it. For more details, please refer to the SGLang Cache-DiT documentation.
Basic Usage
Command
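For instance, Cache-DiT can be enabled by setting the environment variable from the text above before launching the server; the launch entrypoint shown here is an assumption and may differ in your install.

```shell
# SGLANG_CACHE_DIT_ENABLED comes from this section's docs; the entrypoint
# is a hedged assumption following SGLang convention.
SGLANG_CACHE_DIT_ENABLED=True \
python -m sglang.launch_server --model-path black-forest-labs/FLUX.1-dev
```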
- DBCache Parameters: DBCache controls block-level caching behavior:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Fn | SGLANG_CACHE_DIT_FN | 1 | Number of first blocks to always compute |
| Bn | SGLANG_CACHE_DIT_BN | 0 | Number of last blocks to always compute |
| W | SGLANG_CACHE_DIT_WARMUP | 4 | Warmup steps before caching starts |
| R | SGLANG_CACHE_DIT_RDT | 0.24 | Residual difference threshold |
| MC | SGLANG_CACHE_DIT_MC | 3 | Maximum continuous cached steps |
- TaylorSeer Configuration: TaylorSeer improves caching accuracy using Taylor expansion:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Enable | SGLANG_CACHE_DIT_TAYLORSEER | false | Enable TaylorSeer calibrator |
| Order | SGLANG_CACHE_DIT_TS_ORDER | 1 | Taylor expansion order (1 or 2) |
Command
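Putting the two tables together, a tuned run might set the DBCache and TaylorSeer environment variables explicitly. The variable names and defaults are from the tables above; the entrypoint and the choice of a second-order TaylorSeer expansion are illustrative assumptions.

```shell
# Hedged sketch: env var names are from the DBCache/TaylorSeer tables above;
# the launch entrypoint is an assumption. TS_ORDER=2 is an example override.
SGLANG_CACHE_DIT_ENABLED=True \
SGLANG_CACHE_DIT_FN=1 \
SGLANG_CACHE_DIT_BN=0 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.24 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
python -m sglang.launch_server --model-path black-forest-labs/FLUX.1-dev
```

A lower SGLANG_CACHE_DIT_RDT caches more aggressively at some quality cost; raise it if you see artifacts.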
4.2.2 CPU Offload
- --dit-cpu-offload: Use CPU offload for DiT inference. Enable this if you run out of memory.
- --text-encoder-cpu-offload: Use CPU offload for text encoder inference.
- --vae-cpu-offload: Use CPU offload for the VAE.
- --pin-cpu-memory: Pin memory for CPU offload. Only add this as a temporary workaround if you hit "CUDA error: invalid argument".
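A memory-constrained launch might combine these flags as follows; the flags are from this section, while the entrypoint is a hedged assumption.

```shell
# Hedged sketch for low-VRAM setups: offload flags are from this section,
# the entrypoint is an assumption. Add --pin-cpu-memory only if you hit
# "CUDA error: invalid argument".
python -m sglang.launch_server \
  --model-path black-forest-labs/FLUX.1-dev \
  --dit-cpu-offload \
  --text-encoder-cpu-offload \
  --vae-cpu-offload
```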
5. Benchmark
5.1 Speedup Benchmark
5.1.1 Generate an image
Test Environment:
- Hardware: NVIDIA B200 GPU (1x)
- Model: black-forest-labs/FLUX.1-dev
- SGLang-diffusion version: 0.5.6.post2
Command
Command
Output
5.1.2 Generate images with high concurrency
Server Command:
Command
Command
Output
