1. Model Introduction
Krea-2 is a high-quality text-to-image diffusion model from Krea. It ships in two variants that share the same backbone and differ only in their sampling recipe:
- Krea-2-Turbo - a distilled, few-step model that produces photorealistic images in only 8 inference steps with no classifier-free guidance (guidance_scale = 1.0), ideal for fast and interactive generation.
- Krea-2-Raw - the base (non-distilled) model that trades speed for maximum fidelity, using a longer schedule (~52 steps) with classifier-free guidance (guidance_scale ≈ 4.5).
Both variants are distributed in diffusers format (a model_index.json plus sharded transformer/, text_encoder/, vae/, tokenizer/, and scheduler/ folders). SGLang loads them natively: just point --model-path at the repo, with no conversion step required.
Key Features:
- Two variants, one pipeline: switch between fast (Turbo) and high-fidelity (Raw) by changing only the model path and the sampling settings.
- Photorealistic generation at 1024x1024 and other resolutions.
- Native diffusers loading: components (DiT, text encoder, VAE, scheduler) are read straight from the repo’s model_index.json.
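The "two variants, one pipeline" idea can be sketched as a small preset table. This is a minimal illustration, not part of SGLang: the `SamplingPreset` class and `select_preset` helper are hypothetical names, the `krea/Krea-2-Raw` repo id is assumed by analogy with `krea/Krea-2-Turbo`, and the Raw values are the approximate figures quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingPreset:
    """Model path plus the sampling settings that differ between variants."""
    model_path: str
    num_inference_steps: int
    guidance_scale: float


PRESETS = {
    # Distilled model: 8 steps, guidance_scale = 1.0 (no classifier-free guidance).
    "turbo": SamplingPreset("krea/Krea-2-Turbo", 8, 1.0),
    # Base model: approximate values (~52 steps, guidance_scale ~4.5).
    # The repo id below is an assumption, not confirmed by this guide.
    "raw": SamplingPreset("krea/Krea-2-Raw", 52, 4.5),
}


def select_preset(variant: str) -> SamplingPreset:
    """Return the preset for 'turbo' or 'raw' (case-insensitive)."""
    try:
        return PRESETS[variant.lower()]
    except KeyError:
        raise ValueError(f"unknown variant: {variant!r}") from None


def uses_cfg(preset: SamplingPreset) -> bool:
    """Classifier-free guidance is only active when guidance_scale > 1.0."""
    return preset.guidance_scale > 1.0


preset = select_preset("turbo")
print(preset.model_path, preset.num_inference_steps, uses_cfg(preset))
```

Switching variants then means changing one key, while the rest of the pipeline code stays the same.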
2. SGLang-diffusion Installation
SGLang-diffusion offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. Please refer to the official SGLang-diffusion installation guide for installation instructions.
3. Model Deployment
This section covers deploying Krea-2-Turbo for fast, high-quality image generation.
3.1 Basic Configuration
Krea-2-Turbo generates high-quality images in only 8 inference steps. Launch the server with:
Command
Turbo runs without classifier-free guidance, so keep guidance_scale = 1.0.
3.2 Configuration Tips
Currently supported optimizations are listed here.
- --num-gpus: Number of GPUs to use.
- --tp-size: Tensor parallelism size (the recommended multi-GPU path for Krea-2). The model's attention heads (48, with 12 KV heads) and text heads (20) are all divisible by tensor-parallel sizes of 1, 2, and 4.
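The divisibility claim for --tp-size can be checked with a few lines of arithmetic. The sketch below assumes that a tensor-parallel size is usable only when it divides every head count evenly; `valid_tp_sizes` is a hypothetical helper, not an SGLang function, and the head counts are the ones quoted above.

```python
from functools import reduce
from math import gcd

# Head counts quoted in this guide for Krea-2.
HEAD_COUNTS = {"attention_heads": 48, "kv_heads": 12, "text_heads": 20}


def valid_tp_sizes(head_counts: dict, max_size: int = 8) -> list:
    """List the sizes in 1..max_size that divide every head count evenly."""
    common = reduce(gcd, head_counts.values())  # gcd(48, 12, 20) == 4
    return [tp for tp in range(1, max_size + 1) if common % tp == 0]


print(valid_tp_sizes(HEAD_COUNTS))  # [1, 2, 4]
```

A size of 3 divides 48 and 12 but not 20, and 8 divides only 48, which is why both are excluded.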
4. API Usage
For complete API documentation, please refer to the official API usage guide.
4.1 Generate an Image
Generate an image with the OpenAI-compatible images API:
Example
Command
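As a rough illustration of what a request to an OpenAI-compatible images endpoint looks like, the sketch below builds the HTTP request with the Python standard library. Several details are assumptions rather than facts from this guide: the server address `http://localhost:30000`, the `/v1/images/generations` path (taken from the OpenAI images API), and the exact set of fields the server accepts. `build_image_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request


def build_image_request(prompt, base_url="http://localhost:30000",
                        size="1024x1024", n=1):
    """Build (but do not send) a POST request for the images endpoint."""
    payload = {
        "model": "krea/Krea-2-Turbo",
        "prompt": prompt,
        "size": size,
        "n": n,
        "response_format": "b64_json",  # ask for base64 image data in the reply
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/images/generations",  # assumed OpenAI-style path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(request):
    """Send the request to a running server and return the parsed JSON body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


req = build_image_request("a lighthouse at dawn, photorealistic")
print(req.get_method(), req.full_url)
```

Calling `generate(req)` requires a running server; check the official API usage guide for the fields and defaults your version actually supports.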
4.2 Advanced Usage
Krea-2’s DiT is ~24 GB in bf16 (the bulk of the model). On memory-constrained GPUs you can keep less of it resident:
- --dit-layerwise-offload: stream the DiT’s transformer blocks layer-by-layer with async host-to-device prefetch overlap, so only a small working set stays on the GPU. This is the primary way to fit Krea-2 on a single consumer / 32 GB-class card, at a modest latency cost. Tune the memory/latency trade-off with --dit-offload-prefetch-size (0.0 prefetches one layer for the lowest memory; larger values prefetch more layers, which is faster but uses more memory).
- --dit-cpu-offload: keep the whole DiT in host memory. Combine it with --dit-layerwise-offload for the lowest peak GPU memory (weights stay on host and only the layers needed for the current step are brought on-device).
- --text-encoder-cpu-offload: offload the Qwen3-VL text encoder (it is idle during the denoise loop).
- --vae-cpu-offload: offload the VAE.
- --pin-cpu-memory: pin host memory for offload. Add only as a temporary workaround if you hit CUDA error: invalid argument.
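The offload options above can be combined in many ways, so a small helper that assembles them may be useful. This is a sketch under assumptions: `offload_flags` is a hypothetical function, the boolean options are assumed to be bare switches, and --dit-offload-prefetch-size is assumed to take a single numeric value, as the "0.0" example suggests. Verify the exact syntax against your installed version.

```python
def offload_flags(layerwise=True, prefetch_size=None, dit_cpu=False,
                  text_encoder_cpu=False, vae_cpu=False, pin_memory=False):
    """Return a list of command-line flags for the chosen offload options."""
    flags = []
    if layerwise:
        flags.append("--dit-layerwise-offload")
        if prefetch_size is not None:
            if prefetch_size < 0:
                raise ValueError("prefetch_size must be non-negative")
            # 0.0 prefetches one layer (lowest memory); larger values trade
            # memory for speed.
            flags += ["--dit-offload-prefetch-size", str(prefetch_size)]
    if dit_cpu:
        flags.append("--dit-cpu-offload")
    if text_encoder_cpu:
        flags.append("--text-encoder-cpu-offload")
    if vae_cpu:
        flags.append("--vae-cpu-offload")
    if pin_memory:
        # Only as a temporary workaround for "CUDA error: invalid argument".
        flags.append("--pin-cpu-memory")
    return flags


# Lowest-memory configuration described above: layerwise streaming with
# minimal prefetch, plus the DiT, text encoder and VAE kept on the host.
print(" ".join(offload_flags(prefetch_size=0.0, dit_cpu=True,
                             text_encoder_cpu=True, vae_cpu=True)))
```

The resulting flags are meant to be appended to the server launch command; start with layerwise offload alone and add the others only if memory is still too tight.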
5. Benchmark
Test Environment:
- Hardware: NVIDIA H200 GPU (1x)
- Model: krea/Krea-2-Turbo (8 inference steps)
- SGLang-diffusion version: 0.5.13
Command
5.1 Generate an image
Benchmark Command:
Command
Output
5.2 Generate images with high concurrency
Benchmark Command:
Command
Output
