1. Model Introduction
Llama 4 is Meta’s latest generation of open-source LLM model with industry-leading performance. SGLang has supported Llama 4 Scout (109B) and Llama 4 Maverick (400B) since v0.4.5. Ongoing optimizations are tracked in the Roadmap. This generation delivers comprehensive upgrades across the board: The highly capable Llama 4 Maverick with 17B active parameters out of ~400B total, with 128 experts. The efficient Llama 4 Scout also has 17B active parameters out of ~109B total, using just 16 experts. Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout are both trained on up to 40 trillion tokens on data encompassing 200 languages (with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi). For more details, please refer to the official llama4 Repository:https://www.llama.com/models/llama-4/2. SGLang Installation
SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. Please refer to the official SGLang installation guide for installation instructions.3. Model Deployment
This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels.3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities.3.2 Configuration Tips
- OOM Mitigation: Reduce
--context-lengthto avoid GPU out-of-memory. Recommended: Scout up to 1M on 8×H100, up to 2.5M on 8×H200; Maverick doesn’t need context-length set on 8×H200. With hybrid KV cache enabled, Scout can reach 5M on 8×H100 and 10M on 8×H200. - Attention Backend Auto-Selection: SGLang automatically picks the optimal backend. Manual override with
--attention-backend:- Blackwell (B200/GB200):
trtllm_mha - Hopper (H100/H200):
fa3 - AMD GPUs:
aiter - Intel XPU:
intel_xpu - Other:
triton
- Blackwell (B200/GB200):
- Chat Template: Add
--chat-template llama-4for chat completion tasks. - Multi-Modal: Add
--enable-multimodalto enable image input support. - Hybrid KV Cache: Set
--swa-full-tokens-ratioto control the ratio of SWA (local attention) KV tokens to full-attention KV tokens (default: 0.8, range: 0–1). - EAGLE Speculative Decoding: Supported for Llama 4 Scout and Maverick via EAGLE3. Enable with the interactive command generator above.
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:4.2 Advanced Usage
4.2.1 Launch the docker
Command
Command
4.2.2 Launch the server
Llama-4-Scout
8-GPU deployment command:Command
Llama-4-Maverick
8-GPU deployment command:Command
4.2.3 EAGLE Speculative Decoding
SGLang supports Llama 4 Maverick (400B) with EAGLE speculative decoding. Enable with the EAGLE3 algorithm and the SGLang EAGLE3 draft model:Command
5. Benchmark
5.1 Speed Benchmark (Scout)
Test Environment: Hardware: AMD MI300x GPU Model: Llama-4-Scout Tensor Parallelism: 8 sglang version: 0.5.9- Model Deployment
Command
5.1.1 Low Concurrency (Latency-Optimized)
- Benchmark Command:
Command
- Test Results:
Output
5.1.2 Medium Concurrency (Balanced)
- Benchmark Command:
Command
- Test Results:
Output
5.1.3 High Concurrency (Throughput-Optimized)
- Benchmark Command:
Command
- Test Results:
Output
5.2 Speed Benchmark (Maverick)
Test Environment: Hardware: AMD MI300x GPU Model: Llama-4-Maverick Tensor Parallelism: 8 sglang version: 0.5.9- Model Deployment
Command
5.2.1 Low Concurrency (Latency-Optimized)
- Benchmark Command:
Command
- Test Results:
Output
5.2.2 Medium Concurrency (Balanced)
- Benchmark Command:
Command
- Test Results:
Output
5.2.3 High Concurrency (Throughput-Optimized)
- Benchmark Command:
Command
- Test Results:
Output
5.3 Accuracy Benchmark
5.3.1 GSM8K Benchmark
- Benchmark Command:
Command
- Llama-4-Scout-17B-16E-Instruct
Output
- Llama-4-Maverick-17B-128E-Instruct
Output
5.3.2 MMLU Pro with lm-eval
Accuracy on MMLU Pro matches Meta’s official benchmark numbers on 8×H100 (reproduction details: PR #5092):| Model | Official | SGLang |
|---|---|---|
| Llama-4-Scout-17B-16E-Instruct | 74.3 | 75.2 |
| Llama-4-Maverick-17B-128E-Instruct | 80.5 | 80.7 |
Command
Command
