1. Model Introduction
Llama 4 is Meta's latest generation of open-source LLMs, with industry-leading performance. SGLang has supported Llama 4 Scout (109B) and Llama 4 Maverick (400B) since v0.4.5, and ongoing optimizations are tracked in the Roadmap. This generation delivers comprehensive upgrades across the board:
- The highly capable Llama 4 Maverick has 17B active parameters out of ~400B total, with 128 experts.
- The efficient Llama 4 Scout also has 17B active parameters, out of ~109B total, using just 16 experts.
Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout are trained on up to 40 trillion tokens of data encompassing 200 languages (with specific fine-tuning support for 12 languages, including Arabic, Spanish, German, and Hindi). For more details, please refer to the official Llama 4 page: https://www.llama.com/models/llama-4/
2. SGLang Installation
SGLang offers multiple installation methods; choose the one that best suits your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions.
3. Model Deployment
This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels.
3.1 Basic Configuration
Interactive Command Generator: use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities.
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:
4.2 Advanced Usage
4.2.1 Launch the Docker container
Command
Command
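The elided commands above are kept as-is; as a rough sketch only, a ROCm container launch for MI300X hardware might look like the following. The image tag, mounted cache path, and device flags are assumptions, not the exact commands from this guide.

```shell
# Hypothetical ROCm container launch for SGLang on AMD GPUs.
# Image tag and flags are illustrative assumptions.
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --ipc=host --shm-size 16g \
  --security-opt seccomp=unconfined \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  lmsysorg/sglang:latest-rocm \
  bash
```

Mounting the Hugging Face cache avoids re-downloading model weights on every container restart.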
4.2.2 Launch the server
Llama-4-Scout
8-GPU deployment command:
Command
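The exact command is elided above; a minimal sketch of an 8-GPU tensor-parallel launch is shown below. The model path, port, and context length are assumptions for illustration.

```shell
# Sketch of an 8-way tensor-parallel SGLang launch for Llama 4 Scout.
# Model path, port, and context length are illustrative assumptions.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --host 0.0.0.0 \
  --port 30000 \
  --context-length 131072
```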
Llama-4-Maverick
8-GPU deployment command:
Command
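Once either server is running, it can be queried through SGLang's OpenAI-compatible API. A hypothetical request, assuming the server listens on localhost port 30000:

```shell
# Example chat-completions request against a locally running SGLang server.
# The port and model name are assumptions for illustration.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 64
  }'
```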
5. Benchmark
5.1 Speed Benchmark (Llama-4-Scout)
Test Environment:
- Hardware: AMD MI300X GPU
- Model: Llama-4-Scout
- Tensor Parallelism: 8
- SGLang version: 0.5.9

- Model Deployment:
Command
5.1.1 Low Concurrency (Latency-Optimized)
- Benchmark Command:
Command
- Test Results:
Output
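The elided benchmark command above can be sketched with SGLang's serving benchmark tool. The dataset, token lengths, and prompt count below are illustrative assumptions, not the values used for the reported results.

```shell
# Latency-oriented sketch: one in-flight request at a time.
# Dataset and length values are illustrative assumptions.
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 32 \
  --max-concurrency 1
```

Raising --max-concurrency (and --num-prompts) moves the same command toward the balanced and throughput-optimized regimes measured in the following subsections.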
5.1.2 Medium Concurrency (Balanced)
- Benchmark Command:
Command
- Test Results:
Output
5.1.3 High Concurrency (Throughput-Optimized)
- Benchmark Command:
Command
- Test Results:
Output
5.2 Speed Benchmark (Llama-4-Maverick)
Test Environment:
- Hardware: AMD MI300X GPU
- Model: Llama-4-Maverick
- Tensor Parallelism: 8
- SGLang version: 0.5.9

- Model Deployment:
Command
5.2.1 Low Concurrency (Latency-Optimized)
- Benchmark Command:
Command
- Test Results:
Output
5.2.2 Medium Concurrency (Balanced)
- Benchmark Command:
Command
- Test Results:
Output
5.2.3 High Concurrency (Throughput-Optimized)
- Benchmark Command:
Command
- Test Results:
Output
5.3 Accuracy Benchmark
5.3.1 GSM8K Benchmark
- Benchmark Command:
Command
- Llama-4-Scout-17B-16E-Instruct
Output
- Llama-4-Maverick-17B-128E-Instruct
Output
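The GSM8K command elided above can be approximated with SGLang's few-shot GSM8K test script against a running server. The question count, parallelism, and server address are illustrative assumptions.

```shell
# Accuracy-check sketch using SGLang's few-shot GSM8K script.
# Question count, parallelism, and server address are assumptions.
python3 -m sglang.test.few_shot_gsm8k \
  --num-questions 200 \
  --parallel 128 \
  --host http://127.0.0.1 \
  --port 30000
```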
