1. Model Introduction
Gemma 4 is Google’s next-generation family of open models, building on the Gemma 3 architecture with improved performance, MoE variants, and multimodal support for text, vision, and audio.
Key Features:
- Hybrid Attention: Combines sliding window and full attention layers for efficient long-context processing
- Multimodal: Supports text, image, and audio inputs via dedicated vision and audio encoders
- MoE Variant: The 26B-A4B model uses a Mixture-of-Experts architecture for efficient inference
- Per-Layer Embeddings (PLE): Layer-specific token embeddings for enhanced representations
- Reasoning: Built-in thinking mode with the gemma4 reasoning parser
- Tool Calling: Function call support with streaming via the gemma4 tool call parser
- Fused Operations: Triton-optimized RMSNorm + residual + scalar kernels
| Model | Architecture | Parameters |
|---|---|---|
| google/gemma-4-E2B-it | Dense | ~2B |
| google/gemma-4-E4B-it | Dense | ~4B |
| google/gemma-4-31B-it | Dense | 31B |
| google/gemma-4-26B-A4B-it | MoE | 26B total / 4B active |
2. SGLang Installation
Gemma 4 support requires sgl-project/sglang#21952 and a specific transformers commit:
Command
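The command block above holds the exact steps; as a rough sketch, installing from the PR branch might look like the following. The placeholder <TRANSFORMERS_COMMIT> is not the real SHA; substitute the commit referenced in the PR.

```shell
# Sketch only: install SGLang from the Gemma 4 support PR (sgl-project/sglang#21952).
git clone https://github.com/sgl-project/sglang.git
cd sglang
git fetch origin pull/21952/head:gemma4 && git checkout gemma4
pip install -e "python[all]"
# Pin transformers to the required commit (placeholder, see the PR):
pip install "git+https://github.com/huggingface/transformers.git@<TRANSFORMERS_COMMIT>"
```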
3. Model Deployment
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and model variant.
3.2 Configuration Tips
- SGLang automatically selects the Triton attention backend for Gemma 4 models (required for bidirectional image-token attention during prefill).
- For the 26B-A4B MoE model, consider --tp 2 for high-throughput workloads.
- Hardware requirements:
| Model | Hardware | TP |
|---|---|---|
| gemma-4-E2B-it | 1x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 1 |
| gemma-4-E4B-it | 1x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 1 |
| gemma-4-31B-it | 2x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 2 (H200) / 1 (AMD) |
| gemma-4-26B-A4B-it | 1x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 1 |
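For orientation, a minimal launch for one of the dense variants might look like the sketch below; the interactive command generator in Section 3.1 produces the authoritative command, and the flag names here follow standard SGLang conventions.

```shell
# Sketch: serve gemma-4-31B-it on 2x H200, matching the TP=2 row above.
python3 -m sglang.launch_server \
  --model-path google/gemma-4-31B-it \
  --tp 2 \
  --host 0.0.0.0 --port 30000
```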
3.3 AMD GPU Deployment (MI300X / MI325X / MI355X)
SGLang automatically selects the correct attention backend on AMD GPUs. For the small E-models (gemma-4-E2B-it, gemma-4-E4B-it), disable AITER; the command line is otherwise identical:
Command
For gemma-4-31B-it and gemma-4-26B-A4B-it, the same commands above work on MI300X, MI325X, and MI355X without additional command-line changes.
Status: AMD benchmarks are available in Section 5.1.
4. Model Invocation
Deploy gemma-4-26B-A4B-it (MoE) with all features enabled:
Command
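As a sketch, a launch with both parsers named in this document enabled could look like this; verify the flags against your SGLang build.

```shell
# Sketch: gemma-4-26B-A4B-it with reasoning and tool calling enabled.
python3 -m sglang.launch_server \
  --model-path google/gemma-4-26B-A4B-it \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --host 0.0.0.0 --port 30000
```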
4.1 Basic Usage
Example
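A minimal request body for SGLang's OpenAI-compatible /v1/chat/completions endpoint could look like the following sketch; the port and prompt are assumptions, not taken from the example block above.

```python
import json

# Sketch: a basic chat completion request for a running SGLang server.
payload = {
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
        {"role": "user", "content": "Summarize the Gemma 4 model family in two sentences."}
    ],
    "max_tokens": 256,
}
print(json.dumps(payload))
# POST this body to http://localhost:30000/v1/chat/completions,
# e.g. with curl or the openai Python client.
```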
4.2 Vision Input
Gemma 4 multimodal variants accept images alongside text:
Example
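The message layout for image input follows the OpenAI multimodal content-part format; a sketch (the image URL is a placeholder, and base64 data URLs also work):

```python
import json

# Sketch: one user turn combining an image and a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cat.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
print(json.dumps({"model": "google/gemma-4-E4B-it", "messages": messages}))
```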
4.3 Reasoning (Thinking Mode)
Gemma 4 supports hybrid reasoning. Thinking is not enabled by default; pass chat_template_kwargs: {"enable_thinking": true} via extra_body to activate it. The reasoning parser separates thinking from content, returning the thinking process via reasoning_content in the streaming response.
Example
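A sketch of enabling thinking per request through the openai Python client (requires `pip install openai` and a server launched with the gemma4 reasoning parser; base URL and port are assumptions):

```python
from openai import OpenAI

# Sketch: point the client at the local SGLang server.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.9?"}],
    # Thinking is off by default; enable it via extra_body.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
msg = resp.choices[0].message
print("thinking:", msg.reasoning_content)  # separated by the reasoning parser
print("answer:", msg.content)
```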
4.4 Tool Calling
Gemma 4 supports function calling with the gemma4 tool call parser. Enable it during deployment with --tool-call-parser gemma4.
Example
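Tools are declared in the OpenAI function-calling schema; a sketch, where get_weather is a hypothetical tool introduced only for illustration:

```python
import json

# Sketch: a tool schema to pass as `tools` to chat.completions.create
# against a server launched with --tool-call-parser gemma4.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
print(json.dumps(tools))
# Tool calls come back on resp.choices[0].message.tool_calls,
# with streaming supported per this doc.
```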
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: H200
- SGLang Version: gemma4 branch
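The command blocks below hold the exact invocations; as a rough sketch, the serving runs follow SGLang's bench_serving pattern against the launched server (lengths and prompt counts vary per run, and the flags here should be verified against your SGLang version):

```shell
# Sketch: serving benchmark against a running SGLang server.
python3 -m sglang.bench_serving --backend sglang \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 512 \
  --num-prompts 64
```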
gemma-4-E2B-it (1x H200, TP=1)
Server Launch Command:
Command
Command
Output
Command
Output
Command
Output
Command
Output
gemma-4-E4B-it (1x H200, TP=1)
Server Launch Command:
Command
Output
Output
Output
Output
gemma-4-31B-it (2x H200, TP=2)
Server Launch Command:
Command
Output
Output
Output
Output
gemma-4-26B-A4B-it (MoE, 1x H200, TP=1)
Server Launch Command:
Command
Tip: Consider --tp 2 for high-throughput workloads.
Latency Benchmark (Text)
Output
Output
Output
Output
gemma-4-31B-it (1x MI300X, TP=1)
Server Launch Command:
Command
Note: The 31B dense model fits on a single MI300X (192 GB VRAM) at TP=1, unlike H200 (141 GB), which requires TP=2.
Latency Benchmark (Text)
Command
Output
Command
Output
gemma-4-26B-A4B-it (MoE, 1x MI300X, TP=1)
Server Launch Command:
Command
Command
Output
Command
Output
5.2 Accuracy Benchmark
Test Environment:
- Hardware: H200
- SGLang Version: gemma4 branch
MMLU
| Model | Humanities | Social Sciences | STEM | Other | Overall |
|---|---|---|---|---|---|
| gemma-4-E2B-it | 0.621 | 0.739 | 0.830 | 0.736 | 0.720 |
| gemma-4-E4B-it | 0.703 | 0.862 | 0.902 | 0.825 | 0.810 |
| gemma-4-31B-it | 0.878 | 0.921 | 0.884 | 0.911 | 0.896 |
| gemma-4-26B-A4B-it | 0.853 | 0.906 | 0.938 | 0.886 | 0.891 |
GSM8K
| Model | Accuracy | Invalid | Latency (s) | Output Throughput (tok/s) |
|---|---|---|---|---|
| gemma-4-E2B-it | 0.170 | 0.000 | 3.990 | 8041.739 |
| gemma-4-E4B-it | 0.745 | 0.000 | 4.174 | 4672.030 |
| gemma-4-31B-it | 0.805 | 0.005 | 16.148 | 1559.914 |
| gemma-4-26B-A4B-it | 0.450 | 0.010 | 13.001 | 4089.457 |
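The GSM8K rows above can be reproduced against a running server with SGLang's few-shot GSM8K harness; a sketch (question count and parallelism are assumptions, and the module path should be verified for your version):

```shell
# Sketch: few-shot GSM8K accuracy run against the local SGLang server.
python3 -m sglang.test.few_shot_gsm8k --num-questions 1319 --parallel 128
```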
MMMU
| Model | Overall |
|---|---|
| gemma-4-E2B-it | 0.307 |
| gemma-4-E4B-it | 0.396 |
| gemma-4-31B-it | 0.589 |
| gemma-4-26B-A4B-it | 0.549 |
