1. Model Introduction
Gemma 4 is Google’s next-generation family of open models, building on the Gemma 3 architecture with improved performance, MoE variants, and multimodal support for text, vision, and audio. Key Features:- Hybrid Attention: Combines sliding window and full attention layers for efficient long-context processing
- Multimodal: Supports text, image, and audio inputs via dedicated vision and audio encoders
- MoE Variant: The 26B-A4B model uses a Mixture-of-Experts architecture for efficient inference
- Per-Layer Embeddings (PLE): Layer-specific token embeddings for enhanced representations
- Reasoning: Built-in thinking mode with
gemma4reasoning parser - Tool Calling: Function call support with streaming via
gemma4tool call parser - Fused Operations: Triton-optimized RMSNorm + residual + scalar kernels
| Model | Architecture | Parameters |
|---|---|---|
| google/gemma-4-E2B-it | Dense | ~2B |
| google/gemma-4-E4B-it | Dense | ~4B |
| google/gemma-4-12B-it | Dense | 12B |
| google/gemma-4-31B-it | Dense | 31B |
| google/gemma-4-26B-A4B-it | MoE | 26B total / 4B active |
2. SGLang Installation
Gemma 4 (including the encoder-free unified 12B, sgl-project/sglang#27167) is supported on SGLang main. Install it together with the matching transformers commit:Command
Docker (prebuilt dev image)
Prebuilt development images bundle SGLang together with the matching transformers commit preinstalled, so no manual install is needed. All tags are multi-arch (amd64 + arm64):
| Tag | CUDA | Hardware |
|---|---|---|
lmsysorg/sglang:dev-gemma-4-12B | 13.0 | Default — amd64 (H200 / B200) + arm64 (GB200 / GB300) |
lmsysorg/sglang:dev-cu13-gemma-4-12B | 13.0 | Alias of the default tag |
lmsysorg/sglang:dev-cu12-gemma-4-12B | 12.9 | CUDA 12.x hosts |
Command
3. Model Deployment
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and model variant.3.2 Configuration Tips
- SGLang automatically selects the Triton attention backend for Gemma 4 models (required for bidirectional image-token attention during prefill).
- Attention backend on Blackwell (B200/sm100): SGLang defaults to the
trtllm_mhabackend on sm100, which is fastest for text but applies causal attention to image tokens. For multimodal (image) workloads on B200, pass--attention-backend tritonto restore bidirectional image-token attention and full vision quality. Text-only and audio workloads are unaffected by the default. - For the 26B-A4B MoE model, consider
--tp 2for high-throughput workloads. - Speculative Decoding (MTP): Each Gemma 4 variant ships with a paired
*-assistantdraft model that enables NEXTN multi-token prediction. Enable it via the selector above, or pass--speculative-algorithm NEXTN --speculative-draft-model-path google/gemma-4-<variant>-it-assistant --speculative-num-steps 5 --speculative-num-draft-tokens 6 --speculative-eagle-topk 1. MTP can significantly reduce latency for interactive use cases. The 26B-A4B MoE model requires--tp 2when MTP is enabled. - Hardware requirements:
| Model | Hardware | TP |
|---|---|---|
| gemma-4-E2B-it | 1x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 1 |
| gemma-4-E4B-it | 1x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 1 |
| gemma-4-12B-it | 1x H200 / 1x B200 | 1 |
| gemma-4-31B-it | 2x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 2 (H200) / 1 (AMD) |
| gemma-4-26B-A4B-it | 1x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 1 |
3.3 AMD GPU Deployment (MI300X / MI325X / MI355X)
SGLang automatically selects the correct attention backend on AMD GPUs. For the small E-models (gemma-4-E2B-it, gemma-4-E4B-it), disable AITER on AMD GPUs and use the same command line otherwise:
Command
gemma-4-31B-it and gemma-4-26B-A4B-it, the same commands above work on MI300X, MI325X, and MI355X without additional command-line changes.
Status: AMD benchmarks are available in Section 5.1.
4. Model Invocation
Deploy gemma-4-26B-A4B-it (MoE) with all features enabled:Command
Speculative Decoding (MTP) Server Commands
Each Gemma 4 variant ships with a paired*-assistant draft model for NEXTN multi-token prediction. Use the commands below to enable MTP for the corresponding target model. These match the configuration generated when you toggle Speculative Decoding (MTP) → Enabled in the interactive selector.
Command
Command
Command
Command
Command
4.1 Basic Usage
Example
4.2 Vision Input
Gemma 4 multimodal variants accept images alongside text:Example
4.3 Reasoning (Thinking Mode)
Gemma 4 supports hybrid reasoning. Thinking is not enabled by default — passchat_template_kwargs: {"enable_thinking": true} via extra_body to activate it. The reasoning parser separates thinking and content, returning the thinking process via reasoning_content in the streaming response.
Example
4.4 Tool Calling
Gemma 4 supports function calling with thegemma4 tool call parser. Enable it during deployment with --tool-call-parser gemma4.
Example
4.5 Audio Input
The audio-capable Gemma 4 variants (gemma-4-E2B-it, gemma-4-E4B-it, gemma-4-12B-it) accept raw audio alongside text. Pass the waveform as a base64 audio_url data URI (16 kHz mono WAV works well):
Example
Prompt
5. Benchmark
5.1 Speed Benchmark
Test Environment:- Hardware: H200
- SGLang Version: gemma4 branch
gemma-4-E2B-it (1x H200, TP=1)
Server Launch Command:Command
Command
Output
Command
Output
Command
Output
Command
Output
gemma-4-E4B-it (1x H200, TP=1)
Server Launch Command:Command
Output
Output
Output
Output
gemma-4-31B-it (2x H200, TP=2)
Server Launch Command:Command
Output
Output
Output
Output
gemma-4-26B-A4B-it (MoE, 1x H200, TP=1)
Server Launch Command:Command
Tip: Consider --tp 2 for high-throughput workloads.
Latency Benchmark (Text)
Output
Output
Output
Output
gemma-4-31B-it (1x MI300X, TP=1)
Server Launch Command:Command
Note: The 31B dense model fits on a single MI300X (192 GB VRAM) at TP=1, unlike H200 (141 GB) which requires TP=2.Latency Benchmark (Text)
Command
Output
Command
Output
gemma-4-26B-A4B-it (MoE, 1x MI300X, TP=1)
Server Launch Command:Command
Command
Output
Command
Output
gemma-4-12B-it (1x H200, TP=1)
Server Launch Command:Command
Command
Output
Command
Output
Command
Output
Output
gemma-4-12B-it (1x B200, TP=1)
Server Launch Command:Command
Output
Output
Output
Output
Performance tuning: On B200, raising --scheduler-recv-interval to 16 lifted text throughput from 5497 to 5673 tok/s output (≈ +3%) at concurrency 100 with no accuracy change, by reducing the scheduler’s per-step Python overhead. It is a safe, low-risk knob for high-concurrency serving.
5.2 Accuracy Benchmark
Test Environment:- Hardware: H200
- SGLang Version: gemma4 branch
MMLU
| Model | Humanities | Social Sciences | STEM | Other | Overall |
|---|---|---|---|---|---|
| gemma-4-E2B-it | 0.621 | 0.739 | 0.830 | 0.736 | 0.720 |
| gemma-4-E4B-it | 0.703 | 0.862 | 0.902 | 0.825 | 0.810 |
| gemma-4-12B-it | 0.784 | 0.888 | 0.946 | 0.861 | 0.859 |
| gemma-4-31B-it | 0.878 | 0.921 | 0.884 | 0.911 | 0.896 |
| gemma-4-26B-A4B-it | 0.853 | 0.906 | 0.938 | 0.886 | 0.891 |
GSM8K
| Model | Accuracy | Invalid | Latency (s) | Output Throughput (tok/s) |
|---|---|---|---|---|
| gemma-4-E2B-it | 0.170 | 0.000 | 3.990 | 8041.739 |
| gemma-4-E4B-it | 0.745 | 0.000 | 4.174 | 4672.030 |
| gemma-4-12B-it | 0.431 | 0.052 | 55.105 | 6580.229 |
| gemma-4-31B-it | 0.805 | 0.005 | 16.148 | 1559.914 |
| gemma-4-26B-A4B-it | 0.450 | 0.010 | 13.001 | 4089.457 |
Note: These GSM8K numbers use the raw few-shot completion harness (sglang.test.few_shot_gsm8k).gemma-4-12B-itis reasoning-oriented and is under-elicited by raw few-shot prompting; with the chat template it scores 0.950 on the same 1319 GSM8K test questions (sglang.test.run_eval --eval-name gsm8k).
gemma-4-12B-it with sgl-eval
gemma-4-12B-it is reasoning-oriented and answers verbosely (step-by-step) rather than emitting a terse final line. Strict last-line Answer: $LETTER extraction (as in sglang.test.run_eval) therefore undercounts its correct answers. sgl-eval — sgl-project’s evaluation CLI, which uses robust answer extraction — gives a faithful score on the served model:
| Benchmark | Examples | Accuracy |
|---|---|---|
| MMLU | 2000 | 0.878 |
| GSM8K | 1319 | 0.960 |
--base-url points at your endpoint):
Command
MMMU
| Model | Overall |
|---|---|
| gemma-4-E2B-it | 0.307 |
| gemma-4-E4B-it | 0.396 |
| gemma-4-12B-it | 0.683 |
| gemma-4-31B-it | 0.589 |
| gemma-4-26B-A4B-it | 0.549 |
