
1. Model Introduction

Gemma 4 is Google’s next-generation family of open models, building on the Gemma 3 architecture with improved performance, MoE variants, and multimodal support for text, vision, and audio.
Key Features:
  • Hybrid Attention: Combines sliding window and full attention layers for efficient long-context processing
  • Multimodal: Supports text, image, and audio inputs via dedicated vision and audio encoders
  • MoE Variant: The 26B-A4B model uses a Mixture-of-Experts architecture for efficient inference
  • Per-Layer Embeddings (PLE): Layer-specific token embeddings for enhanced representations
  • Reasoning: Built-in thinking mode, served through the gemma4 reasoning parser
  • Tool Calling: Function-call support with streaming, served through the gemma4 tool call parser
  • Fused Operations: Triton-optimized RMSNorm + residual + scalar kernels
Available Models:
| Model | Architecture | Parameters |
| --- | --- | --- |
| google/gemma-4-E2B-it | Dense | ~2B |
| google/gemma-4-E4B-it | Dense | ~4B |
| google/gemma-4-12B-it | Dense | 12B |
| google/gemma-4-31B-it | Dense | 31B |
| google/gemma-4-26B-A4B-it | MoE | 26B total / 4B active |

2. SGLang Installation

Gemma 4 (including the encoder-free unified 12B, sgl-project/sglang#27167) is supported on SGLang main. Install it together with the matching transformers commit:
Command
# Install SGLang from main
pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python'

# Install transformers with Gemma 4 support (encoder-free unified family included)
pip install 'git+https://github.com/huggingface/transformers.git@1423d22f7a3b62e8c70ad67b58ec25cd9b675897'

Docker (prebuilt dev image)

Prebuilt development images bundle SGLang with the matching transformers commit, so no manual install is needed. All tags are multi-arch (amd64 + arm64):
| Tag | CUDA | Hardware |
| --- | --- | --- |
| lmsysorg/sglang:dev-gemma-4-12B | 13.0 | Default: amd64 (H200 / B200) + arm64 (GB200 / GB300) |
| lmsysorg/sglang:dev-cu13-gemma-4-12B | 13.0 | Alias of the default tag |
| lmsysorg/sglang:dev-cu12-gemma-4-12B | 12.9 | CUDA 12.x hosts |
Command
docker run --gpus all --ipc=host --shm-size 32g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 30000:30000 \
  lmsysorg/sglang:dev-gemma-4-12B \
  sglang serve --model-path google/gemma-4-12B-it \
    --reasoning-parser gemma4 --tool-call-parser gemma4 \
    --host 0.0.0.0 --port 30000
For other installation methods, please refer to the official SGLang installation guide.
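After the container (or a local sglang serve process) starts, loading the model can take a while. The sketch below polls the OpenAI-compatible /v1/models endpoint until the server answers. The helper names models_url and wait_until_ready, and the timeout values, are illustrative assumptions and not part of SGLang.

```python
import json
import time
import urllib.error
import urllib.request


def models_url(base_url: str) -> str:
    """Return the OpenAI-compatible model-listing URL for a server base URL."""
    return base_url.rstrip("/") + "/v1/models"


def wait_until_ready(base_url: str = "http://localhost:30000",
                     timeout_s: float = 600.0,
                     poll_s: float = 5.0) -> list[str]:
    """Poll /v1/models until the server responds, then return the served model IDs.

    timeout_s and poll_s are arbitrary defaults chosen for this sketch.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(models_url(base_url), timeout=10) as resp:
                payload = json.load(resp)
            return [model["id"] for model in payload.get("data", [])]
        except (urllib.error.URLError, ConnectionError, TimeoutError) as exc:
            # Connection refused or timed out: the server is probably still loading weights.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"server at {base_url} was not ready in time") from exc
            time.sleep(poll_s)
```

Calling wait_until_ready() before sending the first chat request avoids connection errors while the weights are still loading.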

3. Model Deployment

3.1 Basic Configuration

Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and model variant.

3.2 Configuration Tips

  • SGLang automatically selects the Triton attention backend for Gemma 4 models (required for bidirectional image-token attention during prefill).
  • Attention backend on Blackwell (B200/sm100): SGLang defaults to the trtllm_mha backend on sm100, which is fastest for text but applies causal attention to image tokens. For multimodal (image) workloads on B200, pass --attention-backend triton to restore bidirectional image-token attention and full vision quality. Text-only and audio workloads are unaffected by the default.
  • For the 26B-A4B MoE model, consider --tp 2 for high-throughput workloads.
  • Speculative Decoding (MTP): Each Gemma 4 variant ships with a paired *-assistant draft model that enables NEXTN multi-token prediction. Enable it via the selector above, or pass --speculative-algorithm NEXTN --speculative-draft-model-path google/gemma-4-<variant>-it-assistant --speculative-num-steps 5 --speculative-num-draft-tokens 6 --speculative-eagle-topk 1. MTP can significantly reduce latency for interactive use cases. The 26B-A4B MoE model requires --tp 2 when MTP is enabled.
  • Hardware requirements:
| Model | Hardware | TP |
| --- | --- | --- |
| gemma-4-E2B-it | 1x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 1 |
| gemma-4-E4B-it | 1x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 1 |
| gemma-4-12B-it | 1x H200 / 1x B200 | 1 |
| gemma-4-31B-it | 2x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 2 (H200) / 1 (AMD) |
| gemma-4-26B-A4B-it | 1x H200 / 1x MI300X / 1x MI325X / 1x MI355X | 1 |
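The tips above can be collected into a small command builder. This is a minimal sketch: build_serve_command and VARIANTS are hypothetical names introduced here, and the only rules encoded are the ones stated in this section (the *-assistant draft model naming, the NEXTN flags, and --tp 2 for 26B-A4B when MTP is enabled).

```python
import shlex

# Variant names as they appear in the model IDs listed in this guide.
VARIANTS = ("E2B", "E4B", "12B", "31B", "26B-A4B")


def build_serve_command(variant: str, tp: int = 1, mtp: bool = False) -> list[str]:
    """Assemble an `sglang serve` argument list for one Gemma 4 variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown Gemma 4 variant: {variant!r}")
    if mtp and variant == "26B-A4B" and tp < 2:
        # Stated requirement in the tips above.
        raise ValueError("the 26B-A4B MoE model needs --tp 2 when MTP is enabled")

    model = f"google/gemma-4-{variant}-it"
    cmd = ["sglang", "serve", "--model-path", model, "--tp", str(tp)]
    if mtp:
        cmd += [
            "--speculative-algorithm", "NEXTN",
            "--speculative-draft-model-path", f"{model}-assistant",
            "--speculative-num-steps", "5",
            "--speculative-num-draft-tokens", "6",
            "--speculative-eagle-topk", "1",
        ]
    return cmd


# Print a shell-ready command line for the MoE model with MTP.
print(shlex.join(build_serve_command("26B-A4B", tp=2, mtp=True)))
```

Returning a list instead of a string keeps the command safe to pass straight to subprocess without shell quoting concerns.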

3.3 AMD GPU Deployment (MI300X / MI325X / MI355X)

SGLang automatically selects the correct attention backend on AMD GPUs. For the small E-models (gemma-4-E2B-it, gemma-4-E4B-it), disable AITER by setting SGLANG_USE_AITER=0; the rest of the command line is unchanged:
Command
SGLANG_USE_AITER=0 sglang serve --model-path google/gemma-4-E4B-it \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --host 0.0.0.0 --port 30000
For gemma-4-31B-it and gemma-4-26B-A4B-it, the same commands above work on MI300X, MI325X, and MI355X without additional command-line changes.
Status: AMD benchmarks are available in Section 5.1.
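When the server is launched from a script, the AITER rule above can be applied per model. The following is a sketch under the assumption that only the two small E-models need SGLANG_USE_AITER=0, as described in this section; amd_launch_env and SMALL_E_MODELS are hypothetical names.

```python
import os

# Models for which this guide recommends disabling AITER on AMD GPUs.
SMALL_E_MODELS = ("gemma-4-E2B-it", "gemma-4-E4B-it")


def amd_launch_env(model_path: str, base_env=None) -> dict[str, str]:
    """Return a copy of the environment with SGLANG_USE_AITER=0 set for the small E-models."""
    env = dict(os.environ if base_env is None else base_env)
    model_name = model_path.split("/")[-1]
    if model_name in SMALL_E_MODELS:
        env["SGLANG_USE_AITER"] = "0"
    return env
```

The returned mapping can be passed as the env argument of subprocess.Popen alongside the sglang serve command.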

4. Model Invocation

Deploy gemma-4-26B-A4B-it (MoE) with the reasoning and tool-call parsers enabled:
Command
sglang serve --model-path google/gemma-4-26B-A4B-it \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --host 0.0.0.0 --port 30000

Speculative Decoding (MTP) Server Commands

Each Gemma 4 variant ships with a paired *-assistant draft model for NEXTN multi-token prediction. Use the commands below to enable MTP for the corresponding target model. These match the configuration generated when you toggle Speculative Decoding (MTP) → Enabled in the interactive selector.
Command
# Gemma 4 E2B + MTP
sglang serve \
  --model-path google/gemma-4-E2B-it \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path google/gemma-4-E2B-it-assistant \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --mem-fraction-static 0.85
Command
# Gemma 4 E4B + MTP
sglang serve \
  --model-path google/gemma-4-E4B-it \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path google/gemma-4-E4B-it-assistant \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --mem-fraction-static 0.85
Command
# Gemma 4 12B + MTP (~35% faster single-stream decode on H200)
sglang serve \
  --model-path google/gemma-4-12B-it \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path google/gemma-4-12B-it-assistant \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --mem-fraction-static 0.85
Command
# Gemma 4 31B + MTP
sglang serve \
  --model-path google/gemma-4-31B-it \
  --tp-size 2 \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path google/gemma-4-31B-it-assistant \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --mem-fraction-static 0.85
Command
# Gemma 4 26B-A4B + MTP
sglang serve \
  --model-path google/gemma-4-26B-A4B-it \
  --tp-size 2 \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path google/gemma-4-26B-A4B-it-assistant \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --mem-fraction-static 0.85

4.1 Basic Usage

Example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "What are the key differences between TCP and UDP?"}
    ],
    max_tokens=1024
)

print(response.choices[0].message.content)

4.2 Vision Input

Gemma 4 multimodal variants accept images alongside text:
Example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://farm4.staticflickr.com/3175/2653711032_804ff86d81_z.jpg"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe this image in detail."
                }
            ]
        }
    ],
    max_tokens=1024
)

print(response.choices[0].message.content)
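The example above fetches the image from a public URL. For a local file, one option is to inline it as a base64 data URI, mirroring the data-URI approach commonly used with OpenAI-compatible image_url parts. The helpers image_to_data_uri and image_message below are illustrative names, not part of any SDK.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path: str) -> str:
    """Encode a local image file as a data URI suitable for an image_url part."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(path: str, prompt: str) -> dict:
    """Build a user message with one local image followed by a text prompt."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_uri(path)}},
            {"type": "text", "text": prompt},
        ],
    }
```

The resulting dictionary can be placed directly in the messages list of client.chat.completions.create.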

4.3 Reasoning (Thinking Mode)

Gemma 4 supports hybrid reasoning. Thinking is not enabled by default — pass chat_template_kwargs: {"enable_thinking": true} via extra_body to activate it. The reasoning parser separates thinking and content, returning the thinking process via reasoning_content in the streaming response.
Example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "Solve step by step: If a train travels at 60 km/h for 2.5 hours, how far does it go?"}
    ],
    max_tokens=4096,
    stream=True,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}}
)

thinking_started = False
has_thinking = False
has_answer = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()
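If the application needs the thinking and the answer as two separate strings instead of printing them live, the same stream can be accumulated. This is a minimal sketch: split_reasoning is a hypothetical helper, and fake_chunk only imitates the shape of streamed chunks so the logic can be tried without a running server.

```python
from types import SimpleNamespace


def split_reasoning(chunks):
    """Return (reasoning_text, answer_text) accumulated from streamed chunks."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # reasoning_content is only populated when the reasoning parser is active.
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(**delta_fields):
    """Hypothetical stand-in for a streamed chunk, used only for this offline demo."""
    delta = SimpleNamespace(**delta_fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [
    fake_chunk(reasoning_content="60 km/h * 2.5 h", content=None),
    fake_chunk(reasoning_content=" = 150 km", content=None),
    fake_chunk(reasoning_content=None, content="The train travels 150 km."),
]
reasoning, answer = split_reasoning(demo)
print(reasoning)
print(answer)
```

With a live server, pass the response object returned by client.chat.completions.create(..., stream=True) in place of demo.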

4.4 Tool Calling

Gemma 4 supports function calling with the gemma4 tool call parser. Enable it during deployment with --tool-call-parser gemma4.
Example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ],
    tools=tools,
    stream=True
)

thinking_started = False
has_thinking = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            if has_thinking and thinking_started:
                print("\n=============== Tool Calls ================", flush=True)
                thinking_started = False
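            for tool_call in delta.tool_calls:
                if tool_call.function:
                    # The name usually arrives once; arguments may stream as JSON fragments
                    if tool_call.function.name:
                        print(f"Tool Call: {tool_call.function.name}")
                    if tool_call.function.arguments:
                        print(f"   Arguments: {tool_call.function.arguments}")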

        if delta.content:
            print(delta.content, end="", flush=True)

print()
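To execute a tool, the streamed fragments have to be merged into complete calls first. The sketch below assumes the server follows the OpenAI streaming convention, in which each tool-call delta carries an index, the name appears once, and the arguments arrive as pieces of a JSON string. merge_tool_calls and fake_chunk are hypothetical helpers written for this illustration.

```python
import json
from types import SimpleNamespace


def merge_tool_calls(chunks):
    """Merge streamed tool-call deltas into a list of complete calls with parsed arguments."""
    calls = {}
    for chunk in chunks:
        if not chunk.choices:
            continue
        deltas = getattr(chunk.choices[0].delta, "tool_calls", None) or []
        for tc in deltas:
            slot = calls.setdefault(tc.index, {"id": None, "name": None, "arguments": ""})
            if getattr(tc, "id", None):
                slot["id"] = tc.id
            if tc.function is not None:
                if tc.function.name:
                    slot["name"] = tc.function.name
                if tc.function.arguments:
                    # Argument text is concatenated in arrival order.
                    slot["arguments"] += tc.function.arguments
    return [
        {"id": c["id"], "name": c["name"], "arguments": json.loads(c["arguments"] or "{}")}
        for _, c in sorted(calls.items())
    ]


def fake_chunk(index, call_id=None, name=None, arguments=None):
    """Hypothetical stand-in for one streamed chunk carrying a single tool-call delta."""
    fn = SimpleNamespace(name=name, arguments=arguments)
    tc = SimpleNamespace(index=index, id=call_id, function=fn)
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(tool_calls=[tc]))])


demo = [
    fake_chunk(0, call_id="call_0", name="get_weather", arguments=""),
    fake_chunk(0, arguments='{"location": '),
    fake_chunk(0, arguments='"Tokyo"}'),
]
merged = merge_tool_calls(demo)
print(merged)
```

Once a call is complete, the usual OpenAI-style follow-up is to append the assistant message containing the tool call plus a message with role "tool" and the matching tool_call_id, then request another completion.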

4.5 Audio Input

The audio-capable Gemma 4 variants (gemma-4-E2B-it, gemma-4-E4B-it, gemma-4-12B-it) accept raw audio alongside text. Pass the waveform as a base64 audio_url data URI (16 kHz mono WAV works well):
Example
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="google/gemma-4-12B-it",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
                {"type": "text", "text": "Transcribe the speech in this audio exactly."},
            ],
        }
    ],
    max_tokens=256,
    temperature=0,
)

print(response.choices[0].message.content)
For best ASR quality, use the recommended transcription prompt structure:
Prompt
Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
For speech translation (AST), ask for the transcription in the source language first, then the translation: “Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}. …”
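The recommended ASR prompt can be filled in programmatically and paired with the audio part. This is a small sketch: ASR_TEMPLATE reproduces the transcription prompt text from this section, while build_asr_prompt and audio_message are hypothetical helper names.

```python
ASR_TEMPLATE = (
    "Transcribe the following speech segment in {language} into {language} text.\n"
    "\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not "
    "one point seven, and write 3 instead of three."
)


def build_asr_prompt(language: str) -> str:
    """Fill the recommended transcription prompt for one language."""
    return ASR_TEMPLATE.format(language=language)


def audio_message(audio_b64: str, language: str) -> dict:
    """Build a user message with a base64 WAV clip followed by the ASR prompt."""
    return {
        "role": "user",
        "content": [
            {"type": "audio_url",
             "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
            {"type": "text", "text": build_asr_prompt(language)},
        ],
    }
```

The message produced by audio_message can replace the hand-written one in the example request for any of the audio-capable variants.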

5. Benchmark

5.1 Speed Benchmark

Test Environment:
  • Hardware: H200
  • SGLang Version: gemma4 branch

gemma-4-E2B-it (1x H200, TP=1)

Server Launch Command:
Command
sglang serve --model-path google/gemma-4-E2B-it
Latency Benchmark (Text)
Command
python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 10 --max-concurrency 1
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  17.44
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.57
Output token throughput (tok/s):         242.03
Total token throughput (tok/s):          591.94
Mean TTFT (ms):                          50.19
Median TTFT (ms):                        54.22
Mean TPOT (ms):                          3.99
Median ITL (ms):                         4.05
==================================================
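The throughput figures in these reports follow from the token counts and the benchmark duration, which makes a quick sanity check possible. The sketch below parses the numeric lines of a report and recomputes output throughput. parse_bench_output is a hypothetical helper, and the sample text is copied from the result block above.

```python
import re


def parse_bench_output(text: str) -> dict[str, float]:
    """Extract 'label: number' lines from a bench_serving report into a dict."""
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"^(.+?):\s+(-?\d+(?:\.\d+)?)\s*$", line)
        if match:  # non-numeric rows such as 'Backend: sglang' are skipped
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


sample = """\
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     10
Benchmark duration (s):                  17.44
Total input tokens:                      6101
Total generated tokens:                  4220
Output token throughput (tok/s):         242.03
Total token throughput (tok/s):          591.94
Mean TPOT (ms):                          3.99
"""

m = parse_bench_output(sample)
# Recompute throughput from the raw counts; small gaps come from the rounded duration.
derived_output_tps = m["Total generated tokens"] / m["Benchmark duration (s)"]
derived_total_tps = (
    m["Total input tokens"] + m["Total generated tokens"]
) / m["Benchmark duration (s)"]
print(round(derived_output_tps, 2), m["Output token throughput (tok/s)"])
print(round(derived_total_tps, 2), m["Total token throughput (tok/s)"])
```

Both recomputed values land within a fraction of a token per second of the reported ones, which is the level of agreement to expect given that the duration is printed with two decimals.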
Latency Benchmark (Image)
Command
python3 -m sglang.bench_serving --backend sglang-oai-chat \
  --host 0.0.0.0 --port 30000 \
  --dataset-name image --image-count 2 --image-resolution 720p \
  --random-input-len 128 --random-output-len 1024 \
  --num-prompts 10 --max-concurrency 1
Output
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  18.05
Total input tokens:                      6097
Total input vision tokens:               5340
Total generated tokens:                  4220
Request throughput (req/s):              0.55
Output token throughput (tok/s):         233.84
Total token throughput (tok/s):          571.69
Mean TTFT (ms):                          109.59
Median TTFT (ms):                        112.62
Mean TPOT (ms):                          4.01
Median ITL (ms):                         4.04
==================================================
Throughput Benchmark (Text)
Command
python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 1000 --max-concurrency 100
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  51.73
Total input tokens:                      512842
Total generated tokens:                  510855
Request throughput (req/s):              19.33
Output token throughput (tok/s):         9876.36
Peak output token throughput (tok/s):    13863.00
Total token throughput (tok/s):          19791.14
Mean TTFT (ms):                          86.57
Mean TPOT (ms):                          9.56
Median ITL (ms):                         5.99
==================================================
Throughput Benchmark (Image)
Command
python3 -m sglang.bench_serving --backend sglang-oai-chat \
  --host 0.0.0.0 --port 30000 \
  --dataset-name image --image-count 2 --image-resolution 720p \
  --random-input-len 128 --random-output-len 1024 \
  --num-prompts 1000 --max-concurrency 100
Output
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  89.07
Total input tokens:                      617799
Total input vision tokens:               534000
Total generated tokens:                  510855
Request throughput (req/s):              11.23
Output token throughput (tok/s):         5735.75
Peak output token throughput (tok/s):    12823.00
Total token throughput (tok/s):          12672.23
Mean TTFT (ms):                          636.46
Mean TPOT (ms):                          16.34
Median ITL (ms):                         5.68
==================================================

gemma-4-E4B-it (1x H200, TP=1)

Server Launch Command:
Command
sglang serve --model-path google/gemma-4-E4B-it
Latency Benchmark (Text)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  24.49
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.41
Output token throughput (tok/s):         172.32
Total token throughput (tok/s):          421.45
Mean TTFT (ms):                          52.76
Median TTFT (ms):                        53.66
Mean TPOT (ms):                          5.64
Median ITL (ms):                         5.74
==================================================
Latency Benchmark (Image)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  25.04
Total input tokens:                      6124
Total input vision tokens:               5340
Total generated tokens:                  4220
Request throughput (req/s):              0.40
Output token throughput (tok/s):         168.54
Total token throughput (tok/s):          413.13
Mean TTFT (ms):                          110.15
Median TTFT (ms):                        108.24
Mean TPOT (ms):                          5.66
Median ITL (ms):                         5.73
==================================================
Throughput Benchmark (Text)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  72.95
Total input tokens:                      512842
Total generated tokens:                  510855
Request throughput (req/s):              13.71
Output token throughput (tok/s):         7002.68
Peak output token throughput (tok/s):    9878.00
Total token throughput (tok/s):          14032.60
Mean TTFT (ms):                          166.33
Mean TPOT (ms):                          13.36
Median ITL (ms):                         8.88
==================================================
Throughput Benchmark (Image)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  108.99
Total input tokens:                      616952
Total input vision tokens:               534000
Total generated tokens:                  510855
Request throughput (req/s):              9.18
Output token throughput (tok/s):         4687.38
Peak output token throughput (tok/s):    9277.00
Total token throughput (tok/s):          10348.25
Mean TTFT (ms):                          626.17
Mean TPOT (ms):                          20.00
Median ITL (ms):                         8.64
==================================================

gemma-4-31B-it (2x H200, TP=2)

Server Launch Command:
Command
sglang serve --model-path google/gemma-4-31B-it --tp 2
Latency Benchmark (Text)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  53.05
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.19
Output token throughput (tok/s):         79.55
Total token throughput (tok/s):          194.55
Mean TTFT (ms):                          72.77
Median TTFT (ms):                        75.05
Mean TPOT (ms):                          12.32
Median ITL (ms):                         12.53
==================================================
Latency Benchmark (Image)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  53.78
Total input tokens:                      6162
Total input vision tokens:               5340
Total generated tokens:                  4220
Request throughput (req/s):              0.19
Output token throughput (tok/s):         78.46
Total token throughput (tok/s):          193.03
Mean TTFT (ms):                          143.35
Median TTFT (ms):                        146.85
Mean TPOT (ms):                          12.37
Median ITL (ms):                         12.48
==================================================
Throughput Benchmark (Text)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  182.00
Total input tokens:                      512842
Total generated tokens:                  510855
Request throughput (req/s):              5.49
Output token throughput (tok/s):         2806.82
Peak output token throughput (tok/s):    3798.00
Total token throughput (tok/s):          5624.56
Mean TTFT (ms):                          324.67
Mean TPOT (ms):                          33.95
Median ITL (ms):                         25.44
==================================================
Throughput Benchmark (Image)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  236.46
Total input tokens:                      621630
Total input vision tokens:               534000
Total generated tokens:                  510855
Request throughput (req/s):              4.23
Output token throughput (tok/s):         2160.42
Peak output token throughput (tok/s):    3745.00
Total token throughput (tok/s):          4789.30
Mean TTFT (ms):                          952.02
Mean TPOT (ms):                          44.17
Median ITL (ms):                         26.81
==================================================

gemma-4-26B-A4B-it (MoE, 1x H200, TP=1)

Server Launch Command:
Command
sglang serve --model-path google/gemma-4-26B-A4B-it
Tip: Consider --tp 2 for high-throughput workloads.
Latency Benchmark (Text)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  25.00
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.40
Output token throughput (tok/s):         168.81
Total token throughput (tok/s):          412.85
Mean TTFT (ms):                          103.74
Median TTFT (ms):                        46.57
Mean TPOT (ms):                          5.60
Median ITL (ms):                         5.78
==================================================
Latency Benchmark (Image)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  25.31
Total input tokens:                      6164
Total input vision tokens:               5340
Total generated tokens:                  4220
Request throughput (req/s):              0.40
Output token throughput (tok/s):         166.70
Total token throughput (tok/s):          410.20
Mean TTFT (ms):                          129.22
Median TTFT (ms):                        132.54
Mean TPOT (ms):                          5.68
Median ITL (ms):                         5.75
==================================================
Throughput Benchmark (Text)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  138.98
Total input tokens:                      512842
Total generated tokens:                  510855
Request throughput (req/s):              7.20
Output token throughput (tok/s):         3675.81
Peak output token throughput (tok/s):    4799.00
Total token throughput (tok/s):          7365.91
Mean TTFT (ms):                          153.77
Mean TPOT (ms):                          25.95
Median ITL (ms):                         20.23
==================================================
Throughput Benchmark (Image)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  186.38
Total input tokens:                      621146
Total input vision tokens:               534000
Total generated tokens:                  510855
Request throughput (req/s):              5.37
Output token throughput (tok/s):         2740.86
Peak output token throughput (tok/s):    4962.00
Total token throughput (tok/s):          6073.47
Mean TTFT (ms):                          854.71
Mean TPOT (ms):                          34.64
Median ITL (ms):                         19.08
==================================================

gemma-4-31B-it (1x MI300X, TP=1)

Server Launch Command:
Command
sglang serve --model-path google/gemma-4-31B-it
Note: The 31B dense model fits on a single MI300X (192 GB VRAM) at TP=1, unlike the H200 (141 GB), which requires TP=2.
Latency Benchmark (Text)
Command
python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 10 --max-concurrency 1
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  103.55
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.10
Output token throughput (tok/s):         40.75
Total token throughput (tok/s):          99.67
Mean TTFT (ms):                          152.35
Median TTFT (ms):                        169.66
Mean TPOT (ms):                          24.13
Median ITL (ms):                         24.23
==================================================
Throughput Benchmark (Text)
Command
python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 1000 --max-concurrency 100
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  441.59
Total input tokens:                      512842
Total generated tokens:                  510855
Request throughput (req/s):              2.26
Output token throughput (tok/s):         1156.85
Peak output token throughput (tok/s):    1759.00
Total token throughput (tok/s):          2318.19
Mean TTFT (ms):                          819.22
Mean TPOT (ms):                          82.51
Median ITL (ms):                         63.45
==================================================

gemma-4-26B-A4B-it (MoE, 1x MI300X, TP=1)

Server Launch Command:
Command
sglang serve --model-path google/gemma-4-26B-A4B-it
Latency Benchmark (Text)
Command
python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 10 --max-concurrency 1
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  43.73
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.23
Output token throughput (tok/s):         96.49
Total token throughput (tok/s):          236.00
Mean TTFT (ms):                          185.58
Median TTFT (ms):                        90.18
Mean TPOT (ms):                          9.78
Median ITL (ms):                         9.57
==================================================
Throughput Benchmark (Text)
Command
python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 1000 --max-concurrency 100
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  219.43
Total input tokens:                      512842
Total generated tokens:                  510855
Request throughput (req/s):              4.56
Output token throughput (tok/s):         2328.05
Peak output token throughput (tok/s):    3500.00
Total token throughput (tok/s):          4665.16
Mean TTFT (ms):                          168.44
Mean TPOT (ms):                          41.23
Median ITL (ms):                         29.31
==================================================

gemma-4-12B-it (1x H200, TP=1)

Server Launch Command:
Command
sglang serve --model-path google/gemma-4-12B-it
Latency Benchmark (Text)
Command
python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 10 --max-concurrency 1
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  38.66
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.26
Output token throughput (tok/s):         109.15
Total token throughput (tok/s):          266.94
Mean TTFT (ms):                          33.08
Median TTFT (ms):                        33.71
Mean TPOT (ms):                          9.02
Median ITL (ms):                         9.19
==================================================
Latency Benchmark (Image)
Command
python3 -m sglang.bench_serving --backend sglang-oai-chat \
  --host 0.0.0.0 --port 30000 \
  --dataset-name image --image-count 2 --image-resolution 720p \
  --random-input-len 128 --random-output-len 1024 \
  --num-prompts 10 --max-concurrency 1
Output
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  39.36
Total input vision tokens:               5320
Total generated tokens:                  4220
Request throughput (req/s):              0.25
Output token throughput (tok/s):         107.23
Total token throughput (tok/s):          263.62
Mean TTFT (ms):                          94.98
Median TTFT (ms):                        97.33
Mean TPOT (ms):                          9.08
Median ITL (ms):                         9.17
==================================================
Throughput Benchmark (Text)
Command
python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 1000 --max-concurrency 100
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  130.44
Total input tokens:                      512842
Total generated tokens:                  510855
Request throughput (req/s):              7.67
Output token throughput (tok/s):         3916.46
Total token throughput (tok/s):          7848.15
Mean TTFT (ms):                          207.49
Median TTFT (ms):                        76.95
Mean TPOT (ms):                          24.38
Median ITL (ms):                         17.89
==================================================
Throughput Benchmark (Image)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  147.57
Total input tokens:                      619609
Total input vision tokens:               532000
Total generated tokens:                  510855
Request throughput (req/s):              6.78
Output token throughput (tok/s):         3461.79
Total token throughput (tok/s):          7660.54
Mean TTFT (ms):                          438.40
Median TTFT (ms):                        129.83
Mean TPOT (ms):                          27.12
Median ITL (ms):                         19.16
==================================================

gemma-4-12B-it (1x B200, TP=1)

Server Launch Command:
Command
# Text/audio only: omit --attention-backend; the sm100 default (trtllm_mha) is fastest.
# Image workloads: pass --attention-backend triton (bidirectional image attention), as below.
sglang serve --model-path google/gemma-4-12B-it --attention-backend triton
Latency Benchmark (Text)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  30.46
Output token throughput (tok/s):         138.55
Total token throughput (tok/s):          338.85
Mean TTFT (ms):                          28.14
Median TTFT (ms):                        29.74
Mean TPOT (ms):                          7.08
Median ITL (ms):                         7.26
==================================================
Latency Benchmark (Image)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  31.43
Total input vision tokens:               5320
Total generated tokens:                  4220
Request throughput (req/s):              0.32
Output token throughput (tok/s):         134.26
Total token throughput (tok/s):          329.57
Mean TTFT (ms):                          115.51
Median TTFT (ms):                        74.27
Mean TPOT (ms):                          7.14
Median ITL (ms):                         7.24
==================================================
Throughput Benchmark (Text)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  92.94
Request throughput (req/s):              10.76
Output token throughput (tok/s):         5496.55
Total token throughput (tok/s):          11014.49
Mean TTFT (ms):                          120.89
Median TTFT (ms):                        45.00
Mean TPOT (ms):                          17.23
Median ITL (ms):                         14.30
==================================================
Throughput Benchmark (Image)
Output
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Max request concurrency:                 100
Successful requests:                     998
Benchmark duration (s):                  107.82
Total input tokens:                      617971
Total input vision tokens:               530936
Total generated tokens:                  508951
Request throughput (req/s):              9.26
Output token throughput (tok/s):         4720.29
Total token throughput (tok/s):          10451.68
Mean TTFT (ms):                          425.89
Median TTFT (ms):                        109.57
Mean TPOT (ms):                          19.45
Median ITL (ms):                         15.11
==================================================
Performance tuning: On B200, raising --scheduler-recv-interval to 16 increased text output throughput from 5497 to 5673 tok/s (about +3%) at concurrency 100 with no accuracy change, by reducing the scheduler's per-step Python overhead. It is a low-risk setting to try for high-concurrency serving.
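The quoted gain can be reproduced from the two throughput figures. This is a minimal sketch of the arithmetic only; `relative_gain` is an illustrative helper, and the two numbers are the ones stated in the tuning note.

```python
def relative_gain(baseline, tuned):
    """Fractional change of `tuned` relative to `baseline`."""
    return (tuned - baseline) / baseline

baseline_tok_s = 5497  # B200 text output throughput before tuning (from the note)
tuned_tok_s = 5673     # with --scheduler-recv-interval 16 (from the note)

gain = relative_gain(baseline_tok_s, tuned_tok_s)
print(f"output throughput gain: {gain:.1%}")
```

The same helper can be used to compare any two runs, provided both use the same dataset, concurrency, and hardware.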

5.2 Accuracy Benchmark

Test Environment:
  • Hardware: H200
  • SGLang Version: gemma4 branch

MMLU

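| Model | Humanities | Social Sciences | STEM | Other | Overall |
|---|---|---|---|---|---|
| gemma-4-E2B-it | 0.621 | 0.739 | 0.830 | 0.736 | 0.720 |
| gemma-4-E4B-it | 0.703 | 0.862 | 0.902 | 0.825 | 0.810 |
| gemma-4-12B-it | 0.784 | 0.888 | 0.946 | 0.861 | 0.859 |
| gemma-4-31B-it | 0.878 | 0.921 | 0.884 | 0.911 | 0.896 |
| gemma-4-26B-A4B-it | 0.853 | 0.906 | 0.938 | 0.886 | 0.891 |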

GSM8K

| Model | Accuracy | Invalid | Latency (s) | Output Throughput (tok/s) |
|---|---|---|---|---|
| gemma-4-E2B-it | 0.170 | 0.000 | 3.990 | 8041.739 |
| gemma-4-E4B-it | 0.745 | 0.000 | 4.174 | 4672.030 |
| gemma-4-12B-it | 0.431 | 0.052 | 55.105 | 6580.229 |
| gemma-4-31B-it | 0.805 | 0.005 | 16.148 | 1559.914 |
| gemma-4-26B-A4B-it | 0.450 | 0.010 | 13.001 | 4089.457 |
Note: These GSM8K numbers use the raw few-shot completion harness (sglang.test.few_shot_gsm8k). gemma-4-12B-it is reasoning-oriented and is under-elicited by raw few-shot prompting; with the chat template it scores 0.950 on the same 1319 GSM8K test questions (sglang.test.run_eval --eval-name gsm8k).

gemma-4-12B-it with sgl-eval

gemma-4-12B-it is reasoning-oriented and answers verbosely (step by step) rather than emitting a terse final line. Strict last-line Answer: $LETTER extraction (as in sglang.test.run_eval) therefore undercounts its correct answers. sgl-eval, sgl-project's evaluation CLI, uses more robust answer extraction and gives a more faithful score on the served model:
| Benchmark | Examples | Accuracy |
|---|---|---|
| MMLU | 2000 | 0.878 |
| GSM8K | 1319 | 0.960 |
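To illustrate why strict last-line matching undercounts a verbose model, the sketch below compares a strict extractor with a more lenient one on a made-up response. Both functions are illustrative assumptions; they are not the implementations used by sglang.test.run_eval or sgl-eval, and the sample responses are invented for the example.

```python
import re

def strict_last_line(response):
    """Accept only a final line of the exact form 'Answer: X' (X in A-D)."""
    lines = response.strip().splitlines()
    if not lines:
        return None
    match = re.fullmatch(r"Answer:\s*([A-D])", lines[-1].strip())
    return match.group(1) if match else None

def lenient(response):
    """Take the last 'answer is X' or 'answer: X' mention anywhere in the text."""
    pattern = r"(?i)answer\s*(?:is|:)\s*\**\(?([A-D])\b"
    matches = re.findall(pattern, response)
    return matches[-1].upper() if matches else None

# Hypothetical step-by-step response whose final line is not 'Answer: X'.
verbose = (
    "Let's work through the options step by step.\n"
    "Option B matches the definition, so the answer is **B**.\n"
    "The other options fail for the reasons above."
)
# Hypothetical terse response in the strict format.
terse = "Answer: C"

for name, text in [("verbose", verbose), ("terse", terse)]:
    print(name, "strict:", strict_last_line(text), "lenient:", lenient(text))
```

The strict extractor returns nothing for the verbose response even though it states an answer, which is the kind of miss that lowers a strict-harness score. A lenient pattern has its own failure modes (for example, matching an answer the model later retracts), so extraction rules are a trade-off rather than a fixed recipe.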
Reproduce against a running server (--base-url points at your endpoint):
Command
pip install git+https://github.com/sgl-project/sgl-eval

# Sanity-check the endpoint
sgl-eval ping --base-url http://localhost:30000/v1

# Run the benchmarks (greedy, single-shot)
sgl-eval run gsm8k --base-url http://localhost:30000/v1
sgl-eval run mmlu  --base-url http://localhost:30000/v1 --num-examples 2000

MMMU

| Model | Overall |
|---|---|
| gemma-4-E2B-it | 0.307 |
| gemma-4-E4B-it | 0.396 |
| gemma-4-12B-it | 0.683 |
| gemma-4-31B-it | 0.589 |
| gemma-4-26B-A4B-it | 0.549 |