Gemma 4 - SGLang Documentation

1. Model Introduction

Gemma 4 is Google’s next-generation family of open models, building on the Gemma 3 architecture with improved performance, MoE variants, and multimodal support for text, vision, and audio.

Key Features:

Hybrid Attention: Combines sliding window and full attention layers for efficient long-context processing
Multimodal: Supports text, image, and audio inputs via dedicated vision and audio encoders
MoE Variant: The 26B-A4B model uses a Mixture-of-Experts architecture for efficient inference
Per-Layer Embeddings (PLE): Layer-specific token embeddings for enhanced representations
Reasoning: Built-in thinking mode with gemma4 reasoning parser
Tool Calling: Function call support with streaming via gemma4 tool call parser
Fused Operations: Triton-optimized RMSNorm + residual + scalar kernels

Available Models:

Model	Architecture	Parameters
google/gemma-4-E2B-it	Dense	~2B
google/gemma-4-E4B-it	Dense	~4B
google/gemma-4-12B-it	Dense	12B
google/gemma-4-31B-it	Dense	31B
google/gemma-4-26B-A4B-it	MoE	26B total / 4B active

2. SGLang Installation

Gemma 4 (including the encoder-free unified 12B, sgl-project/sglang#27167) is supported on SGLang main. Install it together with the matching transformers commit:

Command

# Install SGLang from main
pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python'

# Install transformers with Gemma 4 support (encoder-free unified family included)
pip install 'git+https://github.com/huggingface/transformers.git@1423d22f7a3b62e8c70ad67b58ec25cd9b675897'

Docker

lmsysorg/sglang:latest (CUDA 13.0, multi-arch amd64 + arm64) runs on both Hopper (H200) and Blackwell (B200 / GB200 / GB300):

Command

docker run --gpus all --ipc=host --shm-size 32g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 30000:30000 \
  lmsysorg/sglang:latest \
  sglang serve --model-path google/gemma-4-12B-it \
    --reasoning-parser gemma4 --tool-call-parser gemma4 \
    --host 0.0.0.0 --port 30000

For other installation methods, please refer to the official SGLang installation guide.
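
Model loading can take a while after the container starts, so a client script may want to wait before sending requests. The sketch below polls the server until it answers. It assumes the server exposes a GET /health route on the port published above; `http_probe` and `wait_until_ready` are hypothetical helper names, not part of SGLang.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or non-2xx status: not ready yet.
        return False


def wait_until_ready(
    base_url: str = "http://localhost:30000",
    path: str = "/health",  # assumption: health route exposed by the server
    attempts: int = 60,
    delay_s: float = 5.0,
    probe: Callable[[str], bool] = http_probe,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `base_url + path` until the probe succeeds or attempts run out."""
    url = base_url.rstrip("/") + path
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Calling `wait_until_ready()` right after `docker run` blocks for up to five minutes with the defaults shown. The `probe` and `sleep` parameters exist only so the loop can be exercised without a live server.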

3. Model Deployment

3.1 Basic Configuration

Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and model variant.

3.2 Configuration Tips

SGLang automatically selects the Triton attention backend for Gemma 4 models (required for bidirectional image-token attention during prefill). Blackwell is the exception, as described in the next item.
Attention backend on Blackwell (B200/sm100): SGLang defaults to the trtllm_mha backend on sm100, which is fastest for text but applies causal attention to image tokens. For multimodal (image) workloads on B200, pass --attention-backend triton to restore bidirectional image-token attention and full vision quality. Text-only and audio workloads are unaffected by the default.
Gemma 4 26B-A4B on B200: Use --mem-fraction-static 0.75 to leave workspace headroom for the Triton MoE path.
For the 26B-A4B MoE model, consider --tp 2 for high-throughput workloads.
Speculative Decoding (MTP): Each Gemma 4 variant ships with a paired *-assistant draft model that enables NEXTN multi-token prediction. Enable it via the selector above, or pass --speculative-algorithm NEXTN --speculative-draft-model-path google/gemma-4-<variant>-it-assistant --speculative-num-steps 5 --speculative-num-draft-tokens 6 --speculative-eagle-topk 1. MTP can significantly reduce latency for interactive use cases. The 26B-A4B MoE model requires --tp 2 when MTP is enabled.
QAT checkpoints: Toggle Checkpoint → QAT in the selector to target the qat-q4_0-unquantized releases. These keep bf16 weights, so memory and TP requirements match the standard checkpoints, and each has a matching *-qat-q4_0-unquantized-assistant draft model for MTP.
Hardware requirements:

Model	Hardware	TP
gemma-4-E2B-it	1x H200 / 1x B200 / 1x B300 / 1x MI300X / 1x MI325X / 1x MI355X	1
gemma-4-E4B-it	1x H200 / 1x B200 / 1x B300 / 1x MI300X / 1x MI325X / 1x MI355X	1
gemma-4-12B-it	1x H200 / 1x B200 / 1x B300	1
gemma-4-31B-it	2x H200 / 1x B200 / 1x B300 / 1x MI300X / 1x MI325X / 1x MI355X	2 (H200) / 1 (B200/B300/AMD)
gemma-4-26B-A4B-it	1x H200 / 1x B200 / 1x B300 / 1x MI300X / 1x MI325X / 1x MI355X	1
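
The tips and the hardware table above can be folded into a small command builder. This is a minimal sketch, not the interactive generator: `build_serve_command` is a hypothetical helper, the TP values are copied from the table, and the two B200-specific flags come from the configuration tips. It covers only the model and GPU combinations the table lists.

```python
import shlex

NVIDIA_GPUS = {"H200", "B200", "B300"}
AMD_GPUS = {"MI300X", "MI325X", "MI355X"}
MODELS = {
    "gemma-4-E2B-it",
    "gemma-4-E4B-it",
    "gemma-4-12B-it",
    "gemma-4-31B-it",
    "gemma-4-26B-A4B-it",
}
# The table lists the 12B model on NVIDIA GPUs only.
NVIDIA_ONLY = {"gemma-4-12B-it"}
# Every listed (model, GPU) pair uses TP=1 except the 31B model on H200.
TP_OVERRIDES = {("gemma-4-31B-it", "H200"): 2}


def build_serve_command(model: str, gpu: str, image_workload: bool = False) -> str:
    """Return an `sglang serve` command line for one table entry."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if gpu not in NVIDIA_GPUS | AMD_GPUS:
        raise ValueError(f"unknown GPU: {gpu}")
    if gpu in AMD_GPUS and model in NVIDIA_ONLY:
        raise ValueError(f"{model} is not listed for {gpu}")

    args = ["sglang", "serve", "--model-path", f"google/{model}"]
    tp = TP_OVERRIDES.get((model, gpu), 1)
    if tp > 1:
        args += ["--tp", str(tp)]
    if gpu == "B200" and image_workload:
        # Tip: restore bidirectional image-token attention on B200.
        args += ["--attention-backend", "triton"]
    if gpu == "B200" and model == "gemma-4-26B-A4B-it":
        # Tip: leave workspace headroom for the Triton MoE path.
        args += ["--mem-fraction-static", "0.75"]
    args += [
        "--reasoning-parser", "gemma4",
        "--tool-call-parser", "gemma4",
        "--host", "0.0.0.0",
        "--port", "30000",
    ]
    return shlex.join(args)


print(build_serve_command("gemma-4-31B-it", "H200"))
```

The B200 rules are applied to B200 only because that is the one Blackwell part the tips name explicitly.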

3.3 AMD GPU Deployment (MI300X / MI325X / MI355X)

SGLang automatically selects the correct attention backend on AMD GPUs. For the small E-models (gemma-4-E2B-it, gemma-4-E4B-it), disable AITER by setting SGLANG_USE_AITER=0; the rest of the command line is unchanged:

Command

SGLANG_USE_AITER=0 sglang serve --model-path google/gemma-4-E4B-it \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --host 0.0.0.0 --port 30000

For gemma-4-31B-it and gemma-4-26B-A4B-it, the standard sglang serve commands work on MI300X, MI325X, and MI355X without additional command-line changes.
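
When the server is launched from a Python script rather than a shell, the same rule can be expressed as an environment override. The sketch below is illustrative: `amd_serve_env` and `launch_on_amd` are hypothetical helpers, and the only fact taken from this section is that the E-models need SGLANG_USE_AITER=0 on AMD GPUs.

```python
import os
import subprocess
from typing import Dict, Optional

# Models for which this section says to disable AITER on AMD GPUs.
E_MODELS = {"google/gemma-4-E2B-it", "google/gemma-4-E4B-it"}


def amd_serve_env(model_path: str, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and disable AITER for the small E-models."""
    env = dict(os.environ if base_env is None else base_env)
    if model_path in E_MODELS:
        env["SGLANG_USE_AITER"] = "0"
    return env


def launch_on_amd(model_path: str) -> subprocess.Popen:
    """Start `sglang serve` with the environment adjusted for AMD GPUs."""
    cmd = [
        "sglang", "serve", "--model-path", model_path,
        "--reasoning-parser", "gemma4",
        "--tool-call-parser", "gemma4",
        "--host", "0.0.0.0", "--port", "30000",
    ]
    return subprocess.Popen(cmd, env=amd_serve_env(model_path))
```

Copying the environment instead of mutating `os.environ` keeps the override scoped to the server process.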

Status: AMD benchmarks are available in Section 5.1.

4. Model Invocation

Deploy gemma-4-26B-A4B-it (MoE) with all features enabled:

Command

sglang serve --model-path google/gemma-4-26B-A4B-it \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --host 0.0.0.0 --port 30000

Speculative Decoding (MTP) Server Commands

Each Gemma 4 variant ships with a paired *-assistant draft model for NEXTN multi-token prediction. Use the commands below to enable MTP for the corresponding target model. These match the configuration generated when you toggle Speculative Decoding (MTP) → Enabled in the interactive selector.

Command

# Gemma 4 E2B + MTP
sglang serve \
  --model-path google/gemma-4-E2B-it \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path google/gemma-4-E2B-it-assistant \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --mem-fraction-static 0.85

Command

# Gemma 4 E4B + MTP
sglang serve \
  --model-path google/gemma-4-E4B-it \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path google/gemma-4-E4B-it-assistant \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --mem-fraction-static 0.85

Command

# Gemma 4 12B + MTP (~35% faster single-stream decode on H200)
sglang serve \
  --model-path google/gemma-4-12B-it \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path google/gemma-4-12B-it-assistant \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --mem-fraction-static 0.85

Command

# Gemma 4 31B + MTP
sglang serve \
  --model-path google/gemma-4-31B-it \
  --tp-size 2 \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path google/gemma-4-31B-it-assistant \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --mem-fraction-static 0.85

Command

# Gemma 4 26B-A4B + MTP
sglang serve \
  --model-path google/gemma-4-26B-A4B-it \
  --tp-size 2 \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path google/gemma-4-26B-A4B-it-assistant \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --mem-fraction-static 0.85
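
The five commands above differ only in the variant name and in whether --tp-size 2 is present, so they can be generated from one template. The following sketch reproduces them; `build_mtp_command` is a hypothetical helper, and all flag values are copied from the commands above. It does not cover the QAT checkpoints.

```python
import shlex

VARIANTS = ("E2B", "E4B", "12B", "31B", "26B-A4B")
# Variants whose MTP command above passes --tp-size 2.
MTP_TP_SIZE = {"31B": 2, "26B-A4B": 2}


def build_mtp_command(variant: str, mem_fraction_static: float = 0.85) -> str:
    """Return the MTP launch command for one Gemma 4 variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    target = f"google/gemma-4-{variant}-it"
    args = ["sglang", "serve", "--model-path", target]
    tp_size = MTP_TP_SIZE.get(variant)
    if tp_size:
        args += ["--tp-size", str(tp_size)]
    args += [
        "--speculative-algorithm", "NEXTN",
        # The draft model is the target name plus the "-assistant" suffix.
        "--speculative-draft-model-path", f"{target}-assistant",
        "--speculative-num-steps", "5",
        "--speculative-num-draft-tokens", "6",
        "--speculative-eagle-topk", "1",
        "--mem-fraction-static", str(mem_fraction_static),
    ]
    return shlex.join(args)


for variant in VARIANTS:
    print(build_mtp_command(variant))
```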

4.1 Basic Usage

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "What are the key differences between TCP and UDP?"}
    ],
    max_tokens=1024
)

print(response.choices[0].message.content)

4.2 Vision Input

Gemma 4 multimodal variants accept images alongside text:

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://farm4.staticflickr.com/3175/2653711032_804ff86d81_z.jpg"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe this image in detail."
                }
            ]
        }
    ],
    max_tokens=1024
)

print(response.choices[0].message.content)
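
For images stored on disk, one option is to embed the file as a base64 data URI in the same image_url field. This is a sketch under the assumption that the server accepts data: URIs for images in the same way the audio example in Section 4.5 uses them for audio. `to_data_uri`, `image_part_from_file`, and `vision_messages` are hypothetical helper names.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(raw: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part_from_file(path: str) -> dict:
    """Build an image_url content part from a local image file."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    url = to_data_uri(Path(path).read_bytes(), mime_type)
    return {"type": "image_url", "image_url": {"url": url}}


def vision_messages(path: str, prompt: str) -> list:
    """Build a single-turn message list: the image first, then the text."""
    return [
        {
            "role": "user",
            "content": [
                image_part_from_file(path),
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

The result of `vision_messages("photo.jpg", "Describe this image in detail.")` can be passed as `messages` to `client.chat.completions.create`, where `photo.jpg` stands for any local image.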

4.3 Reasoning (Thinking Mode)

Gemma 4 supports hybrid reasoning. Thinking is disabled by default; to activate it, pass chat_template_kwargs: {"enable_thinking": true} via extra_body. The reasoning parser (enabled at launch with --reasoning-parser gemma4) separates the thinking from the final answer and returns the thinking process in the reasoning_content field of each streamed delta.

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "Solve step by step: If a train travels at 60 km/h for 2.5 hours, how far does it go?"}
    ],
    max_tokens=4096,
    stream=True,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}}
)

thinking_started = False
has_thinking = False
has_answer = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()

4.4 Tool Calling

Gemma 4 supports function calling with the gemma4 tool call parser. Enable it during deployment with --tool-call-parser gemma4.

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ],
    tools=tools,
    stream=True
)

thinking_started = False
tool_calls = {}  # tool call index -> accumulated name and argument text

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    # Print the thinking process, if the model emits any
    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
        print(delta.reasoning_content, end="", flush=True)

    # A streamed tool call may arrive in pieces: the name once, the arguments in fragments
    for tool_call in getattr(delta, "tool_calls", None) or []:
        entry = tool_calls.setdefault(tool_call.index, {"name": "", "arguments": ""})
        if tool_call.function:
            if tool_call.function.name:
                entry["name"] = tool_call.function.name
            if tool_call.function.arguments:
                entry["arguments"] += tool_call.function.arguments

    if delta.content:
        print(delta.content, end="", flush=True)

# Print each tool call once its fragments have been joined
if tool_calls:
    print("\n=============== Tool Calls ================", flush=True)
    for entry in tool_calls.values():
        print(f"Tool Call: {entry['name']}")
        print(f"   Arguments: {entry['arguments']}")

print()
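
Once the arguments of a tool call are available as one complete JSON string, the client still has to run the function and return the result to the model. The sketch below shows that step with the standard OpenAI chat message shapes. `get_weather` here is a stand-in that returns canned data, `run_tool_call` and `tool_result_messages` are hypothetical helpers, and how the gemma4 chat template renders tool messages is not covered by this page.

```python
import json


def get_weather(location: str, unit: str = "celsius") -> dict:
    """Stand-in implementation that returns canned data."""
    return {"location": location, "temperature": 22, "unit": unit}


# Maps tool names from the `tools` schema to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Parse the JSON arguments, run the tool, and return a JSON result."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"model requested an unknown tool: {name}")
    arguments = json.loads(arguments_json or "{}")
    return json.dumps(TOOL_REGISTRY[name](**arguments))


def tool_result_messages(call_id: str, name: str, arguments_json: str) -> list:
    """Build the assistant and tool messages to append before the next request."""
    return [
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": call_id,
                    "type": "function",
                    "function": {"name": name, "arguments": arguments_json},
                }
            ],
        },
        {
            "role": "tool",
            "tool_call_id": call_id,
            "content": run_tool_call(name, arguments_json),
        },
    ]


follow_up = tool_result_messages("call_0", "get_weather", '{"location": "Tokyo"}')
print(follow_up[1]["content"])
```

Appending `follow_up` to the original `messages` list and calling `client.chat.completions.create` again lets the model produce its final answer from the tool result. The call id `"call_0"` is a placeholder; in practice it comes from the tool call returned by the server.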

4.5 Audio Input

The audio-capable Gemma 4 variants (gemma-4-E2B-it, gemma-4-E4B-it, gemma-4-12B-it) accept raw audio alongside text. Pass the waveform as a base64 audio_url data URI (16 kHz mono WAV works well):

Example

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="google/gemma-4-12B-it",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
                {"type": "text", "text": "Transcribe the speech in this audio exactly."},
            ],
        }
    ],
    max_tokens=256,
    temperature=0,
)

print(response.choices[0].message.content)
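
Since 16 kHz mono WAV is the format suggested above, it can be worth checking a file before uploading it. This minimal sketch uses only the standard library `wave` module; `wav_format` and `is_16k_mono` are hypothetical helper names, and the check works for uncompressed PCM WAV files only.

```python
import wave
from typing import Tuple


def wav_format(path: str) -> Tuple[int, int, int]:
    """Return (channels, sample_rate_hz, bytes_per_sample) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getframerate(), wav.getsampwidth()


def is_16k_mono(path: str) -> bool:
    """Check whether the file is single-channel audio sampled at 16 kHz."""
    channels, sample_rate, _ = wav_format(path)
    return channels == 1 and sample_rate == 16000
```

If `is_16k_mono("sample.wav")` returns False, resample or downmix the file with an audio tool of your choice before encoding it.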

For best ASR quality, use the recommended transcription prompt structure:

Prompt

Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.

For speech translation (AST), ask for the transcription in the source language first, then the translation: “Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}. …”
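
The ASR prompt template above can be filled in programmatically and paired with the audio part of the request. The sketch below does this; `build_asr_prompt` and `asr_messages` are hypothetical helpers, and the template text is copied from the prompt shown above.

```python
ASR_TEMPLATE = (
    "Transcribe the following speech segment in {language} into {language} text.\n"
    "\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not "
    "one point seven, and write 3 instead of three."
)


def build_asr_prompt(language: str) -> str:
    """Fill the recommended transcription prompt for one language."""
    return ASR_TEMPLATE.format(language=language)


def asr_messages(audio_data_uri: str, language: str) -> list:
    """Build a single-turn request: the audio part first, then the prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "audio_url", "audio_url": {"url": audio_data_uri}},
                {"type": "text", "text": build_asr_prompt(language)},
            ],
        }
    ]


print(build_asr_prompt("English"))
```

Pass the `data:audio/wav;base64,...` string from the example above as `audio_data_uri`.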

5. Benchmark

5.1 Speed Benchmark

Test Environment:

Hardware: H200 (NVIDIA) and MI300X (AMD), as noted in each subsection heading
SGLang Version: gemma4 branch

gemma-4-E2B-it (1x H200, TP=1)

Server Launch Command:

Command

sglang serve --model-path google/gemma-4-E2B-it

Latency Benchmark (Text)

Command

python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 10 --max-concurrency 1

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  17.44
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.57
Output token throughput (tok/s):         242.03
Total token throughput (tok/s):          591.94
Mean TTFT (ms):                          50.19
Median TTFT (ms):                        54.22
Mean TPOT (ms):                          3.99
Median ITL (ms):                         4.05
==================================================
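
To compare runs, it can help to turn a result block like the one above into a dictionary. The sketch below parses the "name: value" lines and recomputes the output token throughput from the token count and the duration. `parse_bench_output` is a hypothetical helper, and the sample is an excerpt of the output above.

```python
import re

# Matches "Metric name (unit):    value"; banner lines starting with "=" are skipped.
LINE_RE = re.compile(r"^(?P<key>[^:=][^:]*):\s+(?P<value>\S+)\s*$")


def parse_bench_output(text: str) -> dict:
    """Parse a Serving Benchmark Result block into {metric name: value}."""
    metrics = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        key, raw = match.group("key").strip(), match.group("value")
        try:
            metrics[key] = float(raw)
        except ValueError:
            metrics[key] = raw  # non-numeric fields such as the backend name
    return metrics


sample = """\
============ Serving Benchmark Result ============
Backend:                                 sglang
Successful requests:                     10
Benchmark duration (s):                  17.44
Total generated tokens:                  4220
Output token throughput (tok/s):         242.03
Mean TPOT (ms):                          3.99
==================================================
"""

metrics = parse_bench_output(sample)
derived = metrics["Total generated tokens"] / metrics["Benchmark duration (s)"]
print(f"reported: {metrics['Output token throughput (tok/s)']} tok/s")
print(f"derived:  {derived:.2f} tok/s")
```

The derived value differs slightly from the reported one because the printed duration is rounded to two decimals.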

Latency Benchmark (Image)

Command

python3 -m sglang.bench_serving --backend sglang-oai-chat \
  --host 0.0.0.0 --port 30000 \
  --dataset-name image --image-count 2 --image-resolution 720p \
  --random-input-len 128 --random-output-len 1024 \
  --num-prompts 10 --max-concurrency 1

Output

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  18.05
Total input tokens:                      6097
Total input vision tokens:               5340
Total generated tokens:                  4220
Request throughput (req/s):              0.55
Output token throughput (tok/s):         233.84
Total token throughput (tok/s):          571.69
Mean TTFT (ms):                          109.59
Median TTFT (ms):                        112.62
Mean TPOT (ms):                          4.01
Median ITL (ms):                         4.04
==================================================

Throughput Benchmark (Text)

Command

python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 1000 --max-concurrency 100

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  51.73
Total input tokens:                      512842
Total generated tokens:                  510855
Request throughput (req/s):              19.33
Output token throughput (tok/s):         9876.36
Peak output token throughput (tok/s):    13863.00
Total token throughput (tok/s):          19791.14
Mean TTFT (ms):                          86.57
Mean TPOT (ms):                          9.56
Median ITL (ms):                         5.99
==================================================

Throughput Benchmark (Image)

Command

python3 -m sglang.bench_serving --backend sglang-oai-chat \
  --host 0.0.0.0 --port 30000 \
  --dataset-name image --image-count 2 --image-resolution 720p \
  --random-input-len 128 --random-output-len 1024 \
  --num-prompts 1000 --max-concurrency 100

Output

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  89.07
Total input tokens:                      617799
Total input vision tokens:               534000
Total generated tokens:                  510855
Request throughput (req/s):              11.23
Output token throughput (tok/s):         5735.75
Peak output token throughput (tok/s):    12823.00
Total token throughput (tok/s):          12672.23
Mean TTFT (ms):                          636.46
Mean TPOT (ms):                          16.34
Median ITL (ms):                         5.68
==================================================

gemma-4-E4B-it (1x H200, TP=1)

Server Launch Command:

Command

sglang serve --model-path google/gemma-4-E4B-it

The benchmark commands are the same as those shown for gemma-4-E2B-it above.

Latency Benchmark (Text)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  24.49
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.41
Output token throughput (tok/s):         172.32
Total token throughput (tok/s):          421.45
Mean TTFT (ms):                          52.76
Median TTFT (ms):                        53.66
Mean TPOT (ms):                          5.64
Median ITL (ms):                         5.74
==================================================

Latency Benchmark (Image)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  25.04
Total input tokens:                      6124
Total input vision tokens:               5340
Total generated tokens:                  4220
Request throughput (req/s):              0.40
Output token throughput (tok/s):         168.54
Total token throughput (tok/s):          413.13
Mean TTFT (ms):                          110.15
Median TTFT (ms):                        108.24
Mean TPOT (ms):                          5.66
Median ITL (ms):                         5.73
==================================================

Throughput Benchmark (Text)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  72.95
Total input tokens:                      512842
Total generated tokens:                  510855
Request throughput (req/s):              13.71
Output token throughput (tok/s):         7002.68
Peak output token throughput (tok/s):    9878.00
Total token throughput (tok/s):          14032.60
Mean TTFT (ms):                          166.33
Mean TPOT (ms):                          13.36
Median ITL (ms):                         8.88
==================================================

Throughput Benchmark (Image)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  108.99
Total input tokens:                      616952
Total input vision tokens:               534000
Total generated tokens:                  510855
Request throughput (req/s):              9.18
Output token throughput (tok/s):         4687.38
Peak output token throughput (tok/s):    9277.00
Total token throughput (tok/s):          10348.25
Mean TTFT (ms):                          626.17
Mean TPOT (ms):                          20.00
Median ITL (ms):                         8.64
==================================================

gemma-4-31B-it (2x H200, TP=2)

Server Launch Command:

Command

sglang serve --model-path google/gemma-4-31B-it --tp 2

The benchmark commands are the same as those shown for gemma-4-E2B-it above.

Latency Benchmark (Text)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  53.05
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.19
Output token throughput (tok/s):         79.55
Total token throughput (tok/s):          194.55
Mean TTFT (ms):                          72.77
Median TTFT (ms):                        75.05
Mean TPOT (ms):                          12.32
Median ITL (ms):                         12.53
==================================================

Latency Benchmark (Image)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  53.78
Total input tokens:                      6162
Total input vision tokens:               5340
Total generated tokens:                  4220
Request throughput (req/s):              0.19
Output token throughput (tok/s):         78.46
Total token throughput (tok/s):          193.03
Mean TTFT (ms):                          143.35
Median TTFT (ms):                        146.85
Mean TPOT (ms):                          12.37
Median ITL (ms):                         12.48
==================================================

Throughput Benchmark (Text)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  182.00
Total input tokens:                      512842
Total generated tokens:                  510855
Request throughput (req/s):              5.49
Output token throughput (tok/s):         2806.82
Peak output token throughput (tok/s):    3798.00
Total token throughput (tok/s):          5624.56
Mean TTFT (ms):                          324.67
Mean TPOT (ms):                          33.95
Median ITL (ms):                         25.44
==================================================

Throughput Benchmark (Image)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  236.46
Total input tokens:                      621630
Total input vision tokens:               534000
Total generated tokens:                  510855
Request throughput (req/s):              4.23
Output token throughput (tok/s):         2160.42
Peak output token throughput (tok/s):    3745.00
Total token throughput (tok/s):          4789.30
Mean TTFT (ms):                          952.02
Mean TPOT (ms):                          44.17
Median ITL (ms):                         26.81
==================================================

gemma-4-26B-A4B-it (MoE, 1x H200, TP=1)

Server Launch Command:

Command

sglang serve --model-path google/gemma-4-26B-A4B-it

Tip: Consider --tp 2 for high-throughput workloads.

The benchmark commands are the same as those shown for gemma-4-E2B-it above.

Latency Benchmark (Text)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  25.00
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.40
Output token throughput (tok/s):         168.81
Total token throughput (tok/s):          412.85
Mean TTFT (ms):                          103.74
Median TTFT (ms):                        46.57
Mean TPOT (ms):                          5.60
Median ITL (ms):                         5.78
==================================================

Latency Benchmark (Image)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  25.31
Total input tokens:                      6164
Total input vision tokens:               5340
Total generated tokens:                  4220
Request throughput (req/s):              0.40
Output token throughput (tok/s):         166.70
Total token throughput (tok/s):          410.20
Mean TTFT (ms):                          129.22
Median TTFT (ms):                        132.54
Mean TPOT (ms):                          5.68
Median ITL (ms):                         5.75
==================================================

Throughput Benchmark (Text)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  138.98
Total input tokens:                      512842
Total generated tokens:                  510855
Request throughput (req/s):              7.20
Output token throughput (tok/s):         3675.81
Peak output token throughput (tok/s):    4799.00
Total token throughput (tok/s):          7365.91
Mean TTFT (ms):                          153.77
Mean TPOT (ms):                          25.95
Median ITL (ms):                         20.23
==================================================

Throughput Benchmark (Image)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  186.38
Total input tokens:                      621146
Total input vision tokens:               534000
Total generated tokens:                  510855
Request throughput (req/s):              5.37
Output token throughput (tok/s):         2740.86
Peak output token throughput (tok/s):    4962.00
Total token throughput (tok/s):          6073.47
Mean TTFT (ms):                          854.71
Mean TPOT (ms):                          34.64
Median ITL (ms):                         19.08
==================================================

gemma-4-31B-it (1x MI300X, TP=1)

Server Launch Command:

Command

sglang serve --model-path google/gemma-4-31B-it

Note: The 31B dense model fits on a single MI300X (192 GB VRAM) at TP=1, unlike the H200 (141 GB), which requires TP=2.

Latency Benchmark (Text)

Command

python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 10 --max-concurrency 1

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  103.55
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.10
Output token throughput (tok/s):         40.75
Total token throughput (tok/s):          99.67
Mean TTFT (ms):                          152.35
Median TTFT (ms):                        169.66
Mean TPOT (ms):                          24.13
Median ITL (ms):                         24.23
==================================================

Throughput Benchmark (Text)

Command

python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 1000 --max-concurrency 100

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  441.59
Total input tokens:                      512842
Total generated tokens:                  510855
Request throughput (req/s):              2.26
Output token throughput (tok/s):         1156.85
Peak output token throughput (tok/s):    1759.00
Total token throughput (tok/s):          2318.19
Mean TTFT (ms):                          819.22
Mean TPOT (ms):                          82.51
Median ITL (ms):                         63.45
==================================================

gemma-4-26B-A4B-it (MoE, 1x MI300X, TP=1)

Server Launch Command:

Command

sglang serve --model-path google/gemma-4-26B-A4B-it

Latency Benchmark (Text)

Command

python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 10 --max-concurrency 1

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  43.73
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.23
Output token throughput (tok/s):         96.49
Total token throughput (tok/s):          236.00
Mean TTFT (ms):                          185.58
Median TTFT (ms):                        90.18
Mean TPOT (ms):                          9.78
Median ITL (ms):                         9.57
==================================================

Throughput Benchmark (Text)

Command

python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 1000 --max-concurrency 100

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  219.43
Total input tokens:                      512842
Total generated tokens:                  510855
Request throughput (req/s):              4.56
Output token throughput (tok/s):         2328.05
Peak output token throughput (tok/s):    3500.00
Total token throughput (tok/s):          4665.16
Mean TTFT (ms):                          168.44
Mean TPOT (ms):                          41.23
Median ITL (ms):                         29.31
==================================================

gemma-4-12B-it (1x H200, TP=1)

Server Launch Command:

Command

sglang serve --model-path google/gemma-4-12B-it

Latency Benchmark (Text)

Command

python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 10 --max-concurrency 1

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  38.66
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.26
Output token throughput (tok/s):         109.15
Total token throughput (tok/s):          266.94
Mean TTFT (ms):                          33.08
Median TTFT (ms):                        33.71
Mean TPOT (ms):                          9.02
Median ITL (ms):                         9.19
==================================================

Latency Benchmark (Image)

Command

python3 -m sglang.bench_serving --backend sglang-oai-chat \
  --host 0.0.0.0 --port 30000 \
  --dataset-name image --image-count 2 --image-resolution 720p \
  --random-input-len 128 --random-output-len 1024 \
  --num-prompts 10 --max-concurrency 1

Output

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  39.36
Total input vision tokens:               5320
Total generated tokens:                  4220
Request throughput (req/s):              0.25
Output token throughput (tok/s):         107.23
Total token throughput (tok/s):          263.62
Mean TTFT (ms):                          94.98
Median TTFT (ms):                        97.33
Mean TPOT (ms):                          9.08
Median ITL (ms):                         9.17
==================================================

Throughput Benchmark (Text)

Command

python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 1000 --max-concurrency 100

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  130.44
Total input tokens:                      512842
Total generated tokens:                  510855
Request throughput (req/s):              7.67
Output token throughput (tok/s):         3916.46
Total token throughput (tok/s):          7848.15
Mean TTFT (ms):                          207.49
Median TTFT (ms):                        76.95
Mean TPOT (ms):                          24.38
Median ITL (ms):                         17.89
==================================================

Throughput Benchmark (Image)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  147.57
Total input tokens:                      619609
Total input vision tokens:               532000
Total generated tokens:                  510855
Request throughput (req/s):              6.78
Output token throughput (tok/s):         3461.79
Total token throughput (tok/s):          7660.54
Mean TTFT (ms):                          438.40
Median TTFT (ms):                        129.83
Mean TPOT (ms):                          27.12
Median ITL (ms):                         19.16
==================================================

gemma-4-12B-it (1x B200, TP=1)

Server Launch Command:

Command

# Text/audio only: omit --attention-backend; the sm100 default (trtllm_mha) is fastest.
# Image workloads: keep --attention-backend triton (bidirectional image attention).
sglang serve --model-path google/gemma-4-12B-it --attention-backend triton

Latency Benchmark (Text)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  30.46
Output token throughput (tok/s):         138.55
Total token throughput (tok/s):          338.85
Mean TTFT (ms):                          28.14
Median TTFT (ms):                        29.74
Mean TPOT (ms):                          7.08
Median ITL (ms):                         7.26
==================================================

Latency Benchmark (Image)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  31.43
Total input vision tokens:               5320
Total generated tokens:                  4220
Request throughput (req/s):              0.32
Output token throughput (tok/s):         134.26
Total token throughput (tok/s):          329.57
Mean TTFT (ms):                          115.51
Median TTFT (ms):                        74.27
Mean TPOT (ms):                          7.14
Median ITL (ms):                         7.24
==================================================

Throughput Benchmark (Text)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  92.94
Request throughput (req/s):              10.76
Output token throughput (tok/s):         5496.55
Total token throughput (tok/s):          11014.49
Mean TTFT (ms):                          120.89
Median TTFT (ms):                        45.00
Mean TPOT (ms):                          17.23
Median ITL (ms):                         14.30
==================================================

Throughput Benchmark (Image)

Output

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Max request concurrency:                 100
Successful requests:                     998
Benchmark duration (s):                  107.82
Total input tokens:                      617971
Total input vision tokens:               530936
Total generated tokens:                  508951
Request throughput (req/s):              9.26
Output token throughput (tok/s):         4720.29
Total token throughput (tok/s):          10451.68
Mean TTFT (ms):                          425.89
Median TTFT (ms):                        109.57
Mean TPOT (ms):                          19.45
Median ITL (ms):                         15.11
==================================================

Performance tuning: On B200, raising --scheduler-recv-interval to 16 increased text output throughput from 5497 to 5673 tok/s (≈ +3%) at concurrency 100 with no accuracy change, by reducing the scheduler’s per-step Python overhead. It is a low-risk knob for high-concurrency serving.
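
As a quick check of the quoted gain, the relative improvement can be recomputed from the two throughput figures in the paragraph above. The function name `relative_gain` is an illustrative assumption.

```python
def relative_gain(before, after):
    """Fractional change from `before` to `after` (0.05 means +5%)."""
    return (after - before) / before


# Output throughput (tok/s) on B200 at concurrency 100, as quoted above:
# default setting vs. --scheduler-recv-interval 16.
gain = relative_gain(5497.0, 5673.0)
print(f"{gain:.1%}")  # prints 3.2%
```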

5.2 Accuracy Benchmark

Test Environment:

Hardware: H200
SGLang Version: gemma4 branch

MMLU

Model	Humanities	Social Sciences	STEM	Other	Overall
gemma-4-E2B-it	0.621	0.739	0.830	0.736	0.720
gemma-4-E4B-it	0.703	0.862	0.902	0.825	0.810
gemma-4-12B-it	0.784	0.888	0.946	0.861	0.859
gemma-4-31B-it	0.878	0.921	0.884	0.911	0.896
gemma-4-26B-A4B-it	0.853	0.906	0.938	0.886	0.891

GSM8K

Model	Accuracy	Invalid	Latency (s)	Output Throughput (tok/s)
gemma-4-E2B-it	0.170	0.000	3.990	8041.739
gemma-4-E4B-it	0.745	0.000	4.174	4672.030
gemma-4-12B-it	0.431	0.052	55.105	6580.229
gemma-4-31B-it	0.805	0.005	16.148	1559.914
gemma-4-26B-A4B-it	0.450	0.010	13.001	4089.457

Note: These GSM8K numbers use the raw few-shot completion harness (sglang.test.few_shot_gsm8k). gemma-4-12B-it is reasoning-oriented and is under-elicited by raw few-shot prompting; with the chat template it scores 0.950 on the same 1319 GSM8K test questions (sglang.test.run_eval --eval-name gsm8k).

gemma-4-12B-it with sgl-eval

gemma-4-12B-it is reasoning-oriented and answers verbosely (step by step) rather than emitting a terse final line, so strict last-line Answer: $LETTER extraction (as in sglang.test.run_eval) undercounts its correct answers. sgl-eval, sgl-project’s evaluation CLI, uses robust answer extraction and gives a faithful score on the served model:

Benchmark	Examples	Accuracy
MMLU	2000	0.878
GSM8K	1319	0.960
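
To illustrate why strict last-line extraction can undercount a verbose model, the sketch below compares a strict extractor with a more lenient one on two made-up multiple-choice responses. Both regular expressions and both function names are assumptions for illustration only; they are not the actual logic of sglang.test.run_eval or sgl-eval.

```python
import re

# Strict: the last non-empty line must be exactly "Answer: X".
STRICT = re.compile(r"^Answer:\s*([A-D])\s*$")
# Lenient (hypothetical): any "answer is X" / "Answer: X" / "answer (X)" mention.
LENIENT = re.compile(r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-D])\)?(?![A-Za-z])")


def extract_strict(response):
    """Return the letter only if the final non-empty line is 'Answer: X'."""
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    if not lines:
        return None
    match = STRICT.match(lines[-1])
    return match.group(1) if match else None


def extract_lenient(response):
    """Return the last answer-like mention found anywhere in the response."""
    found = LENIENT.findall(response)
    return found[-1] if found else None


# Made-up responses: one verbose, one terse.
verbose = (
    "Let's reason step by step.\n"
    "Option A is too small because it ignores the second term.\n"
    "So the correct answer is (C).\n"
    "This matches the definition given in the question."
)
terse = "Answer: B"

samples = [(verbose, "C"), (terse, "B")]
strict_acc = sum(extract_strict(r) == gold for r, gold in samples) / len(samples)
lenient_acc = sum(extract_lenient(r) == gold for r, gold in samples) / len(samples)
print(strict_acc, lenient_acc)  # prints 0.5 1.0
```

The verbose response is correct but ends with an explanatory sentence, so the strict extractor returns nothing and scores it as wrong. A lenient extractor can also produce false positives, which is why the patterns here should be read as a toy example rather than a recommended grader.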

Reproduce against a running server (--base-url points at your endpoint):

Command

pip install git+https://github.com/sgl-project/sgl-eval

# Sanity-check the endpoint
sgl-eval ping --base-url http://localhost:30000/v1

# Run the benchmarks (greedy, single-shot)
sgl-eval run gsm8k --base-url http://localhost:30000/v1
sgl-eval run mmlu  --base-url http://localhost:30000/v1 --num-examples 2000

MMMU

Model	Overall
gemma-4-E2B-it	0.307
gemma-4-E4B-it	0.396
gemma-4-12B-it	0.683
gemma-4-31B-it	0.589
gemma-4-26B-A4B-it	0.549
