DiffusionGemma - SGLang Documentation

1. Model Introduction

DiffusionGemma is a uniform-state (renoising) block-diffusion language model from Google. An encoder builds causal context, and a decoder denoises a fixed-length bidirectional canvas of canvas_length tokens. The Gemma4Renoise sampler runs max_denoising_steps reverse steps over the canvas, feeding the previous step’s logits back as self-conditioning and emitting the greedy argmax of the processed logits.

Key Features:

Uniform-State Renoising: The canvas starts from random tokens and is refined each step by accepting confident positions and re-noising the rest, with no mask token.
Encoder / Decoder Canvas: The encoder produces causal context KV, and the decoder attends bidirectionally over the canvas.
Self-Conditioning: Each step conditions on the previous step’s logits.
EntropyBound Acceptance: Each step accepts the lowest-entropy canvas positions within an entropy budget and re-noises the rest.
StableAndConfident Stopping: A canvas stops early once it is stable and confident.
MoE Architecture: The 26B-A4B model uses a Mixture-of-Experts architecture for efficient inference.
Multimodal Input: Accepts text and image inputs (via a ~550M vision encoder) and generates text output.
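
The accept / re-noise loop described under Uniform-State Renoising and EntropyBound Acceptance can be illustrated with a small, self-contained sketch. This is not the Gemma4Renoise implementation: the function names are hypothetical, and the budget rule (a running sum of per-position entropies) is one plausible reading of "within an entropy budget" that this page does not confirm.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_bound_step(canvas_probs, entropy_bound, vocab_size, rng):
    """One illustrative accept / re-noise step over a canvas.

    canvas_probs holds one probability distribution per canvas position.
    Returns (tokens, accepted); accepted[i] says whether position i was kept.
    """
    entropies = [entropy(p) for p in canvas_probs]
    # Visit positions from most to least confident (lowest entropy first).
    order = sorted(range(len(canvas_probs)), key=lambda i: entropies[i])

    accepted = [False] * len(canvas_probs)
    spent = 0.0
    for i in order:
        # ASSUMPTION: the budget is a running sum of per-position entropies.
        if spent + entropies[i] > entropy_bound:
            break
        spent += entropies[i]
        accepted[i] = True

    tokens = []
    for i, probs in enumerate(canvas_probs):
        if accepted[i]:
            # Accepted positions take the greedy argmax token.
            tokens.append(max(range(vocab_size), key=lambda t: probs[t]))
        else:
            # Uniform-state renoising: draw a random token, no mask token.
            tokens.append(rng.randrange(vocab_size))
    return tokens, accepted


rng = random.Random(1234)
canvas_probs = [
    [0.99, 0.005, 0.005],  # confident about token 0
    [0.34, 0.33, 0.33],    # uncertain
    [0.005, 0.99, 0.005],  # confident about token 1
]
tokens, accepted = entropy_bound_step(
    canvas_probs, entropy_bound=0.2, vocab_size=3, rng=rng
)
print(accepted, tokens)
```

In this toy run the two confident positions fit inside the budget and keep their argmax tokens, while the uncertain middle position is replaced with a random token and would be revisited on the next step.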

Available Models:

Model	Architecture	Parameters
google/diffusiongemma-26B-A4B-it	MoE, uniform-state diffusion (text + image)	25.2B total / 3.8B active

Architecture Specifications:

Spec	Value
Total Parameters	25.2B
Active Parameters	3.8B
Layers	30
Sliding Window	1024 tokens
Context Length	Up to 256K tokens
Canvas Length	256
Vocabulary Size	262K
Experts	8 active / 128 total + 1 shared
Supported Modalities	Text, Image
Vision Encoder	~550M parameters

License: Refer to the model card for license details.

2. SGLang Installation

Please refer to the official SGLang installation guide for installation instructions. The checkpoint ships its own modeling code, so --trust-remote-code is required when serving.

3. Model Deployment

3.1 Basic Configuration

Gemma4Renoise automatically applies the runtime settings it requires: the Triton attention backend, eager mode, and unchunked prefill. These are needed because the full-attention head_dim is 512 and the canvas uses bidirectional attention. A default launch therefore works:

Command

sglang serve \
  --model-path google/diffusiongemma-26B-A4B-it \
  --dllm-algorithm Gemma4Renoise \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000

3.2 Configuration Tips

dLLM-Specific Parameters:

Parameter	Description	Recommended Value
`--dllm-algorithm`	Diffusion decoding algorithm	`Gemma4Renoise`
`--trust-remote-code`	Required to load the checkpoint’s modeling code	Always enabled
`--dllm-algorithm-config`	Optional YAML overriding the renoise schedule	Checkpoint defaults
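
When the server is launched from a script, the flags in the table above can be assembled programmatically. The sketch below is only a convenience wrapper: `build_serve_command` is a hypothetical helper, and `renoise.yaml` is a placeholder file name, not a file shipped with the checkpoint.

```python
import shlex


def build_serve_command(model_path, config_path=None, host="0.0.0.0", port=30000):
    """Return the argv list for launching the server with Gemma4Renoise."""
    argv = [
        "sglang", "serve",
        "--model-path", model_path,
        "--dllm-algorithm", "Gemma4Renoise",
        "--trust-remote-code",
        "--host", host,
        "--port", str(port),
    ]
    # The YAML override is optional; omit the flag to use checkpoint defaults.
    if config_path is not None:
        argv += ["--dllm-algorithm-config", config_path]
    return argv


cmd = build_serve_command(
    "google/diffusiongemma-26B-A4B-it",
    config_path="renoise.yaml",  # hypothetical file name
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting issues.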

The attention backend, eager mode, and unchunked prefill are selected automatically for Gemma4Renoise, so they do not need to be passed on the command line.

Sampling is governed by the renoise schedule. Request-level logprobs, penalties, logit_bias, and grammar / structured output (json_schema / regex / ebnf / structural_tag) are not supported, and requests that set them are rejected with an HTTP 400. Core sampling controls (temperature, top_k, top_p) are accepted but have no effect. Streaming is block-level: one fully-denoised canvas per chunk.

Gemma4Renoise Config (defaults follow the checkpoint’s generation_config.json):

Config

# Number of reverse denoising steps per canvas.
max_denoising_steps: 48
# Optional. Makes the renoise sampling reproducible (also shared across TP ranks).
seed: 1234
sampler_config:
  # Entropy budget. Accept the lowest-entropy canvas positions within this bound each step (the rest are re-noised).
  entropy_bound: 0.1
# Linear temperature schedule applied over the denoising steps.
temperature_schedule:
  t_min: 0.4
  t_max: 0.8
# Stop early once the canvas is stable and confident.
stopping_config:
  confidence_threshold: 0.005
  stability_threshold: 1
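
To make the temperature_schedule entry concrete, the sketch below computes a per-step temperature by linear interpolation between t_max and t_min. The direction (starting at t_max and decaying to t_min) and the 0-based step indexing are assumptions, as this page only states that the schedule is linear. The function name is hypothetical.

```python
def linear_temperature(step, max_denoising_steps, t_min=0.4, t_max=0.8):
    """Temperature for a 0-based reverse step under a linear schedule.

    ASSUMPTION: the schedule starts at t_max and decays to t_min; the real
    sampler may run in the opposite direction.
    """
    if max_denoising_steps <= 1:
        return t_min
    frac = step / (max_denoising_steps - 1)
    return t_max + (t_min - t_max) * frac


schedule = [linear_temperature(s, 48) for s in range(48)]
print(round(schedule[0], 3), round(schedule[-1], 3))
```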

4. Model Invocation

4.1 Deployment

Start the server with the command from Section 3.1.

4.2 Basic Usage

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="google/diffusiongemma-26B-A4B-it",
    messages=[
        {"role": "user", "content": "What are the key differences between TCP and UDP?"}
    ],
    max_tokens=1024
)

print(response.choices[0].message.content)
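
Section 3.2 notes that some request fields are rejected with a 400 while others are accepted but ignored. A client-side pre-flight check along the following lines can surface both cases before a request is sent. The field names are assumptions based on OpenAI-style request bodies and the option names listed in Section 3.2. They are not an authoritative list of what the server checks, and `preflight` is a hypothetical helper.

```python
# ASSUMPTION: these field names approximate the options named in Section 3.2.
REJECTED_FIELDS = {
    "logprobs", "top_logprobs",
    "frequency_penalty", "presence_penalty", "repetition_penalty",
    "logit_bias",
    "response_format", "json_schema", "regex", "ebnf", "structural_tag",
}
IGNORED_FIELDS = {"temperature", "top_k", "top_p"}


def preflight(request_kwargs):
    """Split the fields that are set into (rejected, ignored) name lists."""
    # Treat None and False as "not set".
    active = {k for k, v in request_kwargs.items() if v is not None and v is not False}
    rejected = sorted(active & REJECTED_FIELDS)
    ignored = sorted(active & IGNORED_FIELDS)
    return rejected, ignored


rejected, ignored = preflight(
    {"max_tokens": 1024, "temperature": 0.7, "logit_bias": {"42": 5}}
)
print("would be rejected:", rejected)
print("accepted but ignored:", ignored)
```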

4.3 Streaming

Streaming emits one fully-denoised canvas per chunk.

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="google/diffusiongemma-26B-A4B-it",
    messages=[
        {"role": "user", "content": "Write a Python function to compute the Fibonacci sequence."}
    ],
    max_tokens=2048,
    stream=True
)

for chunk in response:
    # Some chunks (for example, a final usage chunk) may carry no choices.
    if chunk.choices:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)

print()
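
Because streaming is block-level, the number of content chunks is bounded by the number of canvases that fit in max_tokens. The arithmetic below assumes that each content chunk carries at most one canvas of 256 tokens (the Canvas Length in Section 1) and that generation may stop earlier. The helper name is hypothetical.

```python
import math

CANVAS_LENGTH = 256  # Canvas Length from the architecture table


def max_content_chunks(max_tokens, canvas_length=CANVAS_LENGTH):
    """Upper bound on streamed content chunks, one canvas per chunk."""
    return math.ceil(max_tokens / canvas_length)


print(max_content_chunks(2048))  # 8
print(max_content_chunks(1024))  # 4
```

Under these assumptions, the streaming example above receives at most 8 content chunks, so output arrives in a few large bursts rather than token by token.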

5. Benchmark

5.1 Speed Benchmark

Not benchmarked for speed.

5.2 Accuracy Benchmark

All benchmarks use the full test split, and every item is scored (no failed-request exclusions). Text MCQ benchmarks use greedy generate-and-parse; MATH uses boxed-answer extraction plus sympy equivalence. The MMLU, ARC-Challenge, and MATH-500 scores are each the mean of two independent server launches.

Benchmark	Score
GSM8K	95.4%
ARC-Challenge	91.6%
HumanEval	92.7% pass@1
MMLU	76.2%
MMLU-Pro	73.7%
GSM-Symbolic	92.2%
MATH-500	72.1%
AIME-2026	10.0%
HMMT-Feb-2025	10.0%
GPQA-main	59.2%
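
For readers unfamiliar with boxed-answer extraction, the sketch below pulls the contents of the last \boxed{...} out of a model response using brace matching. It is an illustration of the general technique, not the evaluation harness used for the scores above. The sympy equivalence step is omitted, and `extract_boxed` is a hypothetical name.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces


print(extract_boxed(r"So the answer is \boxed{\frac{1}{2}}."))
```

Counting brace depth rather than using a simple regular expression keeps nested LaTeX such as \frac{1}{2} intact.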

Multimodal benchmarks use the full standard split per task (MMMU / MMMU-Pro / MMStar / AI2D as multiple-choice, MathVista testmini, DocVQA by ANLS, ChartQA by relaxed accuracy):

Multimodal benchmark	Score
MMMU (val, MC)	64.9%
MMMU-Pro (standard 10-opt, MC)	57.3%
MathVista (testmini)	68.4%
DocVQA (val)	85.9%
ChartQA (test)	61.7%
AI2D (test)	78.7%
MMStar (val)	65.9%
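
The two non-multiple-choice metrics named above can be sketched as follows, using their commonly cited definitions: ANLS with a 0.5 threshold on normalized Levenshtein distance, and relaxed accuracy with a 5% relative tolerance on numeric answers. The exact normalization used for the reported scores is not described on this page, so the thresholds and function names here are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls_item(pred, golds, tau=0.5):
    """Per-question ANLS: best similarity over the gold answers, 0 past tau."""
    p = pred.strip().lower()
    best = 0.0
    for gold in golds:
        g = gold.strip().lower()
        denom = max(len(p), len(g))
        nl = levenshtein(p, g) / denom if denom else 0.0
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best


def relaxed_match(pred, gold, tol=0.05):
    """Numeric answers match within a relative tolerance; text must be equal."""
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return pred.strip().lower() == gold.strip().lower()
    if g == 0.0:
        return p == 0.0
    return abs(p - g) / abs(g) <= tol


print(anls_item("invoce", ["invoice"]))
print(relaxed_match("104", "100"), relaxed_match("106", "100"))
```

A benchmark score is then the mean of the per-item values over the full split.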
