1. Model Introduction

DiffusionGemma is a uniform-state (renoising) block-diffusion language model from Google. An encoder builds causal context, and a decoder denoises a fixed-length bidirectional canvas of canvas_length tokens. The Gemma4Renoise sampler runs up to max_denoising_steps reverse steps over the canvas, feeding the previous step’s logits back as self-conditioning and emitting the greedy argmax of the processed logits.
Key Features:
  • Uniform-State Renoising: The canvas starts from random tokens and is refined each step by accepting confident positions and re-noising the rest, with no mask token.
  • Encoder / Decoder Canvas: The encoder produces causal context KV, and the decoder attends bidirectionally over the canvas.
  • Self-Conditioning: Each step conditions on the previous step’s logits.
  • EntropyBound Acceptance: Each step accepts the lowest-entropy canvas positions within an entropy budget and re-noises the rest.
  • StableAndConfident Stopping: A canvas stops early once it is stable and confident.
  • MoE Architecture: The 26B-A4B model uses a Mixture-of-Experts architecture for efficient inference.
  • Multimodal Input: Accepts text and image inputs (via a ~550M vision encoder) and generates text output.
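
The accept / re-noise step described in the Uniform-State Renoising and EntropyBound Acceptance bullets can be sketched in plain Python. This is a minimal illustration, not the Gemma4Renoise implementation: the function names are hypothetical, and it assumes the entropy budget caps the summed entropy of the accepted positions, which the text above does not specify.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def renoise_step(probs_per_pos, entropy_bound, rng):
    """One illustrative accept / re-noise step over a canvas.

    ASSUMPTION: the budget is a cap on the summed entropy of accepted
    positions; the real sampler may define the bound differently.
    """
    entropies = [entropy(p) for p in probs_per_pos]
    # Visit positions from most to least confident (lowest entropy first).
    order = sorted(range(len(probs_per_pos)), key=lambda i: entropies[i])

    accepted = set()
    spent = 0.0
    for i in order:
        if spent + entropies[i] > entropy_bound:
            break
        spent += entropies[i]
        accepted.add(i)

    new_canvas = []
    for i, p in enumerate(probs_per_pos):
        vocab_size = len(p)
        if i in accepted:
            # Accepted position: keep the greedy argmax token.
            new_canvas.append(max(range(vocab_size), key=lambda t: p[t]))
        else:
            # Rejected position: re-noise with a uniformly random token
            # (there is no mask token in this scheme).
            new_canvas.append(rng.randrange(vocab_size))
    return new_canvas, accepted


if __name__ == "__main__":
    toy_probs = [
        [0.99, 0.005, 0.005],  # confident
        [0.4, 0.3, 0.3],       # uncertain
        [0.0, 1.0, 0.0],       # fully certain
    ]
    canvas, kept = renoise_step(toy_probs, entropy_bound=0.1, rng=random.Random(1234))
    print(canvas, sorted(kept))
```

In the toy run, positions 0 and 2 fit inside the budget and keep their argmax tokens, while position 1 is replaced by a random token and would be revisited on the next step.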
Available Models:
Model | Architecture | Parameters
google/diffusiongemma-26B-A4B-it | MoE, uniform-state diffusion (text + image) | 25.2B total / 3.8B active
Architecture Specifications:
Spec | Value
Total Parameters | 25.2B
Active Parameters | 3.8B
Layers | 30
Sliding Window | 1024 tokens
Context Length | Up to 256K tokens
Canvas Length | 256
Vocabulary Size | 262K
Experts | 8 active / 128 total + 1 shared
Supported Modalities | Text, Image
Vision Encoder | ~550M parameters
License: Refer to the model card for license details.

2. SGLang Installation

Please refer to the official SGLang installation guide for installation instructions. The checkpoint ships its own modeling code, so --trust-remote-code is required when serving.

3. Model Deployment

3.1 Basic Configuration

The runtime settings that Gemma4Renoise requires are applied automatically: the Triton attention backend, eager mode, and unchunked prefill. They are needed because the full-attention head_dim is 512 and the canvas uses bidirectional attention. A default launch therefore works:
Command
sglang serve \
  --model-path google/diffusiongemma-26B-A4B-it \
  --dllm-algorithm Gemma4Renoise \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000
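
Loading a checkpoint of this size can take a while, so a client may want to wait for the server before sending requests. The sketch below polls the OpenAI-compatible /v1/models route using only the standard library. The helper names, retry count, and delay are illustrative assumptions, not part of SGLang.

```python
import time
import urllib.error
import urllib.request


def models_url(host="localhost", port=30000):
    """URL of the OpenAI-compatible model listing endpoint."""
    return f"http://{host}:{port}/v1/models"


def http_probe(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200, False on common network errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, probe=http_probe, attempts=60, delay=5.0, sleep=time.sleep):
    """Poll until the probe succeeds.

    Returns True once the server answers, or False if all attempts fail.
    The probe and sleep functions are injectable so the loop can be tested
    without a running server.
    """
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False


if __name__ == "__main__":
    ready = wait_until_ready(models_url())
    print("server ready" if ready else "server did not come up in time")
```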

3.2 Configuration Tips

dLLM-Specific Parameters:
Parameter | Description | Recommended Value
--dllm-algorithm | Diffusion decoding algorithm | Gemma4Renoise
--trust-remote-code | Required to load the checkpoint’s modeling code | Always enabled
--dllm-algorithm-config | Optional YAML overriding the renoise schedule | Checkpoint defaults
The attention backend, eager mode, and unchunked prefill are selected automatically for Gemma4Renoise, so they do not need to be passed on the command line.
Sampling is governed by the renoise schedule:
  • Request-level logprobs, penalties, logit_bias, and grammar / structured output (json_schema / regex / ebnf / structural_tag) are not supported; requests that set them are rejected with a 400.
  • Core sampling controls (temperature, top_k, top_p) are accepted but have no effect.
  • Streaming is block-level: one fully-denoised canvas per chunk.
Gemma4Renoise Config (defaults follow the checkpoint’s generation_config.json):
Config
# Number of reverse denoising steps per canvas.
max_denoising_steps: 48
# Optional. Makes the renoise sampling reproducible (also shared across TP ranks).
seed: 1234
sampler_config:
  # Entropy budget. Accept the lowest-entropy canvas positions within this bound each step (the rest are re-noised).
  entropy_bound: 0.1
# Linear temperature schedule applied over the denoising steps.
temperature_schedule:
  t_min: 0.4
  t_max: 0.8
# Stop early once the canvas is stable and confident.
stopping_config:
  confidence_threshold: 0.005
  stability_threshold: 1
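
To make the temperature_schedule and stopping_config fields more concrete, here is a small sketch of how such settings could be interpreted. It rests on assumptions the config comments do not confirm: that the schedule runs from t_max on the first step down to t_min on the last, that "stable" means the canvas was unchanged for at least stability_threshold consecutive steps, and that "confident" means the mean per-token entropy is at or below confidence_threshold. The function names are hypothetical.

```python
def linear_temperature(step, max_steps, t_min=0.4, t_max=0.8):
    """Linearly interpolate the temperature across denoising steps.

    ASSUMPTION: the schedule starts at t_max (step 0) and ends at t_min
    (step max_steps - 1); the real direction may differ.
    """
    if max_steps <= 1:
        return t_min
    frac = step / (max_steps - 1)
    return t_max + (t_min - t_max) * frac


def should_stop(unchanged_steps, mean_entropy,
                stability_threshold=1, confidence_threshold=0.005):
    """Illustrative "stable and confident" early-stopping check.

    ASSUMPTION: stability counts consecutive steps with an unchanged canvas,
    and confidence is measured by mean per-token entropy.
    """
    stable = unchanged_steps >= stability_threshold
    confident = mean_entropy <= confidence_threshold
    return stable and confident


if __name__ == "__main__":
    for step in (0, 24, 47):
        print(step, round(linear_temperature(step, 48), 3))
    print(should_stop(unchanged_steps=1, mean_entropy=0.001))
```

Under these assumptions, a canvas that stops early uses fewer than max_denoising_steps decoder passes, which is where the latency savings would come from.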

4. Model Invocation

4.1 Deployment

Start the server with the command from Section 3.1.

4.2 Basic Usage

Example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="google/diffusiongemma-26B-A4B-it",
    messages=[
        {"role": "user", "content": "What are the key differences between TCP and UDP?"}
    ],
    max_tokens=1024
)

print(response.choices[0].message.content)
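
Because Section 3.2 notes that some request fields are rejected with a 400 and others are silently ignored, a client that shares code with other backends may want to filter its arguments first. The helper below is a sketch under assumptions: the field names are a guess at how the categories in Section 3.2 map onto request fields, and sanitize_request is a hypothetical name, not an SGLang or OpenAI API.

```python
# ASSUMPTION: these names approximate the categories listed in Section 3.2
# (logprobs, penalties, logit_bias, grammar / structured output). Adjust them
# to match the fields your client actually sends.
REJECTED_FIELDS = {
    "logprobs", "top_logprobs",
    "frequency_penalty", "presence_penalty", "repetition_penalty",
    "logit_bias",
    "response_format", "json_schema", "regex", "ebnf", "structural_tag",
}

# Accepted by the server but documented as having no effect.
IGNORED_FIELDS = {"temperature", "top_k", "top_p"}


def sanitize_request(params):
    """Split request kwargs into (clean kwargs, dropped names, ineffective names)."""
    clean = {k: v for k, v in params.items() if k not in REJECTED_FIELDS}
    dropped = sorted(k for k in params if k in REJECTED_FIELDS)
    ineffective = sorted(k for k in params if k in IGNORED_FIELDS)
    return clean, dropped, ineffective


if __name__ == "__main__":
    kwargs = {"max_tokens": 1024, "logprobs": True, "temperature": 0.2}
    clean, dropped, ineffective = sanitize_request(kwargs)
    print("send:", clean)
    print("dropped (would be rejected):", dropped)
    print("kept but ineffective:", ineffective)
```

The clean dictionary can then be unpacked into client.chat.completions.create alongside model and messages.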

4.3 Streaming

Streaming emits one fully-denoised canvas per chunk.
Example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="google/diffusiongemma-26B-A4B-it",
    messages=[
        {"role": "user", "content": "Write a Python function to compute the Fibonacci sequence."}
    ],
    max_tokens=2048,
    stream=True
)

for chunk in response:
    # Some chunks may carry no choices; skip them.
    if chunk.choices:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)

print()
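
Since streaming is block-level, the number of content chunks is bounded by the number of canvases rather than the number of tokens. The arithmetic below is a rough sketch that assumes each canvas contributes at most Canvas Length (256) tokens and that max_tokens caps the total output. Early stopping and short answers would lower both figures.

```python
import math

CANVAS_LENGTH = 256        # "Canvas Length" from the architecture table
MAX_DENOISING_STEPS = 48   # default from the Gemma4Renoise config


def max_canvases(max_tokens, canvas_length=CANVAS_LENGTH):
    """Upper bound on canvases (and so on streamed content chunks)."""
    return math.ceil(max_tokens / canvas_length)


def max_decoder_passes(max_tokens, canvas_length=CANVAS_LENGTH,
                       steps=MAX_DENOISING_STEPS):
    """Upper bound on denoising passes if no canvas stops early."""
    return max_canvases(max_tokens, canvas_length) * steps


if __name__ == "__main__":
    for budget in (1024, 2048):
        print(budget, max_canvases(budget), max_decoder_passes(budget))
```

For the max_tokens=2048 request above, this gives at most 8 content chunks and at most 384 denoising passes.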

5. Benchmark

5.1 Speed Benchmark

Not benchmarked for speed.

5.2 Accuracy Benchmark

Full test splits are used and every item is scored (no failed-request exclusions). Text MCQ benchmarks use greedy generate-and-parse; MATH uses boxed-answer extraction plus sympy equivalence. MMLU, ARC-Challenge, and MATH-500 scores are the mean over two independent server launches.
Benchmark | Score
GSM8K | 95.4%
ARC-Challenge | 91.6%
HumanEval | 92.7% pass@1
MMLU | 76.2%
MMLU-Pro | 73.7%
GSM-Symbolic | 92.2%
MATH-500 | 72.1%
AIME-2026 | 10.0%
HMMT-Feb-2025 | 10.0%
GPQA-main | 59.2%
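
The MATH scoring described above (boxed-answer extraction plus sympy equivalence) can be illustrated as follows. This is a sketch, not the harness that produced the table: the function names are hypothetical, the sympy check only handles answers that sympy.sympify can parse as plain expressions (not general LaTeX), and it falls back to string comparison when sympy is unavailable.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Closing brace of \boxed{...}: done.
                return "".join(out).strip()
        out.append(ch)
    return None  # unbalanced braces


def is_equivalent(pred, gold):
    """Compare two answers: exact string match first, then sympy if possible."""
    if pred is None:
        return False
    if pred.strip() == gold.strip():
        return True
    try:
        import sympy  # optional dependency

        diff = sympy.simplify(sympy.sympify(pred) - sympy.sympify(gold))
        return diff == 0
    except Exception:
        # sympy missing or the answer could not be parsed: count as not equal.
        return False


if __name__ == "__main__":
    reply = r"Adding the terms gives \boxed{\frac{1}{2}}."
    print(extract_boxed(reply))
    print(is_equivalent("42", " 42 "))
```

Note that sympy.sympify evaluates its input as an expression, so it should only be applied to model output in a sandboxed evaluation setting.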
Multimodal, full standard split per task (MMMU / MMMU-Pro / MMStar / AI2D as multiple-choice, MathVista testmini, DocVQA by ANLS, ChartQA by relaxed accuracy):
Multimodal benchmark | Score
MMMU (val, MC) | 64.9%
MMMU-Pro (standard 10-opt, MC) | 57.3%
MathVista (testmini) | 68.4%
DocVQA (val) | 85.9%
ChartQA (test) | 61.7%
AI2D (test) | 78.7%
MMStar (val) | 65.9%
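
For readers unfamiliar with the two non-multiple-choice metrics named above, the sketch below follows their commonly used definitions: ANLS averages a thresholded normalized Levenshtein similarity (threshold 0.5), and relaxed accuracy accepts numeric answers within 5% of the reference and otherwise requires an exact match. The exact normalization used for the scores in the table is not stated, so the lowercasing and whitespace handling here are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def anls(predictions, gold_answers, tau=0.5):
    """Average Normalized Levenshtein Similarity over a set of questions.

    gold_answers holds a list of acceptable answers per question; the best
    match per question is used.
    """
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            denom = max(len(p), len(g)) or 1
            nl = levenshtein(p, g) / denom
            # Similarity counts only when the normalized distance is below tau.
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)


def relaxed_match(pred, gold, tol=0.05):
    """ChartQA-style relaxed accuracy for a single answer."""
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        # Non-numeric answers require a case-insensitive exact match.
        return pred.strip().lower() == gold.strip().lower()
    if g == 0.0:
        return p == 0.0
    return abs(p - g) / abs(g) <= tol


if __name__ == "__main__":
    print(anls(["helo"], [["hello", "hi"]]))
    print(relaxed_match("104", "100"), relaxed_match("110", "100"))
```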