Diffusion language models

Diffusion language models (dLLMs) have shown promise for non-autoregressive text generation with parallel decoding. Unlike autoregressive language models, diffusion language models do not share a single decoding procedure: different models require different decoding strategies.

Example Launch Command

SGLang supports several dLLM decoding algorithms, such as LowConfidence and JointThreshold. Select one with --dllm-algorithm.

Command

# --model-path takes a Hugging Face model ID or a local path (an example ID is shown).
# --dllm-algorithm-config is optional; the algorithm's defaults are used if it is not set.
python3 -m sglang.launch_server \
  --model-path inclusionAI/LLaDA2.0-mini \
  --dllm-algorithm LowConfidence \
  --dllm-algorithm-config ./config.yaml \
  --host 0.0.0.0 \
  --port 30000

First-Done-First-Out (FDFO) Scheduling

FDFO scheduling is enabled by default. Each request leaves the batch as soon as its block is resolved. Under synchronous lockstep scheduling, by contrast, fast-converging requests must wait for slow long-tail requests before leaving the batch (head-of-line blocking). FDFO improves throughput and is orthogonal to --dllm-algorithm, so it works with any dLLM algorithm. Pass --no-dllm-fdfo to fall back to synchronous lockstep scheduling:

Command

python3 -m sglang.launch_server \
  --model-path inclusionAI/LLaDA2.0-mini \
  --dllm-algorithm LowConfidence \
  --no-dllm-fdfo \
  --host 0.0.0.0 \
  --port 30000
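
To make the head-of-line blocking argument concrete, here is a toy model of the two policies. It is a minimal sketch, not SGLang's scheduler: the function names, the fixed batch capacity, and the idea that each request needs a known number of denoising steps are all assumptions made for illustration.

```python
from collections import deque


def lockstep_steps(steps_needed, batch_size):
    """Forward passes when every batch runs until its slowest request is done."""
    total = 0
    for start in range(0, len(steps_needed), batch_size):
        total += max(steps_needed[start:start + batch_size])
    return total


def fdfo_steps(steps_needed, batch_size):
    """Forward passes when a finished request frees its slot immediately."""
    queue = deque(steps_needed)
    running = []  # remaining steps for each request currently in the batch
    total = 0
    while queue or running:
        # Refill free slots from the waiting queue before each forward pass.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        total += 1
        # Requests with one step left finish on this pass and leave the batch.
        running = [r - 1 for r in running if r > 1]
    return total


# Hypothetical workload: most requests converge quickly, two are long-tail.
steps = [2, 9, 3, 2, 8, 2, 3, 2]
print("lockstep:", lockstep_steps(steps, batch_size=4))  # 17
print("fdfo:    ", fdfo_steps(steps, batch_size=4))      # 10
```

In this toy workload the lockstep policy pays for the slowest request in every batch, while the FDFO policy overlaps the two long-tail requests with the short ones.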

Example Configuration File

The configuration parameters depend on the selected algorithm.

LowConfidence Config:

Config

# Confidence threshold for accepting predicted tokens
# - Higher values: More conservative, better quality but slower
# - Lower values: More aggressive, faster but potentially lower quality
# Range: 0.0 - 1.0
threshold: 0.95

# Default: 32, for LLaDA2MoeModelLM
block_size: 32
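
The effect of threshold can be illustrated with a single unmasking step. The sketch below is an assumption about how confidence-threshold decoding typically works, not SGLang's implementation: the names MASK and unmask_step are hypothetical, and the fallback of accepting the single most confident token when nothing clears the threshold is a common convention assumed here so that every step makes progress.

```python
MASK = None  # hypothetical placeholder for a still-masked position


def unmask_step(tokens, predictions, threshold=0.95):
    """Apply one threshold-based unmasking step.

    tokens:      current block, with MASK at undecoded positions
    predictions: one (token, confidence) pair per position
    """
    masked = [i for i, tok in enumerate(tokens) if tok is MASK]
    if not masked:
        return list(tokens)
    # Accept every masked position whose confidence clears the threshold.
    accepted = [i for i in masked if predictions[i][1] > threshold]
    if not accepted:
        # Assumed fallback: always accept the most confident position.
        accepted = [max(masked, key=lambda i: predictions[i][1])]
    out = list(tokens)
    for i in accepted:
        out[i] = predictions[i][0]
    return out


block = ["The", MASK, MASK, MASK]
preds = [("The", 1.0), ("Great", 0.99), ("Wall", 0.97), ("is", 0.60)]
step1 = unmask_step(block, preds)  # two positions clear 0.95
step2 = unmask_step(step1, preds)  # fallback accepts the last position
print(step1)
print(step2)
```

A higher threshold accepts fewer tokens per step, so a block takes more steps to resolve; a lower threshold accepts more tokens per step at the risk of committing to low-confidence predictions.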

JointThreshold Config:

Config

# Decoding threshold for Mask-to-Token (M2T) phase
# - Higher values: More conservative, better quality but slower
# - Lower values: More aggressive, faster but potentially lower quality
# Range: 0.0 - 1.0
threshold: 0.5
# Decoding threshold for Token-to-Token (T2T) phase
# Range: 0.0 - 1.0
# Setting to 0.0 allows full editing (recommended for most cases).
edit_threshold: 0.0
# Max extra T2T steps after all masks are removed. Prevents infinite loops.
max_post_edit_steps: 16
# 2-gram repetition penalty (default 0).
# An empirical value of 3 is often sufficient to mitigate most repetitions.
penalty_lambda: 0
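
The sketch below shows one plausible reading of how threshold, edit_threshold, and max_post_edit_steps interact, inferred only from the comments in the config above. It is not SGLang's implementation: the function joint_threshold_decode, the predict callback, and the fallback that always unmasks at least one token are hypothetical, and the 2-gram repetition penalty is omitted.

```python
MASK = None  # hypothetical placeholder for a still-masked position


def joint_threshold_decode(tokens, predict, threshold=0.5,
                           edit_threshold=0.0, max_post_edit_steps=16):
    """Decode one block; predict(tokens) returns (token, confidence) per position."""
    tokens = list(tokens)
    post_edit_steps = 0
    while True:
        preds = predict(tokens)
        masked = [i for i, tok in enumerate(tokens) if tok is MASK]
        # M2T: unmask positions whose confidence exceeds the threshold.
        m2t = [i for i in masked if preds[i][1] > threshold]
        if masked and not m2t:
            m2t = [max(masked, key=lambda i: preds[i][1])]  # assumed fallback
        # T2T: overwrite decoded tokens the model now disagrees with.
        t2t = [i for i, tok in enumerate(tokens)
               if tok is not MASK and preds[i][0] != tok
               and preds[i][1] > edit_threshold]
        for i in m2t + t2t:
            tokens[i] = preds[i][0]
        if not masked:
            # All masks were already gone: this pass was a post-edit step.
            post_edit_steps += 1
            if not t2t or post_edit_steps >= max_post_edit_steps:
                return tokens


def toy_predict(tokens):
    # Hypothetical model that always predicts the same target sequence.
    return [("a", 0.9), ("b", 0.8), ("c", 0.3)]


print(joint_threshold_decode(["x", MASK, MASK], toy_predict))
```

In this reading, max_post_edit_steps bounds the loop even when the model keeps changing its mind about already-decoded tokens, which matches the "prevents infinite loops" note in the config.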

Example Client Code Snippet

Like other supported models, diffusion language models can be used through the REST API or the Python API. The following Python example uses sgl.Engine, which loads the model in-process rather than sending a request to the launched server:

Example

import sglang as sgl

def main():
    llm = sgl.Engine(model_path="inclusionAI/LLaDA2.0-mini",
                     dllm_algorithm="LowConfidence",
                     max_running_requests=1,
                     trust_remote_code=True)

    prompts = [
        "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role> Write a brief introduction of the great wall <|role_end|><role>ASSISTANT</role>"
    ]

    sampling_params = {
        "temperature": 0,
        "max_new_tokens": 1024,
    }

    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

if __name__ == '__main__':
    main()
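
The prompt strings in these examples follow a fixed role-tag layout. A small helper can build them, as sketched below. The helper name build_prompt is hypothetical, and the layout is copied from the example prompts on this page; the model's own chat template remains the authoritative source.

```python
def build_prompt(user_message, system="detailed thinking off"):
    """Wrap a user message in the role tags used by the example prompts."""
    return (
        f"<role>SYSTEM</role>{system}<|role_end|>"
        f"<role>HUMAN</role> {user_message} <|role_end|>"
        "<role>ASSISTANT</role>"
    )


prompt = build_prompt("Write a brief introduction of the great wall")
print(prompt)
```

The returned string can be passed in the prompts list given to llm.generate.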

Curl example for making a generation request to the launched server:

Command

curl -X POST "http://127.0.0.1:30000/generate" \
     -H "Content-Type: application/json" \
     -d '{
        "text": [
            "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role> Write the number from 1 to 128 <|role_end|><role>ASSISTANT</role>",
            "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role> Write a brief introduction of the great wall <|role_end|><role>ASSISTANT</role>"
        ],
        "stream": true,
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 1024
        }
    }'
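
The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl payload above; the helper names are hypothetical. The streaming parser assumes the server frames each chunk as a `data: {json}` line and ends the stream with `data: [DONE]`, so verify that against your server version before relying on it.

```python
import json
import urllib.request


def build_payload(prompts, max_new_tokens=1024, stream=False):
    """Build the JSON body used by the curl example."""
    return {
        "text": list(prompts),
        "stream": stream,
        "sampling_params": {"temperature": 0, "max_new_tokens": max_new_tokens},
    }


def generate(prompts, base_url="http://127.0.0.1:30000", max_new_tokens=1024):
    """Send a non-streaming request to /generate and return the parsed JSON."""
    body = json.dumps(build_payload(prompts, max_new_tokens)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_stream_chunks(lines):
    """Yield decoded JSON chunks from streamed lines (assumed SSE framing)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        yield json.loads(data)
```

With the server from the launch command running, generate([...]) sends the request and returns the parsed response body. For streaming, set stream=True in the payload and feed the response lines to iter_stream_chunks.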

Supported Models

The table below summarizes the supported models.

Model Family	Example Model	Description
LLaDA2.0 (mini, flash)	`inclusionAI/LLaDA2.0-flash`	LLaDA2.0-flash is a diffusion language model featuring a 100B Mixture-of-Experts (MoE) architecture.
SDAR (JetLM)	`JetLM/SDAR-8B-Chat`	SDAR series diffusion language model (Chat), dense architecture.
SDAR (JetLM)	`JetLM/SDAR-30B-A3B-Chat`	SDAR series diffusion language model (Chat), MoE architecture.
DiffusionGemma	`google/diffusiongemma-26B-A4B-it`	Uniform-state (renoising) block-diffusion multimodal (text + image) MoE, 25.2B total / 3.8B active, served with the `Gemma4Renoise` sampler.
