# Bench Serving Guide

This guide explains how to benchmark online serving throughput and latency using `python -m sglang.bench_serving`. It supports multiple inference backends via OpenAI-compatible and native endpoints, and produces both console metrics and optional JSONL outputs.

### What it does

* Generates synthetic or dataset-driven prompts and submits them to a target serving endpoint
* Measures throughput, time-to-first-token (TTFT), inter-token latency (ITL), per-request end-to-end latency, and more
* Supports streaming or non-streaming modes, rate control, and concurrency limits

### Supported backends and endpoints

* `sglang` / `sglang-native`: `POST /generate`
* `sglang-oai`, `vllm`, `lmdeploy`: `POST /v1/completions`
* `sglang-oai-chat`, `vllm-chat`, `lmdeploy-chat`: `POST /v1/chat/completions`
* `trt` (TensorRT-LLM): `POST /v2/models/ensemble/generate_stream`
* `gserver`: custom server (not yet implemented in this script)
* `truss`: `POST /v1/models/model:predict`

If `--base-url` is provided, requests are sent to it. Otherwise, `--host` and `--port` are used. When `--model` is not provided, the script will attempt to query `GET /v1/models` for an available model ID (OpenAI-compatible endpoints).
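
The URL selection described above can be sketched as follows. This is a minimal illustration, not the script's actual code: `resolve_api_url` is a hypothetical helper, the endpoint paths are copied from the backend list in this guide, and the default host and port are assumptions taken from the quick-start example.

```python
# Endpoint paths copied from the backend list in this guide.
ENDPOINTS = {
    "sglang": "/generate",
    "sglang-native": "/generate",
    "sglang-oai": "/v1/completions",
    "vllm": "/v1/completions",
    "lmdeploy": "/v1/completions",
    "sglang-oai-chat": "/v1/chat/completions",
    "vllm-chat": "/v1/chat/completions",
    "lmdeploy-chat": "/v1/chat/completions",
    "trt": "/v2/models/ensemble/generate_stream",
    "truss": "/v1/models/model:predict",
}


def resolve_api_url(backend, base_url=None, host="127.0.0.1", port=30000):
    """Hypothetical helper: prefer base_url, otherwise build from host and port."""
    root = base_url.rstrip("/") if base_url else f"http://{host}:{port}"
    return root + ENDPOINTS[backend]


print(resolve_api_url("sglang"))
print(resolve_api_url("vllm-chat", base_url="http://127.0.0.1:8000"))
```

The first call falls back to host and port; the second uses the base URL and appends the chat completions path.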

### Prerequisites

* Python 3.10+
* Dependencies typically used by this script: `aiohttp`, `numpy`, `requests`, `tqdm`, `transformers`, and for some datasets `datasets`, `pillow`, `pybase64`. Install as needed.
* An inference server running and reachable via the endpoints above
* If your server requires authentication, set environment variable `OPENAI_API_KEY` (used as `Authorization: Bearer <key>`)

### Quick start

Run a basic benchmark against an sglang server exposing `/generate`:

```bash Command theme={null}
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
```

```bash Command theme={null}
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --num-prompts 1000 \
  --model meta-llama/Llama-3.1-8B-Instruct
```

Or, using an OpenAI-compatible endpoint (completions):

```bash Command theme={null}
python3 -m sglang.bench_serving \
  --backend vllm \
  --base-url http://127.0.0.1:8000 \
  --num-prompts 1000 \
  --model meta-llama/Llama-3.1-8B-Instruct
```

### Datasets

Select with `--dataset-name`:

* `sharegpt` (default): loads ShareGPT-style pairs; optionally restrict with `--sharegpt-context-len` and override outputs with `--sharegpt-output-len`
* `random`: random text lengths; sampled from ShareGPT token space
* `random-ids`: random token ids (can lead to gibberish)
* `image`: generates images and wraps them in chat messages; supports custom resolutions, multiple formats, and different content types
* `generated-shared-prefix`: synthetic dataset with shared long system prompts and short questions
* `mmmu`: samples from MMMU (Math split) and includes images
* `speed-bench`: [SPEED-Bench](https://huggingface.co/datasets/nvidia/SPEED-Bench) (**SPEculative Evaluation Dataset**) — a unified benchmark for evaluating [Speculative Decoding (SD)](https://arxiv.org/abs/2604.09557) algorithms. Uses the Throughput split, which provides fixed-length input sequences (1K–32K tokens) grouped into three output-entropy categories (`low_entropy`, `mixed`, `high_entropy`). Requires a pre-downloaded JSONL file passed via `--dataset-path`.
* `agentic-trace`: replays pre-built multi-turn agentic traces (e.g. OpenHands / SWE-smith). Each conversation is replayed round by round, feeding the server's real assistant reply back into the next round's history. Requires a chat backend (`--backend sglang-oai-chat`) and a trace JSON passed via `--dataset-path`.

Common dataset flags:

* `--num-prompts N`: number of requests
* `--random-input-len`, `--random-output-len`, `--random-range-ratio`: length controls for `random`, `random-ids`, and `image`
* `--image-count`: number of images per request (for the `image` dataset)
* `--apply-chat-template`: apply the tokenizer's chat template when constructing prompts
* `--dataset-path PATH`: path to the ShareGPT JSON file; if omitted and no local copy is found, the file is downloaded and cached

Generated Shared Prefix flags (for `generated-shared-prefix`):

* `--gsp-num-groups`
* `--gsp-prompts-per-group`
* `--gsp-system-prompt-len`
* `--gsp-question-len`
* `--gsp-output-len`
* `--gsp-group-distribution {uniform,zipf}`: per-request prefix-group sampling distribution (default: `uniform`). With `zipf`, each request's group is sampled by rank with `p(rank) = (1/rank**alpha) / sum_k(1/k**alpha)`; rank starts at 1 and group index 0 is the hottest. The on-disk dataset cache uses a distinct key per `(group_distribution, zipf_alpha)`, so uniform-mode caches are never mixed with zipf-mode caches.
* `--gsp-zipf-alpha FLOAT`: Zipf exponent for `--gsp-group-distribution=zipf`. Must be a finite float strictly greater than 0; larger values concentrate requests on lower-ranked (hotter) groups. Required when the distribution is `zipf`; must be omitted otherwise.
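
To make the Zipf formula concrete, here is a small sketch that evaluates `p(rank)` for a handful of groups and samples group indices from it. `zipf_probabilities` is a hypothetical helper written for this guide; it follows the formula and the `alpha` constraint stated above but is not the script's implementation.

```python
import math
import random


def zipf_probabilities(num_groups, alpha):
    """Return p(rank) for rank = 1..num_groups; list index 0 is the hottest group."""
    if not (math.isfinite(alpha) and alpha > 0):
        raise ValueError("alpha must be a finite float strictly greater than 0")
    weights = [1.0 / rank**alpha for rank in range(1, num_groups + 1)]
    total = sum(weights)
    return [w / total for w in weights]


probs = zipf_probabilities(num_groups=4, alpha=1.0)
print([round(p, 2) for p in probs])  # [0.48, 0.24, 0.16, 0.12]

# Draw a group index per request; a fixed seed keeps the draw repeatable.
rng = random.Random(42)
groups = rng.choices(range(len(probs)), weights=probs, k=8)
print(groups)
```

With `alpha=1.0` and four groups, the hottest group receives roughly half of the requests; raising `alpha` shifts even more mass toward index 0.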

Image dataset flags (for `image`):

* `--image-count`: Number of images per request
* `--image-resolution`: Image resolution; supports presets (4k, 1080p, 720p, 360p) or custom 'heightxwidth' format (e.g., 1080x1920, 512x768)
* `--image-format`: Image format (jpeg or png)
* `--image-content`: Image content type (random or blank)
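
As an illustration of the two `--image-resolution` forms, the sketch below parses either a preset name or a custom `heightxwidth` string into a `(height, width)` pair. `parse_resolution` and the `PRESETS` table are hypothetical; the pixel sizes assigned to each preset are the conventional ones and are an assumption, not values taken from the script.

```python
# Assumed pixel sizes for the presets, stored as (height, width).
PRESETS = {
    "4k": (2160, 3840),
    "1080p": (1080, 1920),
    "720p": (720, 1280),
    "360p": (360, 640),
}


def parse_resolution(value):
    """Hypothetical parser: accept a preset name or a 'heightxwidth' string."""
    key = value.strip().lower()
    if key in PRESETS:
        return PRESETS[key]
    height, sep, width = key.partition("x")
    if not sep or not height.isdecimal() or not width.isdecimal():
        raise ValueError(f"unsupported resolution: {value!r}")
    h, w = int(height), int(width)
    if h <= 0 or w <= 0:
        raise ValueError(f"resolution must be positive: {value!r}")
    return h, w


print(parse_resolution("720p"))
print(parse_resolution("512x768"))
```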

Agentic trace flags (for `agentic-trace`):

* `--dataset-path`: path to the pre-built trace JSON
* `--sharegpt-output-len`: per-turn output length (default: 220)
* `--dataset-offset`: rotate the conversation list by this many entries before sampling, so successive sweep steps start on fresh conversations
* `--agentic-max-turns`: cap each conversation to at most this many turns (useful for small, fast profiling runs)
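
The effect of `--dataset-offset` and `--agentic-max-turns` can be pictured with the sketch below. It assumes, purely for illustration, that a trace is a list of conversations and each conversation is a list of turns; `select_conversations` is a hypothetical helper, and the real trace schema may differ.

```python
def select_conversations(conversations, offset=0, max_turns=None):
    """Rotate the conversation list by `offset`, then cap each one at `max_turns`."""
    if not conversations:
        return []
    k = offset % len(conversations)
    rotated = conversations[k:] + conversations[:k]
    if max_turns is not None:
        rotated = [turns[:max_turns] for turns in rotated]
    return rotated


# Toy trace: three conversations with 3, 1, and 2 turns.
trace = [["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]
print(select_conversations(trace, offset=1, max_turns=2))
# [['b1'], ['c1', 'c2'], ['a1', 'a2']]
```

Rotating instead of slicing keeps every conversation available, so a sweep that increments the offset starts each step on a different conversation without shrinking the pool.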

SPEED-Bench flags (for `speed-bench`):

* `--dataset-path PATH`: path to the pre-downloaded SPEED-Bench Throughput JSONL (e.g., `throughput_1k.jsonl`). Use the [SPEED-Bench measurement framework](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench) to generate it.
* `--speed-bench-category`: filter to one entropy category: `low_entropy`, `mixed`, or `high_entropy` (default: all)
* `--speed-bench-output-len`: fixed number of output tokens per request (default: 512)

### Examples

1. To benchmark the `image` dataset with 3 images per request, 500 prompts, an input length of 512, and an output length of 512, run:

```bash Command theme={null}
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache
```

```bash Command theme={null}
python -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --dataset-name image \
    --num-prompts 500 \
    --image-count 3 \
    --image-resolution 720p \
    --random-input-len 512 \
    --random-output-len 512
```

2. To benchmark the `random` dataset with 3000 prompts, an input length of 1024, and an output length of 1024, run:

```bash Command theme={null}
python -m sglang.launch_server --model-path Qwen/Qwen2.5-3B-Instruct
```

```bash Command theme={null}
python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name random \
    --num-prompts 3000 \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --random-range-ratio 0.5
```

3. To benchmark speculative decoding throughput using SPEED-Bench (mixed-entropy category, 1K ISL), you can run:

```bash Command theme={null}
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
    --speculative-algorithm EAGLE --speculative-draft-model-path <draft-model-path>
```

```bash Command theme={null}
python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name speed-bench \
    --dataset-path /path/to/throughput_1k.jsonl \
    --speed-bench-category mixed \
    --speed-bench-output-len 512 \
    --num-prompts 512
```

### Choosing model and tokenizer

* `--model` is required unless the backend exposes `GET /v1/models`, in which case the first model ID is auto-selected.
* `--tokenizer` defaults to `--model`. Both can be HF model IDs or local paths.
* For ModelScope workflows, setting `SGLANG_USE_MODELSCOPE=true` enables fetching via ModelScope (weights are skipped for speed).
* If your tokenizer lacks a chat template, the script warns because token counting can be less robust for gibberish outputs.

### Rate, concurrency, and streaming

* `--request-rate`: requests per second. `inf` sends all immediately (burst). Non-infinite rate uses a Poisson process for arrival times.
* `--max-concurrency`: caps concurrent in-flight requests regardless of arrival rate.
* `--disable-stream`: switch to non-streaming mode when supported; TTFT then equals total latency for chat completions.
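
The interaction between arrival rate and the concurrency cap can be sketched with `asyncio`. This is a simplified model under stated assumptions, not the script's code: a Poisson process is simulated with exponentially distributed gaps of mean `1 / request_rate`, a semaphore stands in for `--max-concurrency`, and `fake_request` replaces the real HTTP call.

```python
import asyncio
import random


async def run(num_requests, request_rate, max_concurrency, seed=0):
    """Issue fake requests and return the peak number in flight at once."""
    rng = random.Random(seed)
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def fake_request():
        nonlocal in_flight, peak
        async with semaphore:  # wait here once the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the real request
            in_flight -= 1

    tasks = []
    for _ in range(num_requests):
        tasks.append(asyncio.create_task(fake_request()))
        if request_rate != float("inf"):
            # Exponential gaps between arrivals give a Poisson process.
            await asyncio.sleep(rng.expovariate(request_rate))
    await asyncio.gather(*tasks)
    return peak


# Burst mode: all 20 requests arrive at once, but only 4 run concurrently.
peak = asyncio.run(run(num_requests=20, request_rate=float("inf"), max_concurrency=4))
print(peak)  # 4
```

Even with an infinite request rate, the semaphore keeps the number of in-flight requests at or below the cap, which is why the two flags are tuned together.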

### Other key options

* `--output-file FILE.jsonl`: append JSONL results to file; auto-named if unspecified
* `--output-details`: include per-request arrays (generated texts, errors, ttfts, itls, input/output lens)
* `--extra-request-body '{"top_p":0.9,"temperature":0.6}'`: merged into payload (sampling params, etc.)
* `--disable-ignore-eos`: pass through EOS behavior (varies by backend)
* `--warmup-requests N`: run warmup requests with short output first (default 1)
* `--flush-cache`: call `/flush_cache` (sglang) before main run
* `--profile`: call `/start_profile` and `/stop_profile` (requires server to enable profiling, e.g., `SGLANG_TORCH_PROFILER_DIR`)
* `--lora-name name1 name2 ...`: randomly pick one per request and pass to backend (e.g., `lora_path` for sglang)
* `--tokenize-prompt`: send integer IDs instead of text (currently supports `--backend sglang` only)
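
To show what "merged into payload" means for `--extra-request-body`, the sketch below parses the JSON string and overlays it on a request body. `merge_extra_body` is a hypothetical helper, the base payload fields are illustrative, and the shallow top-level merge is an assumption about the behavior rather than a description of the script.

```python
import json


def merge_extra_body(payload, extra_request_body):
    """Return a copy of payload with keys from the JSON string laid on top."""
    merged = dict(payload)
    if extra_request_body:
        merged.update(json.loads(extra_request_body))
    return merged


# Illustrative completions-style payload; field names are assumptions.
base = {"model": "my-model", "prompt": "Hello", "max_tokens": 16}
merged = merge_extra_body(base, '{"top_p":0.9,"temperature":0.6}')
print(merged)
```

Keys in the extra body win over keys already in the payload, so the flag can also override defaults the benchmark would otherwise send.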

### Authentication

If your target endpoint requires OpenAI-style auth, set:

```bash Command theme={null}
export OPENAI_API_KEY=sk-...yourkey...
```

The script will add `Authorization: Bearer $OPENAI_API_KEY` automatically for OpenAI-compatible routes.
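
A minimal sketch of how such a header could be assembled from the environment is shown below. `build_headers` is a hypothetical helper for illustration; the `Content-Type` entry is an assumption, and only the `Authorization: Bearer` format comes from this guide.

```python
import os


def build_headers(env=os.environ):
    """Add a bearer token only when OPENAI_API_KEY is set and non-empty."""
    headers = {"Content-Type": "application/json"}
    api_key = env.get("OPENAI_API_KEY")
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


print(build_headers({"OPENAI_API_KEY": "sk-example"}))
print(build_headers({}))
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.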

### Metrics explained

Printed after each run:

* Request throughput (req/s)
* Input token throughput (tok/s) - includes both text and vision tokens
* Output token throughput (tok/s)
* Total token throughput (tok/s) - includes both text and vision tokens
* Total input text tokens and Total input vision tokens - per-modality breakdown
* Concurrency: aggregate time of all requests divided by wall time
* End-to-End Latency (ms): mean/median/std/p99 per-request total latency
* Time to First Token (TTFT, ms): mean/median/std/p99 for streaming mode
* Inter-Token Latency (ITL, ms): mean/median/std/p95/p99/max between tokens
* TPOT (ms): time per output token after the first token, i.e., `(latency - ttft) / (output_tokens - 1)`
* Accept length (sglang-only, if available): speculative decoding accept length

The script also retokenizes generated text with the configured tokenizer and reports "retokenized" counts.
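
The TPOT and concurrency definitions above can be checked by hand with a short sketch. Both functions are hypothetical helpers that follow the formulas as stated; returning `0.0` when a request produced one token or fewer is an assumption made here to avoid dividing by zero.

```python
def tpot_ms(latency_s, ttft_s, output_tokens):
    """Time per output token after the first, in milliseconds."""
    if output_tokens <= 1:
        return 0.0  # assumption: no decode phase to average over
    return (latency_s - ttft_s) / (output_tokens - 1) * 1000.0


def mean_concurrency(latencies_s, wall_time_s):
    """Aggregate time of all requests divided by wall time."""
    return sum(latencies_s) / wall_time_s


# A request that took 2.1 s overall, 0.1 s to the first token, 101 tokens out:
print(tpot_ms(latency_s=2.1, ttft_s=0.1, output_tokens=101))  # about 20 ms

# Three requests totalling 8 s of request time inside a 4 s run:
print(mean_concurrency([2.0, 2.0, 4.0], wall_time_s=4.0))  # 2.0
```

A concurrency of 2.0 here means that, on average, two requests were in flight at any moment of the run.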

### JSONL output format

When `--output-file` is set, one JSON object is appended per run. Base fields:

* Arguments summary: backend, dataset, `request_rate`, `max_concurrency`, etc.
* Duration and totals: `completed`, `total_input_tokens`, `total_output_tokens`, retokenized totals
* Throughputs and latency statistics as printed in the console
* `accept_length` when available (sglang)

With `--output-details`, an extended object also includes arrays:

* `input_lens`, `output_lens`
* `ttfts`, `itls` (per request: ITL arrays)
* `generated_texts`, `errors`
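
Because each run is one JSON object per line, the output file can be post-processed with a few lines of Python. The sketch below is illustrative: `parse_jsonl` and `summarize` are hypothetical helpers, the sample record is synthetic, and the assumption that successful requests carry an empty `errors` entry is not confirmed by this guide. Only field names listed above are read, and each is accessed with `.get` so a missing field does not raise.

```python
import io
import json


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


def summarize(run):
    """Pull a few fields out of one run record, tolerating missing keys."""
    ttfts = run.get("ttfts") or []
    errors = run.get("errors") or []
    return {
        "backend": run.get("backend"),
        "completed": run.get("completed"),
        "num_errors": sum(1 for e in errors if e),
        "mean_ttft": sum(ttfts) / len(ttfts) if ttfts else None,
    }


# Synthetic record for illustration only; real files hold more fields.
sample = io.StringIO(
    json.dumps(
        {"backend": "sglang", "completed": 2, "ttfts": [0.25, 0.75], "errors": ["", ""]}
    )
    + "\n"
)
summaries = [summarize(run) for run in parse_jsonl(sample)]
print(summaries)
```

For a real file, pass an open file handle instead of the `StringIO` object, for example `with open("sglang_random.jsonl") as f: runs = parse_jsonl(f)`.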

### End-to-end examples

1. sglang native `/generate` (streaming):

```bash Command theme={null}
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.5 \
  --num-prompts 2000 \
  --request-rate 100 \
  --max-concurrency 512 \
  --output-file sglang_random.jsonl --output-details
```

2. OpenAI-compatible Completions (e.g., vLLM):

```bash Command theme={null}
python3 -m sglang.bench_serving \
  --backend vllm \
  --base-url http://127.0.0.1:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --num-prompts 1000 \
  --sharegpt-output-len 256
```

3. OpenAI-compatible Chat Completions (streaming):

```bash Command theme={null}
python3 -m sglang.bench_serving \
  --backend vllm-chat \
  --base-url http://127.0.0.1:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --num-prompts 500 \
  --apply-chat-template
```

4. Images (VLM) with chat template:

```bash Command theme={null}
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model your-vlm-model \
  --dataset-name image \
  --image-count 2 \
  --image-resolution 720p \
  --random-input-len 128 --random-output-len 256 \
  --num-prompts 200 \
  --apply-chat-template
```

4a) Images with custom resolution:

```bash Command theme={null}
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model your-vlm-model \
  --dataset-name image \
  --image-count 1 \
  --image-resolution 512x768 \
  --random-input-len 64 --random-output-len 128 \
  --num-prompts 100 \
  --apply-chat-template
```

4b) 1080p images with PNG format and blank content:

```bash Command theme={null}
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model your-vlm-model \
  --dataset-name image \
  --image-count 1 \
  --image-resolution 1080p \
  --image-format png \
  --image-content blank \
  --random-input-len 64 --random-output-len 128 \
  --num-prompts 100 \
  --apply-chat-template
```

5. Generated shared prefix (long system prompts + short questions):

```bash Command theme={null}
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name generated-shared-prefix \
  --gsp-num-groups 64 --gsp-prompts-per-group 16 \
  --gsp-system-prompt-len 2048 --gsp-question-len 128 --gsp-output-len 256 \
  --num-prompts 1024
```

Zipfian / power-law prefix popularity (opt-in via `--gsp-group-distribution=zipf`):

```bash Command theme={null}
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name generated-shared-prefix \
  --gsp-num-groups 64 --gsp-prompts-per-group 16 \
  --gsp-system-prompt-len 2048 --gsp-question-len 128 --gsp-output-len 256 \
  --gsp-group-distribution zipf --gsp-zipf-alpha 1.2 \
  --seed 42
```

`zipf` mode samples each request's prefix group from the rank-based distribution `p(rank) = (1/rank**alpha) / sum_k(1/k**alpha)` with rank starting at 1, so group index 0 is the hottest. The total request count stays `num_groups * prompts_per_group` — identical to `uniform` mode — and only the per-request group assignment changes. `alpha` must be a finite float strictly greater than 0; larger values concentrate requests on lower-ranked (hotter) groups.

The on-disk dataset cache at `~/.cache/sglang/benchmark/gen_shared_prefix_*.pkl` includes `group_distribution` and `zipf_alpha` in its key, so uniform-mode and zipf-mode runs (or two zipf runs with different alpha) never share a cache file. Uniform-mode filenames are unchanged from the legacy format, so existing caches remain valid.

This flag controls prefix-popularity shape only. It does not by itself reproduce any production trace or guarantee an observed cache-hit rate for a given engine.

6. Tokenized prompts (ids) for strict length control (sglang only):

```bash Command theme={null}
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --tokenize-prompt \
  --random-input-len 2048 --random-output-len 256 --random-range-ratio 0.2
```

7. Profiling and cache flush (sglang):

```bash Command theme={null}
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --profile \
  --flush-cache
```

8. TensorRT-LLM streaming endpoint:

```bash Command theme={null}
python3 -m sglang.bench_serving \
  --backend trt \
  --base-url http://127.0.0.1:8000 \
  --model your-trt-llm-model \
  --dataset-name random \
  --num-prompts 100 \
  --disable-ignore-eos
```

9. Evaluating large-scale KV cache sharing with a Mooncake trace (sglang only). `--mooncake-workload` takes one of `conversation`, `mooncake`, `agent`, or `synthetic`:

```bash Command theme={null}
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model model-name \
  --dataset-name mooncake \
  --mooncake-slowdown-factor 1.0 \
  --mooncake-num-rounds 1000 \
  --mooncake-workload conversation \
  --use-trace-timestamps true \
  --random-output-len 256
```

10. Fake decode stress testing (PD disaggregation, decode-only):

When benchmarking pure decode performance in a PD disaggregation setup, you can bypass the prefill node entirely by using `--fake-prefill`. This requires the decode server to be started with `--disaggregation-transfer-backend fake`:

```bash Command theme={null}
# Step 1: Start a decode-only server with fake transfer backend
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend fake \
  --port 30001

# Step 2: Run bench_serving with --fake-prefill
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30001 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --num-prompts 500 \
  --random-input-len 1024 --random-output-len 256 \
  --fake-prefill
```

Similarly, `bench_one_batch_server` also supports `--fake-prefill`:

```bash Command theme={null}
python3 -m sglang.bench_one_batch_server \
  --base-url http://127.0.0.1:30001 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --batch-size 32 --input-len 1024 --output-len 256 \
  --fake-prefill
```

The `--fake-prefill` flag automatically injects special sentinel values into each request, telling the decode server to skip real KV transfer and generate fake KV data locally.

### Troubleshooting

* All requests failed: verify `--backend`, server URL/port, `--model`, and authentication. Check warmup errors printed by the script.
* Throughput seems too low: adjust `--request-rate` and `--max-concurrency`; verify server batch size/scheduling; ensure streaming is enabled if appropriate.
* Token counts look odd: prefer chat/instruct models with proper chat templates; otherwise tokenization of gibberish may be inconsistent.
* Image/MMMU datasets: ensure you installed extra deps (`pillow`, `datasets`, `pybase64`).
* Authentication errors (401/403): set `OPENAI_API_KEY` or disable auth on your server.

### Notes

* The script raises the file descriptor soft limit (`RLIMIT_NOFILE`) to help with many concurrent connections.
* For sglang, `/server_info` is queried post-run to report speculative decoding accept length when available.
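
Raising the soft file descriptor limit can be done with the standard `resource` module on Unix-like systems. The sketch below shows one way to do it and is not the script's implementation: `raise_nofile_soft_limit` is a hypothetical helper and the target of 65535 is an assumed value.

```python
import resource


def raise_nofile_soft_limit(target=65535):
    """Raise the RLIMIT_NOFILE soft limit toward target; return the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited or soft >= target:
        return soft  # already high enough
    # The soft limit may never exceed the hard limit.
    new_soft = target if hard == unlimited else min(target, hard)
    if new_soft <= soft:
        return soft
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        return soft  # leave the limit unchanged if the OS refuses
    return new_soft


print(raise_nofile_soft_limit())
```

Each open connection consumes a file descriptor, so a low soft limit can surface as connection errors at high `--max-concurrency` values.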