Evaluating New Models with SGLang#

This document provides commands for evaluating a model's accuracy and performance. Before open-sourcing a new model, we strongly suggest running these commands to verify that the scores match your internal benchmark results.

For cross-verification, please submit the commands for installation, server launching, and benchmark running, together with all scores and hardware requirements, when open-sourcing your models.

Reference: MiniMax M2
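
The benchmark commands below assume an SGLang server is already running on the target port. As a minimal sketch (the tensor-parallel size is only an example value; adapt the model path and flags to your hardware and model), a server can be launched with:

# --tp 8 is only an example; set the tensor-parallel size for your GPUs.
python -m sglang.launch_server \
  --model-path <model-path> \
  --tp 8 \
  --port 30000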

Accuracy#

LLMs#

SGLang provides built-in scripts to evaluate common benchmarks.

MMLU

python -m sglang.test.run_eval \
  --eval-name mmlu \
  --port 30000 \
  --num-examples 1000 \
  --max-tokens 8192

GSM8K

python -m sglang.test.few_shot_gsm8k \
  --host http://127.0.0.1 \
  --port 30000 \
  --num-questions 200 \
  --num-shots 5

HellaSwag

python benchmark/hellaswag/bench_sglang.py \
  --host http://127.0.0.1 \
  --port 30000 \
  --num-questions 200 \
  --num-shots 20

GPQA

python -m sglang.test.run_eval \
  --eval-name gpqa \
  --port 30000 \
  --num-examples 198 \
  --max-tokens 120000 \
  --repeat 8

Tip

For reasoning models, add --thinking-mode <mode> (e.g., qwen3, deepseek-r1, deepseek-v3). You may skip it if the model has forced thinking enabled.
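
For example, a reasoning model could be evaluated on GPQA as follows (the mode name is illustrative; pick the one that matches your model's chat template):

# --thinking-mode deepseek-r1 is only an example value.
python -m sglang.test.run_eval \
  --eval-name gpqa \
  --port 30000 \
  --num-examples 198 \
  --max-tokens 120000 \
  --repeat 8 \
  --thinking-mode deepseek-r1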

HumanEval

pip install human_eval

python -m sglang.test.run_eval \
  --eval-name humaneval \
  --num-examples 10 \
  --port 30000

VLMs#

MMMU

python benchmark/mmmu/bench_sglang.py \
  --port 30000 \
  --concurrency 64

Tip

You can set max tokens by passing --extra-request-body '{"max_tokens": 4096}'.
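
For example (4096 is just an illustrative limit):

python benchmark/mmmu/bench_sglang.py \
  --port 30000 \
  --concurrency 64 \
  --extra-request-body '{"max_tokens": 4096}'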

For models capable of processing video, we recommend extending the evaluation to include VideoMME, MVBench, and other relevant benchmarks.

Performance#

Performance benchmarks measure latency (Time To First Token, TTFT) and throughput (tokens/second).

LLMs#

Latency-Sensitive Benchmark

This simulates a scenario with low concurrency (e.g., single user) to measure latency.

python -m sglang.bench_serving \
  --backend sglang \
  --host 0.0.0.0 \
  --port 30000 \
  --dataset-name random \
  --num-prompts 10 \
  --max-concurrency 1

Throughput-Sensitive Benchmark

This simulates a high-traffic scenario to measure maximum system throughput.

python -m sglang.bench_serving \
  --backend sglang \
  --host 0.0.0.0 \
  --port 30000 \
  --dataset-name random \
  --num-prompts 1000 \
  --max-concurrency 100

Single Batch Performance

You can also benchmark the performance of processing a single batch offline.

python -m sglang.bench_one_batch_server \
  --model <model-path> \
  --batch-size 8 \
  --input-len 1024 \
  --output-len 1024

You can run more granular benchmarks by varying the prompt count and concurrency (a sweep over these presets is sketched after the list):

  • Low Concurrency: --num-prompts 10 --max-concurrency 1

  • Medium Concurrency: --num-prompts 80 --max-concurrency 16

  • High Concurrency: --num-prompts 500 --max-concurrency 100
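
As a minimal sketch (assuming the same random-dataset settings as the serving benchmarks above; adjust the host, port, and preset values to your setup), the three presets can be run as a shell loop over sglang.bench_serving:

# Sweep the "<num-prompts> <max-concurrency>" presets against a running server.
for cfg in "10 1" "80 16" "500 100"; do
  set -- $cfg  # split the preset into $1 (num prompts) and $2 (concurrency)
  python -m sglang.bench_serving \
    --backend sglang \
    --host 0.0.0.0 \
    --port 30000 \
    --dataset-name random \
    --num-prompts $1 \
    --max-concurrency $2
done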

Reporting Results#

For each evaluation, please report the following (a sample report entry is sketched after this list):

  1. Metric Score: Accuracy % (LLMs and VLMs); Latency (ms) and Throughput (tok/s) (LLMs only).

  2. Environment settings: GPU type/count, SGLang commit hash.

  3. Launch configuration: Model path, TP size, and any special flags.

  4. Evaluation parameters: Number of shots, examples, max tokens.
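
As a sketch only (every value below is a placeholder, not a real result), a single report entry might look like:

Benchmark: GSM8K (5-shot, 200 questions)
Score: <accuracy %>
Environment: <N>x <GPU type>, SGLang commit <hash>
Launch: python -m sglang.launch_server --model-path <model-path> --tp <tp-size> --port 30000
Eval command: python -m sglang.test.few_shot_gsm8k --host http://127.0.0.1 --port 30000 --num-questions 200 --num-shots 5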