This page walks through performance testing your SGLang deployment on Ascend NPUs. We cover three model types: text generation (Qwen/Qwen2.5-7B-Instruct), multimodal vision (Qwen/Qwen2.5-VL-7B-Instruct), and embedding (Qwen/Qwen3-Embedding-8B), in both online and offline serving modes. You can use Evalscope, AISBench, or SGLang's built-in benchmarking tools.
The benchmark output examples in this guide are for illustration only. Actual performance depends on your hardware (e.g., Atlas 800I A2 vs A3), model version, SGLang version, and deployment configuration. Always run benchmarks on your own hardware to obtain accurate performance data.
1. Prepare
1.1 Start SGLang server
Launch the server with the appropriate flags for each model type. Make sure SGLang is installed first; see Ascend NPU Quickstart for environment setup.
Text Generation
Multimodal
Embedding
# The model is downloaded automatically by SGLang; if you already have it locally, set --model-path to the local path instead.
sglang serve --model-path Qwen/Qwen2.5-7B-Instruct
# The model is downloaded automatically by SGLang; if you already have it locally, set --model-path to the local path instead.
sglang serve --model-path Qwen/Qwen2.5-VL-7B-Instruct --mm-attention-backend ascend_attn
# The model is downloaded automatically by SGLang; if you already have it locally, set --model-path to the local path instead.
sglang serve --model-path Qwen/Qwen3-Embedding-8B --is-embedding
Add & at the end of the command to run the server in the background, or open a new terminal to run the benchmark commands in the following sections.
The server binds to http://127.0.0.1:30000 by default. All online benchmarks below assume the server is running at that address. The --is-embedding flag is required for embedding models.
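Before benchmarking, it helps to confirm the server is actually ready to accept requests. Below is a minimal readiness check, assuming the default address and SGLang's /health endpoint (adjust if you changed --host or --port):
# Minimal readiness check for the SGLang server (assumes the default
# address and the /health endpoint).
import time
import urllib.request

BASE_URL = "http://127.0.0.1:30000"

def wait_for_server(timeout_s: int = 300) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{BASE_URL}/health", timeout=5) as resp:
                if resp.status == 200:
                    print("Server is ready.")
                    return
        except OSError:
            pass  # server not up yet; keep polling
        time.sleep(2)
    raise TimeoutError(f"Server at {BASE_URL} not ready after {timeout_s}s")

wait_for_server()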
bench_serving and bench_offline_throughput are built into SGLang and require no extra installation. For Evalscope and AISBench, set up each in its own virtual environment:
python3 -m venv .evalscope_venv
source .evalscope_venv/bin/activate
pip install "evalscope[perf]" -U
python3 -m venv .aisbench_venv
source .aisbench_venv/bin/activate
git clone https://github.com/AISBench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517
pip3 install -r requirements/api.txt
pip3 install -r requirements/extra.txt
Run ais_bench -h to verify the installation. AISBench requires Python 3.10-3.12. After installation, all AISBench commands must be run from the benchmark/ directory (the cloned repo root). Set stream=True and ignore_eos=True in the model config for accurate results.
2. Online Service: Text Generation Model
Test Qwen/Qwen2.5-7B-Instruct via the online serving endpoint.
Before running any benchmark in this section, make sure the SGLang text-generation server is running at http://127.0.0.1:30000. See Start SGLang server for the launch command.
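As a quick sanity check before benchmarking, you can send one request to the OpenAI-compatible /v1/chat/completions endpoint. A minimal sketch, assuming the default server address:
# One-off smoke test of the chat endpoint (not a benchmark).
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32,
}
req = urllib.request.Request(
    "http://127.0.0.1:30000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=60) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])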
For performance testing, prefer random datasets (--dataset random, --dataset-name random) over real datasets. Random datasets let you pin --min-prompt-length / --max-prompt-length and --min-tokens / --max-tokens to fixed values, producing consistent, repeatable results. Real datasets (ShareGPT, openqa, etc.) have variable input lengths that add noise and make cross-run comparisons unreliable.
2.1 Using Evalscope
Prerequisites:
- Evalscope installed and its virtual environment activated (source .evalscope_venv/bin/activate).
- SGLang server running at http://127.0.0.1:30000.
Run the following command to benchmark the server:
evalscope perf \
--parallel 10 \
--number 20 \
--model Qwen/Qwen2.5-7B-Instruct \
--url http://127.0.0.1:30000/v1/chat/completions \
--api openai \
--dataset random \
--max-tokens 1024 \
--min-tokens 1024 \
--prefix-length 0 \
--min-prompt-length 1024 \
--max-prompt-length 1024 \
--tokenizer-path Qwen/Qwen2.5-7B-Instruct \
--extra-args '{"ignore_eos": true}'
If the model has already been downloaded, you can point --tokenizer-path to the local model path instead of the model id.
Example output (for illustration only; actual results depend on your hardware and configuration):
Benchmarking summary:
┌─────────────────────────────┬─────────────┐
│ Metric                      │ Value       │
├─────────────────────────────┼─────────────┤
│ ── General ──               │             │
│ Test Duration (s)           │ 89.34       │
│ Concurrency                 │ 10          │
│ Request Rate (req/s)        │ -1.00       │
│ Total / Success / Failed    │ 20 / 20 / 0 │
│ Req Throughput (req/s)      │ 0.22        │
│ ── Latency ──               │             │
│ Avg Latency (s)             │ 44.67       │
│ TTFT (ms)                   │ 578.51      │
│ TPOT (ms)                   │ 43.10       │
│ ITL (ms)                    │ 43.12       │
│ ── Tokens ──                │             │
│ Avg Input Tokens            │ 1024.00     │
│ Avg Output Tokens           │ 1024.00     │
│ Output Throughput (tok/s)   │ 229.24      │
│ Total Throughput (tok/s)    │ 458.49      │
│ ── Speculative Decoding ──  │             │
│ Decoded Tok/Iter            │ 1.00        │
│ Spec. Accept Rate           │ 0.00        │
└─────────────────────────────┴─────────────┘
Percentile results:
┌────────────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Metric         │ 1%      │ 5%      │ 10%     │ 25%     │ 50%     │ 75%     │ 90%     │ 95%     │ 99%     │
├────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Latency (s)    │ 44.47   │ 44.47   │ 44.47   │ 44.47   │ 44.86   │ 44.86   │ 44.86   │ 44.86   │ 44.86   │
│ TTFT (ms)      │ 138.12  │ 142.07  │ 426.17  │ 426.87  │ 783.67  │ 785.26  │ 786.85  │ 787.97  │ 787.97  │
│ ITL (ms)       │ 41.84   │ 42.14   │ 42.22   │ 42.36   │ 42.57   │ 42.80   │ 42.99   │ 49.24   │ 49.84   │
│ TPOT (ms)      │ 42.71   │ 42.71   │ 42.71   │ 43.05   │ 43.08   │ 43.43   │ 43.43   │ 43.71   │ 43.71   │
│ Input tokens   │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │
│ Output tokens  │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │
│ Output (tok/s) │ 22.83   │ 22.83   │ 22.83   │ 22.83   │ 23.02   │ 23.03   │ 23.03   │ 23.03   │ 23.03   │
│ Total (tok/s)  │ 45.65   │ 45.65   │ 45.65   │ 45.65   │ 46.05   │ 46.05   │ 46.05   │ 46.05   │ 46.05   │
│ Decode (tok/s) │ 22.88   │ 23.03   │ 23.03   │ 23.07   │ 23.21   │ 23.42   │ 23.42   │ 23.42   │ 23.42   │
└────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
...
2.2 Using AISBench
Prerequisites:
- AISBench installed and its virtual environment activated (source .aisbench_venv/bin/activate). All commands must be run from the benchmark/ directory.
- SGLang server running at http://127.0.0.1:30000.
- stream=True and ignore_eos=True set in the model config for accurate results.
Two files need to be configured for performance testing.
First, describe the model and server settings in ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py:
# more details: https://ais-bench-benchmark.readthedocs.io/en/latest/base_tutorials/scenes_intro/performance_benchmark.html
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-stream-chat",
        path="Qwen/Qwen2.5-7B-Instruct",
        model="Qwen/Qwen2.5-7B-Instruct",
        stream=True,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="127.0.0.1",
        host_port=30000,
        url="",
        max_out_len=512,
        batch_size=32,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            ignore_eos=True,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
If the model has already been downloaded, point path to the local model path instead of the model id.
Second, configure random prompt lengths in ais_bench/datasets/synthetic/synthetic_config.py:
# more details: https://ais-bench-benchmark.readthedocs.io/en/latest/advanced_tutorials/synthetic_dataset.html
synthetic_config = {
    "Type": "tokenid",
    "RequestCount": 10,
    "TrustRemoteCode": False,
    "StringConfig": {
        "Input": {
            "Method": "uniform",
            "Params": {"MinValue": 1, "MaxValue": 200}
        },
        "Output": {
            "Method": "gaussian",
            "Params": {"Mean": 100, "Var": 200, "MinValue": 1, "MaxValue": 100}
        }
    },
    "TokenIdConfig": {
        "RequestSize": 10,
        "PrefixLen": 0
    }
}
Run with a synthetic dataset:
ais_bench --models vllm_api_stream_chat --datasets synthetic_gen_string -m perf
Example output (for illustration only; actual results depend on your hardware and configuration):
╒════════════════════════╤═══════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤════╕
│ Performance Parameters │ Stage │ Average         │ Min             │ Max             │ Median          │ P75             │ P90             │ P99             │ N  │
╞════════════════════════╪═══════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪════╡
│ E2EL                   │ total │ 3896.4 ms       │ 3081.6 ms       │ 4175.3 ms       │ 4013.8 ms       │ 4123.4 ms       │ 4137.1 ms       │ 4171.5 ms       │ 10 │
├────────────────────────┼───────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼────┤
│ TTFT                   │ total │ 411.6 ms        │ 346.7 ms        │ 439.7 ms        │ 416.3 ms        │ 426.6 ms        │ 434.4 ms        │ 439.2 ms        │ 10 │
├────────────────────────┼───────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼────┤
│ TPOT                   │ total │ 38.3 ms         │ 37.4 ms         │ 39.0 ms         │ 38.3 ms         │ 38.7 ms         │ 38.9 ms         │ 39.0 ms         │ 10 │
├────────────────────────┼───────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼────┤
│ ITL                    │ total │ 38.7 ms         │ 0.0 ms          │ 156.5 ms        │ 38.9 ms         │ 39.0 ms         │ 39.2 ms         │ 117.1 ms        │ 10 │
├────────────────────────┼───────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼────┤
│ InputTokens            │ total │ 123.4           │ 34.0            │ 228.0           │ 130.5           │ 170.5           │ 217.2           │ 226.92          │ 10 │
├────────────────────────┼───────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼────┤
│ OutputTokens           │ total │ 92.1            │ 69.0            │ 100.0           │ 95.0            │ 99.75           │ 100.0           │ 100.0           │ 10 │
├────────────────────────┼───────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼────┤
│ OutputTokenThroughput  │ total │ 23.5937 token/s │ 22.3912 token/s │ 24.2616 token/s │ 23.7399 token/s │ 23.9919 token/s │ 24.2027 token/s │ 24.2557 token/s │ 10 │
╘════════════════════════╧═══════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧════╛
╒══════════════════════════╤═══════╤══════════════════╕
│ Common Metric            │ Stage │ Value            │
╞══════════════════════════╪═══════╪══════════════════╡
│ Benchmark Duration       │ total │ 4175.4485 ms     │
├──────────────────────────┼───────┼──────────────────┤
│ Total Requests           │ total │ 10               │
├──────────────────────────┼───────┼──────────────────┤
│ Failed Requests          │ total │ 0                │
├──────────────────────────┼───────┼──────────────────┤
│ Success Requests         │ total │ 10               │
├──────────────────────────┼───────┼──────────────────┤
│ Concurrency              │ total │ 9.3317           │
├──────────────────────────┼───────┼──────────────────┤
│ Max Concurrency          │ total │ 32               │
├──────────────────────────┼───────┼──────────────────┤
│ Request Throughput       │ total │ 2.395 req/s      │
├──────────────────────────┼───────┼──────────────────┤
│ Total Input Tokens       │ total │ 1234             │
├──────────────────────────┼───────┼──────────────────┤
│ Prefill Token Throughput │ total │ 299.8329 token/s │
├──────────────────────────┼───────┼──────────────────┤
│ Total Generated Tokens   │ total │ 921              │
├──────────────────────────┼───────┼──────────────────┤
│ Input Token Throughput   │ total │ 295.5371 token/s │
├──────────────────────────┼───────┼──────────────────┤
│ Output Token Throughput  │ total │ 220.5751 token/s │
├──────────────────────────┼───────┼──────────────────┤
│ Total Token Throughput   │ total │ 516.1122 token/s │
╘══════════════════════════╧═══════╧══════════════════╛
2.3 Using bench_serving
SGLang's built-in bench_serving requires no extra installation. Make sure the server is running at http://127.0.0.1:30000 before running the benchmark.
python -m sglang.bench_serving \
--backend sglang-oai \
--base-url http://127.0.0.1:30000 \
--model Qwen/Qwen2.5-7B-Instruct \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 512 \
--random-range-ratio 1 \
--num-prompts 100 \
--max-concurrency 32
--dataset-name random samples token IDs from the ShareGPT dataset to generate realistic input; the first run downloads ShareGPT from Hugging Face automatically. Set export HF_ENDPOINT=https://hf-mirror.com if Hugging Face is not directly reachable.
Set --random-range-ratio 1 for fixed input/output lengths (recommended for consistent comparisons) or 0 (default) for a uniform distribution. Add --request-rate to control the request rate. For all backends, datasets, and advanced options, see the full Bench Serving Guide.
Example output (for illustration only; actual results depend on your hardware and configuration):
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 32
Successful requests: 100
Benchmark duration (s): 47.51
Total input tokens: 102400
Total input text tokens: 102400
Total generated tokens: 51200
Total generated tokens (retokenized): 51195
Request throughput (req/s): 2.10
Input token throughput (tok/s): 2155.35
Output token throughput (tok/s): 1077.68
Peak output token throughput (tok/s): 1587.00
Peak concurrent requests: 64
Total token throughput (tok/s): 3233.03
Concurrency: 26.93
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 12793.49
Median E2E Latency (ms): 12940.17
P90 E2E Latency (ms): 13049.86
P99 E2E Latency (ms): 13051.61
---------------Time to First Token----------------
Mean TTFT (ms): 1423.99
Median TTFT (ms): 1489.29
P99 TTFT (ms): 2325.56
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 22.25
Median TPOT (ms): 22.22
P99 TPOT (ms): 25.08
---------------Inter-Token Latency----------------
Mean ITL (ms): 22.26
Median ITL (ms): 20.74
P95 ITL (ms): 21.40
P99 ITL (ms): 23.62
Max ITL (ms): 2229.30
==================================================
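To see how latency and throughput trade off, you can rerun the same benchmark at several concurrency levels. A small driver sketch, reusing exactly the flags from the command above:
# Sweep --max-concurrency over the documented bench_serving command.
import subprocess

for conc in (1, 8, 32):
    print(f"\n=== max concurrency: {conc} ===")
    subprocess.run(
        [
            "python", "-m", "sglang.bench_serving",
            "--backend", "sglang-oai",
            "--base-url", "http://127.0.0.1:30000",
            "--model", "Qwen/Qwen2.5-7B-Instruct",
            "--dataset-name", "random",
            "--random-input-len", "1024",
            "--random-output-len", "512",
            "--random-range-ratio", "1",
            "--num-prompts", "100",
            "--max-concurrency", str(conc),
        ],
        check=True,
    )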
3. Online Service: Multimodal Model
Test Qwen/Qwen2.5-VL-7B-Instruct for vision-language tasks.
Before running any benchmark in this section, make sure the SGLang multimodal server is running at http://127.0.0.1:30000. See Start SGLang server and use the Multimodal tab for the launch command.
For consistent, repeatable results, set --random-range-ratio 1 to fix input/output lengths, or leave it at 0 (the default) for a uniform distribution.
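Before a long multimodal run, you can sanity-check the endpoint with a single request. A minimal sketch, assuming the default server address; the image URL below is a placeholder you must replace with one the server can reach:
# One-off smoke test of the multimodal chat endpoint (not a benchmark).
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            # Placeholder URL; replace with an image the server can fetch.
            {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
        ],
    }],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://127.0.0.1:30000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=120) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])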
3.1 Using Evalscope
Prerequisites:
- Evalscope installed and its virtual environment activated (source .evalscope_venv/bin/activate).
- SGLang multimodal server running at http://127.0.0.1:30000.
Evalscope's perf tool uses the OpenAI-compatible /v1/chat/completions endpoint. Use --dataset random_vl for randomized multimodal data with image generation:
evalscope perf \
--parallel 10 \
--number 20 \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--url http://127.0.0.1:30000/v1/chat/completions \
--api openai \
--dataset random_vl \
--min-tokens 1024 \
--max-tokens 1024 \
--prefix-length 0 \
--min-prompt-length 1024 \
--max-prompt-length 1024 \
--image-width 512 \
--image-height 512 \
--image-format RGB \
--image-num 1 \
--tokenizer-path Qwen/Qwen2.5-VL-7B-Instruct \
--extra-args '{"ignore_eos": true}'
If the model has already been downloaded, you can point --tokenizer-path to the local model path instead of the model id.
3.2 Using AISBench
Prerequisites:
- AISBench installed and its virtual environment activated (source .aisbench_venv/bin/activate). All commands run from the benchmark/ directory.
- SGLang multimodal server running at http://127.0.0.1:30000.
- AISBench does not include a built-in multimodal dataset; you must provide your own.
First, edit ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py to configure the vision model:
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-stream-chat",
        path="Qwen/Qwen2.5-VL-7B-Instruct",
        model="Qwen/Qwen2.5-VL-7B-Instruct",
        stream=True,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="127.0.0.1",
        host_port=30000,
        url="",
        max_out_len=256,
        batch_size=16,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            ignore_eos=True,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
If the model has already been downloaded, point path to the local model path instead of the model id.
Next, download a multimodal dataset such as mmstar:
# Download the mmstar dataset (from within the benchmark/ directory)
cd ais_bench/datasets
mkdir mmstar
cd mmstar
wget https://www.modelscope.cn/datasets/evalscope/MMStar/resolve/master/MMStar.tsv
Run the performance test:
ais_bench --models vllm_api_stream_chat --datasets mmstar_gen -m perf
Example output (for illustration only; actual results depend on your hardware and configuration):
╒════════════════════════╤═══════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤══════╕
│ Performance Parameters │ Stage │ Average         │ Min             │ Max             │ Median          │ P75             │ P90             │ P99             │ N    │
╞════════════════════════╪═══════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪══════╡
│ E2EL                   │ total │ 6190.9 ms       │ 5071.4 ms       │ 8464.8 ms       │ 6126.6 ms       │ 6475.2 ms       │ 6833.5 ms       │ 7897.9 ms       │ 1500 │
├────────────────────────┼───────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ TTFT                   │ total │ 693.3 ms        │ 96.0 ms         │ 2161.5 ms       │ 747.4 ms        │ 870.9 ms        │ 1032.3 ms       │ 1620.8 ms       │ 1500 │
├────────────────────────┼───────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ TPOT                   │ total │ 21.6 ms         │ 17.8 ms         │ 32.1 ms         │ 21.3 ms         │ 23.1 ms         │ 24.5 ms         │ 29.1 ms         │ 1500 │
├────────────────────────┼───────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ ITL                    │ total │ 25.5 ms         │ 0.0 ms          │ 1951.1 ms       │ 18.8 ms         │ 19.7 ms         │ 37.3 ms         │ 121.8 ms        │ 1500 │
├────────────────────────┼───────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ InputTokens            │ total │ 0.0             │ 0.0             │ 0.0             │ 0.0             │ 0.0             │ 0.0             │ 0.0             │ 1500 │
├────────────────────────┼───────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ OutputTokens           │ total │ 256.0           │ 256.0           │ 256.0           │ 256.0           │ 256.0           │ 256.0           │ 256.0           │ 1500 │
├────────────────────────┼───────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ OutputTokenThroughput  │ total │ 41.6779 token/s │ 30.243 token/s  │ 50.4791 token/s │ 41.7847 token/s │ 44.6424 token/s │ 45.6484 token/s │ 46.0932 token/s │ 1500 │
╘════════════════════════╧═══════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧══════╛
╒═════════════════════════╤═══════╤══════════════════╕
│ Common Metric           │ Stage │ Value            │
╞═════════════════════════╪═══════╪══════════════════╡
│ Benchmark Duration      │ total │ 582099.6816 ms   │
├─────────────────────────┼───────┼──────────────────┤
│ Total Requests          │ total │ 1500             │
├─────────────────────────┼───────┼──────────────────┤
│ Failed Requests         │ total │ 0                │
├─────────────────────────┼───────┼──────────────────┤
│ Success Requests        │ total │ 1500             │
├─────────────────────────┼───────┼──────────────────┤
│ Concurrency             │ total │ 15.9532          │
├─────────────────────────┼───────┼──────────────────┤
│ Max Concurrency         │ total │ 16               │
├─────────────────────────┼───────┼──────────────────┤
│ Request Throughput      │ total │ 2.5769 req/s     │
├─────────────────────────┼───────┼──────────────────┤
│ Total Input Tokens      │ total │ 0                │
├─────────────────────────┼───────┼──────────────────┤
│ Total Generated Tokens  │ total │ 384000           │
├─────────────────────────┼───────┼──────────────────┤
│ Input Token Throughput  │ total │ 0.0 token/s      │
├─────────────────────────┼───────┼──────────────────┤
│ Output Token Throughput │ total │ 659.6808 token/s │
├─────────────────────────┼───────┼──────────────────┤
│ Total Token Throughput  │ total │ 659.6808 token/s │
╘═════════════════════════╧═══════╧══════════════════╛
3.3 Using bench_serving (image dataset)
Set --dataset-name image for image datasets. bench_serving will generate random prompts with image inputs. Make sure the server is running at http://127.0.0.1:30000 before running the benchmark.
python -m sglang.bench_serving \
--backend sglang \
--base-url http://127.0.0.1:30000 \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--dataset-name image \
--random-input-len 1024 \
--random-output-len 512 \
--random-range-ratio 1 \
--num-prompts 32 \
--max-concurrency 16 \
--image-count 1 \
--image-resolution 720p
Example output (for illustration only; actual results depend on your hardware and configuration):
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 32
Benchmark duration (s): 51.74
Total input tokens: 73464
Total input text tokens: 35128
Total input vision tokens: 38336
Total generated tokens: 16384
Total generated tokens (retokenized): 9300
Request throughput (req/s): 0.62
Input token throughput (tok/s): 1419.96
Output token throughput (tok/s): 316.68
Peak output token throughput (tok/s): 800.00
Peak concurrent requests: 32
Total token throughput (tok/s): 1736.64
Concurrency: 15.98
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 25841.84
Median E2E Latency (ms): 25842.85
P90 E2E Latency (ms): 26296.42
P99 E2E Latency (ms): 26303.13
---------------Time to First Token----------------
Mean TTFT (ms): 12211.59
Median TTFT (ms): 14405.77
P99 TTFT (ms): 15837.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 26.67
Median TPOT (ms): 21.75
P99 TPOT (ms): 41.89
---------------Inter-Token Latency----------------
Mean ITL (ms): 26.67
Median ITL (ms): 20.34
P95 ITL (ms): 20.85
P99 ITL (ms): 21.70
Max ITL (ms): 11309.91
==================================================
4. Online Service: Embedding Model
Test Qwen/Qwen3-Embedding-8B on the embedding API endpoint.
Before running any benchmark in this section, make sure the SGLang embedding server is running with --is-embedding at http://127.0.0.1:30000. See Start SGLang server and use the Embedding tab for the launch command. AISBench does not support embedding endpoints; use bench_serving or Evalscope instead.
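As a quick functional check before benchmarking, you can send a single request to the OpenAI-compatible /v1/embeddings endpoint. A minimal sketch, assuming the default server address:
# One-off smoke test of the embeddings endpoint (not a benchmark).
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen3-Embedding-8B",
    "input": "The quick brown fox jumps over the lazy dog.",
}
req = urllib.request.Request(
    "http://127.0.0.1:30000/v1/embeddings",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=30) as resp:
    body = json.load(resp)
print(f"Got an embedding with {len(body['data'][0]['embedding'])} dimensions")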
4.1 Using Evalscope
Prerequisites:
- Evalscope installed and its virtual environment activated (source .evalscope_venv/bin/activate).
- SGLang embedding server running with --is-embedding at http://127.0.0.1:30000.
Evalscope supports embedding evaluation; to performance-test the embedding API directly, run:
evalscope perf \
--parallel 10 \
--number 20 \
--model Qwen/Qwen3-Embedding-8B \
--url http://127.0.0.1:30000/v1/embeddings \
--api openai_embedding \
--dataset random_embedding \
--min-prompt-length 1024 \
--max-prompt-length 1024 \
--tokenizer-path Qwen/Qwen3-Embedding-8B
If the model has already been downloaded, you can point --tokenizer-path to the local model path instead of the model id.
4.2 Using bench_serving (embedding backend)
bench_serving is built into SGLang. Use --backend sglang-embedding to target the /v1/embeddings endpoint. Make sure the server is running with --is-embedding at http://127.0.0.1:30000.
python -m sglang.bench_serving \
--backend sglang-embedding \
--base-url http://127.0.0.1:30000 \
--model Qwen/Qwen3-Embedding-8B \
--dataset-name random \
--random-input-len 512 \
--random-output-len 0 \
--num-prompts 1000 \
--max-concurrency 64 \
--request-rate 32
--dataset-name random samples token IDs from the ShareGPT dataset; the first run downloads ShareGPT from Hugging Face automatically. Set export HF_ENDPOINT=https://hf-mirror.com if Hugging Face is not directly reachable. Set --random-output-len 0 for embedding benchmarks; no output tokens are generated.
Example output (for illustration only; actual results depend on your hardware and configuration):
============ Serving Benchmark Result ============
Backend: sglang-embedding
Traffic request rate: 32.0
Max request concurrency: 64
Successful requests: 1000
Benchmark duration (s): 31.86
Total input tokens: 257891
Total input text tokens: 257891
Request throughput (req/s): 31.39
Input token throughput (tok/s): 8094.67
Peak concurrent requests: 62
Concurrency: 6.67
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 212.34
Median E2E Latency (ms): 160.97
P90 E2E Latency (ms): 267.31
P99 E2E Latency (ms): 1445.94
==================================================
5. Offline Inference
SGLang's Engine API runs inference in-process, without an HTTP server, letting you measure maximum throughput. bench_offline_throughput is built into SGLang and requires no extra installation or running server.
bench_offline_throughput currently only supports text-generation (LLM) benchmarks. Multimodal and embedding models are not supported.
5.1 Using bench_offline_throughput
bench_offline_throughput uses the Engine API internally and measures pure inference throughput without HTTP overhead:
python -m sglang.bench_offline_throughput \
--model-path Qwen/Qwen2.5-7B-Instruct \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 512 \
--num-prompts 500
--dataset-name random samples token IDs from the ShareGPT dataset; the first run downloads ShareGPT from Hugging Face automatically. Set export HF_ENDPOINT=https://hf-mirror.com if Hugging Face is not directly reachable.
--dataset-name random with --random-input-len and --random-output-len gives you full control over input/output token counts. Fixed-length random data removes the variance of real datasets, making throughput comparisons across runs consistent and repeatable.
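If you need more control than bench_offline_throughput offers, you can drive the Engine API directly. The sketch below is illustrative only, assuming sglang.Engine and its generate() method as documented for SGLang's offline engine; the meta_info["completion_tokens"] field may vary across versions:
# Rough offline-throughput measurement via the in-process Engine API.
# Illustrative sketch; API details may differ across SGLang versions.
import time
import sglang as sgl

engine = sgl.Engine(model_path="Qwen/Qwen2.5-7B-Instruct")
prompts = ["Write a short story about a robot."] * 32
sampling_params = {"temperature": 0.0, "max_new_tokens": 512, "ignore_eos": True}

start = time.perf_counter()
outputs = engine.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Count generated tokens; fall back to max_new_tokens if the field is absent.
total_out = sum(o.get("meta_info", {}).get("completion_tokens", 512) for o in outputs)
print(f"{total_out} output tokens in {elapsed:.1f}s ({total_out / elapsed:.1f} tok/s)")
engine.shutdown()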
See also