During inference serving, you sometimes need to monitor the serving framework's internal execution flow to identify performance issues. Collecting start/end timestamps of key flows, identifying critical functions or iterations, and recording key events lets you locate performance bottlenecks quickly. This guide walks through the complete workflow for collecting performance data from an SGLang Ascend NPU inference service: preparation, collection, analysis, and visualization. For more profiling scenarios, such as Nsight Systems or PD disaggregation, see SGLang Benchmark and Profiling.

Ascend PyTorch Profiler

SGLang has built-in PyTorch Profiler support. Through the Ascend torch_npu backend, you can directly collect NPU operator-level performance data. No additional packages are required — profiling start/stop is controlled via API requests.

1. Environment Setup

Launch an SGLang online service and set the SGLANG_TORCH_PROFILER_DIR environment variable to control where performance files are saved. Once the service starts, the profiler is on standby and waits for a start request.
Command
# Set the performance data output directory
export SGLANG_TORCH_PROFILER_DIR=./sglang_profile

# Start SGLang server (use local model path or HuggingFace model id)
sglang serve \
  --model-path /path/to/your/model \
  --attention-backend ascend \
  --host 0.0.0.0 --port 30000 \
  --tp-size 1 \
  --max-running-requests 128
On Ascend NPU, SGLang uses torch_npu._apply_patches() to automatically redirect PyTorch Profiler’s CUDA activity to NPU, so activities: ["CPU", "GPU"] actually captures NPU operator events.
Profiling-related environment variables:
| Variable | Description | Default |
| --- | --- | --- |
| SGLANG_TORCH_PROFILER_DIR | Trace file output directory | /tmp |
| SGLANG_PROFILE_WITH_STACK | Record Python call stack (True / False) | True |
| SGLANG_PROFILE_RECORD_SHAPES | Record operator input shapes (True / False) | True |

2. Collection Methods

SGLang provides four collection methods. They differ mainly in whether you need to send /start_profile and /stop_profile manually. All four produce identical results, so choose the most convenient one. Method comparison:
| Method | Manual start_profile | Manual stop_profile | Notes |
| --- | --- | --- | --- |
| A: API manual start/stop | Yes | Yes | Maximum flexibility for precise control |
| B: API auto-stop | Yes | No | Set num_steps; profiling auto-stops and generates output |
| C: bench_serving --profile | No | No | Benchmark + profiling in one command |
| D: sglang.profiler CLI | No | No | Standalone profiling CLI tool |

Method A: API Manual Start/Stop

Send /start_profile to start → send workload requests → send /stop_profile to stop. After stopping, the server automatically parses the data — no need to manually call analyse().
Command
# Step 1: Start profiling (no num_steps, requires manual stop)
curl -X POST http://127.0.0.1:30000/start_profile \
  -H "Content-Type: application/json" \
  -d '{
    "output_dir": "./sglang_profile",
    "start_step": 1,
    "activities": ["CPU", "GPU"]
  }'

# Step 2: Send workload requests (using curl as example)
curl http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 10}}'

# Step 3: Stop profiling
curl -X POST http://127.0.0.1:30000/stop_profile
/stop_profile returns "Stop profiling. This will take some time." because the server needs time to flush trace data to disk and parse it. Wait for the response to complete. Parsing can take a significant amount of time with this method; consider using Method B instead to avoid lengthy waits.
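The same three steps can be driven from Python. The following is a minimal sketch using only the standard library. It assumes the server from the setup step is listening on 127.0.0.1:30000; the helper names build_request and profile_manually are hypothetical and not part of SGLang.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:30000"  # assumed server address from the setup step


def build_request(path, payload=None, base_url=BASE_URL):
    """Build a POST request for an SGLang HTTP endpoint (hypothetical helper)."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {} if payload is None else {"Content-Type": "application/json"}
    return urllib.request.Request(
        base_url + path, data=data, headers=headers, method="POST"
    )


def profile_manually(prompts, output_dir="./sglang_profile"):
    """Start profiling, send the workload, then stop profiling (Method A)."""
    start = {"output_dir": output_dir, "start_step": 1, "activities": ["CPU", "GPU"]}
    urllib.request.urlopen(build_request("/start_profile", start)).read()
    for text in prompts:
        body = {"text": text, "sampling_params": {"max_new_tokens": 10}}
        urllib.request.urlopen(build_request("/generate", body)).read()
    # This call can take a while: the server flushes and parses the trace data.
    return urllib.request.urlopen(build_request("/stop_profile")).read().decode()
```

Calling profile_manually(["Hello"]) against a running server mirrors the curl sequence above. No client-side timeout is set here, so the final call simply waits until the server responds.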

Method B: API Auto-Stop

Specify num_steps in the /start_profile request. Profiling stops automatically after N steps and generates output — no need to manually send /stop_profile.
Command
# num_steps=10, wait 3 warmup steps, auto-stop after 10 steps
curl -X POST http://127.0.0.1:30000/start_profile \
  -H "Content-Type: application/json" \
  -d '{
    "output_dir": "./sglang_profile",
    "start_step": 3,
    "num_steps": 10,
    "activities": ["CPU", "GPU"]
  }'

# Just send workload — no /stop_profile needed
curl http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 32}}'

Method C: bench_serving --profile

Use SGLang’s built-in bench_serving with the --profile flag. Automatically handles /start_profile and /stop_profile — no manual API calls needed.
Command
# With --profile-steps: auto-stops after N steps and generates output
python -m sglang.bench_serving \
  --backend sglang \
  --base-url http://127.0.0.1:30000 \
  --model /path/to/your/model \
  --tokenizer /path/to/your/model \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 100 \
  --num-prompts 10 \
  --profile \
  --profile-steps 10 \
  --profile-output-dir ./sglang_profile

# Without --profile-steps: /stop_profile sent automatically after benchmark
python -m sglang.bench_serving \
  --backend sglang \
  --base-url http://127.0.0.1:30000 \
  --model /path/to/your/model \
  --tokenizer /path/to/your/model \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 100 \
  --num-prompts 10 \
  --profile \
  --profile-output-dir ./sglang_profile
--profile-steps N sends "num_steps": N to the server’s /start_profile, so the server auto-stops and parses data after N steps — bench_serving skips sending /stop_profile.
bench_serving --profile creates a timestamp subdirectory inside --profile-output-dir (e.g. <output_dir>/<timestamp>/). The output path is shown in the server log as Profiling done. Traces are saved to: <path>.
bench_serving --profile parameters:
| Parameter | Description |
| --- | --- |
| --profile | Enable auto profiling start/stop |
| --profile-steps N | Auto-stop after N steps (skips /stop_profile) |
| --profile-output-dir | Trace output directory |

Method D: sglang.profiler CLI

Use the sglang.profiler CLI module, which automatically sends /start_profile and waits for completion. Start sglang.profiler first, then send inference requests (otherwise there are no steps to capture and the profiler will wait indefinitely).
Command
# Terminal 1: Start sglang.profiler first (sends /start_profile, then waits for completion)
python3 -m sglang.profiler \
  --url http://127.0.0.1:30000 \
  --output-dir ./my_profiles \
  --num-steps 3 \
  --cpu --gpu &
Command
# Terminal 2: Immediately send inference requests to provide steps for profiling
curl http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 32}}'
A simpler and more reliable approach is to use bench_serving --profile, which handles both steps automatically:
Command
python3 -m sglang.bench_serving \
  --backend sglang \
  --base-url http://127.0.0.1:30000 \
  --model /path/to/your/model \
  --tokenizer /path/to/your/model \
  --dataset-name random \
  --random-input-len 128 \
  --random-output-len 32 \
  --num-prompts 10 \
  --profile \
  --profile-steps 3 \
  --profile-output-dir ./my_profiles
sglang.profiler is essentially a CLI wrapper around the /start_profile API. Advanced options like --profile-by-stage are also supported. On Ascend NPU, trace flushing is asynchronous and may take a while, so the CLI may occasionally block while waiting for the flush. If it times out, use Method B (API auto-stop) or Method C (bench_serving --profile) instead.
sglang.profiler CLI parameters:
| Parameter | Description |
| --- | --- |
| --url | SGLang server address |
| --output-dir | Output directory (defaults to SGLANG_TORCH_PROFILER_DIR) |
| --num-steps | Number of steps to profile |
| --profile-by-stage | Profile prefill / decode stages separately |
| --profile-prefix | Trace filename prefix |
| --cpu / --gpu / --mem / --rpd | Activity types to collect |
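The ordering requirement for Method D (profiler first, workload second) can be scripted. The sketch below is one way to do it under stated assumptions: profiler_argv and run_profiler_then_workload are hypothetical helper names, and startup_delay_s is an assumed grace period that gives the CLI time to send /start_profile before the workload begins. The flags are the ones listed in the table above.

```python
import subprocess
import sys
import time


def profiler_argv(url, output_dir, num_steps, activities=("cpu", "gpu")):
    """Build the sglang.profiler command line from the parameters above."""
    argv = [
        sys.executable, "-m", "sglang.profiler",
        "--url", url,
        "--output-dir", output_dir,
        "--num-steps", str(num_steps),
    ]
    return argv + [f"--{name}" for name in activities]


def run_profiler_then_workload(argv, send_workload,
                               startup_delay_s=2.0, timeout_s=600.0):
    """Start the profiler CLI first, then send requests, then wait for it to exit."""
    proc = subprocess.Popen(argv)
    try:
        time.sleep(startup_delay_s)  # assumed grace period, tune for your setup
        send_workload()              # any callable that issues inference requests
        return proc.wait(timeout=timeout_s)
    finally:
        # Clean up if the CLI is still blocked (timeout or workload failure).
        if proc.poll() is None:
            proc.kill()
            proc.wait()
```

A subprocess.TimeoutExpired raised from this function corresponds to the "CLI blocks waiting for flush" case described above, at which point falling back to Method B or Method C is the suggested path.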

3. Full Parameter Reference

All methods ultimately send a /start_profile request to the server. The full set of supported parameters:
| Parameter | Description | Default |
| --- | --- | --- |
| output_dir | Output directory. Falls back to SGLANG_TORCH_PROFILER_DIR or /tmp | /tmp |
| num_steps | Number of steps. If set, profiling auto-stops and no /stop_profile is needed | None |
| start_step | Step index to start profiling (inclusive), for skipping warmup | 0 |
| activities | Activity types: CPU, GPU, MEM, RPD. On Ascend NPU, primarily CPU and GPU | ["CPU", "GPU"] |
| profile_by_stage | Profile prefill and decode stages separately | false |
| with_stack | Record Python call stack. Also controllable via SGLANG_PROFILE_WITH_STACK | true |
| record_shapes | Record operator input shapes. Also controllable via SGLANG_PROFILE_RECORD_SHAPES | true |
| profile_prefix | Prefix for trace filenames | None |
| profile_stages | Stages to profile, e.g. ["prefill", "decode"]. Requires profile_by_stage | None |
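When building /start_profile requests from scripts, a small client-side wrapper can catch mistakes before they reach the server. The sketch below mirrors the parameters in the table above. StartProfileRequest and resolve_output_dir are hypothetical names, not SGLang classes, and the validation rules are only the ones stated in the table. Fields left unset are omitted from the payload so that the server's defaults and environment variables still apply.

```python
import os
from dataclasses import asdict, dataclass
from typing import List, Optional

KNOWN_ACTIVITIES = {"CPU", "GPU", "MEM", "RPD"}


@dataclass
class StartProfileRequest:
    """Hypothetical client-side mirror of the /start_profile parameters."""
    output_dir: Optional[str] = None
    num_steps: Optional[int] = None
    start_step: Optional[int] = None
    activities: Optional[List[str]] = None
    profile_by_stage: Optional[bool] = None
    with_stack: Optional[bool] = None
    record_shapes: Optional[bool] = None
    profile_prefix: Optional[str] = None
    profile_stages: Optional[List[str]] = None

    def to_payload(self):
        """Validate and return a JSON-serializable dict without unset fields."""
        if self.profile_stages and not self.profile_by_stage:
            raise ValueError("profile_stages requires profile_by_stage=True")
        unknown = set(self.activities or []) - KNOWN_ACTIVITIES
        if unknown:
            raise ValueError(f"unsupported activities: {sorted(unknown)}")
        return {k: v for k, v in asdict(self).items() if v is not None}

    def needs_manual_stop(self):
        """Without num_steps, profiling runs until /stop_profile is sent."""
        return self.num_steps is None


def resolve_output_dir(output_dir=None, env=os.environ):
    """Client-side view of the documented fallback: request, env var, then /tmp."""
    return output_dir or env.get("SGLANG_TORCH_PROFILER_DIR", "/tmp")
```

For example, StartProfileRequest(start_step=3, num_steps=10).to_payload() yields a body equivalent to the Method B request, minus the fields that were left at their server defaults.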

4. Finding Output Files

The server log explicitly indicates where traces are saved. You can find them via:
  • When profiling starts: server log outputs Profiling starts. Traces will be saved to: <path> (with profile id: <id>)
[2026-05-19 13:23:15] Profiling starts. Traces will be saved to: /tmp/1779196995.6948605 (with profile id: 1779196995.6979997)
[2026-05-19 13:23:15] [WARNING] [350443] profiler.py: Invalid parameter export_type: None, reset it to text.
[2026-05-19 13:23:15] [WARNING] [350443] profiler.py: Invalid parameter export_type: None, reset it to text.
[2026-05-19 13:23:15] INFO:     127.0.0.1:40714 - "POST /start_profile HTTP/1.1" 200 OK
  • When profiling stops: server log outputs Profiling done. Traces are saved to: <path>
[2026-05-19 13:23:17] Stop profiling...
[2026-05-19 13:23:17] [WARNING] [350443] profiler.py: Incorrect schedule: Stop profiler while current state is RECORD which may result in incomplete parsed data.
[rank0]:[W519 13:23:17.084812760 compiler_depend.ts:3136] Warning: The indexFromRank 0is not equal indexFromCurDevice 4 , which might be normal if the number of devices on your collective communication server is inconsistent.Otherwise, you need to check if the current device is correct when calling the interface.If it's incorrect, it might have introduced an error. (function operator())
[2026-05-19 13:23:17] [INFO] [352725] profiler.py: Start parsing profiling data: /tmp/1779196995.6948605/localhost.localdomain_350443_20260519132315700_ascend_pt
[2026-05-19 13:23:22] [INFO] [352734] profiler.py: CANN profiling data parsed in a total time of 0:00:04.022310
[2026-05-19 13:23:32] [INFO] [352725] profiler.py: All profiling data parsed in a total time of 0:00:14.305669
[2026-05-19 13:23:32] Profiling done. Traces are saved to: /tmp/1779196995.6948605
  • CLI output: sglang.profiler outputs Dump profiling traces to <path>
Dump profiling traces to /tmp/1779243331.3219
Waiting for 10 steps and the trace to be flushed.... (profile_by_stage=False)
The directory structure is <output_dir>/<hostname>_<pid>_<timestamp>_ascend_pt/. When using Method C (bench_serving --profile), a timestamp subdirectory is added: <output_dir>/<timestamp>/. Always check the server log for the exact path: Profiling done. Traces are saved to: <path>.
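If the server log is captured to a file, the trace paths can be pulled out with a regular expression. This is a minimal sketch: the patterns are copied from the log messages quoted above, the function name trace_dirs_from_log is hypothetical, and the sample text is a shortened copy of the log excerpt in this section. It assumes paths contain no spaces.

```python
import re

# Patterns copied from the server log messages quoted above.
START_RE = re.compile(r"Profiling starts\. Traces will be saved to: (\S+)")
DONE_RE = re.compile(r"Profiling done\. Traces are saved to: (\S+)")


def trace_dirs_from_log(log_text):
    """Return (started, finished) trace directories in log order."""
    return START_RE.findall(log_text), DONE_RE.findall(log_text)


sample_log = """\
[2026-05-19 13:23:15] Profiling starts. Traces will be saved to: /tmp/1779196995.6948605 (with profile id: 1779196995.6979997)
[2026-05-19 13:23:17] Stop profiling...
[2026-05-19 13:23:32] Profiling done. Traces are saved to: /tmp/1779196995.6948605
"""

started, finished = trace_dirs_from_log(sample_log)
# A directory that started but never finished is likely still being parsed
# or belongs to an interrupted run.
unfinished = [d for d in started if d not in finished]
print(started, finished, unfinished)
```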

5. Viewing Results

After profiling stops (either /stop_profile returns or num_steps auto-triggers), the server automatically parses the raw data. The ASCEND_PROFILER_OUTPUT directory directly contains the following visualization files — no need to manually call analyse():
| File | Description |
| --- | --- |
| trace_view.json | Chrome Tracing format. Open in MindStudio Insight |
| analysis.db | Database-format performance data |
| ascend_pytorch_profiler_0.db | Database-format performance data |
| kernel_details.csv | Kernel-level data |
| operator_details.csv | Operator-level data |
| step_trace_time.csv | Step trace timing data |
trace_view.json can also be opened using Chrome’s built-in chrome://tracing or Perfetto UI.
If you need to merge distributed trace files in a multi-node deployment, set "merge_profiles": true in the /start_profile request. Note: on Ascend NPU, the merger has limited support for the *_ascend_pt format — check trace_view.json on each node individually. See Benchmark and Profiling for details.
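For a quick look without opening a viewer, trace_view.json can be summarized in a few lines of Python. The sketch below relies only on the Chrome Trace Event Format, in which complete events carry "ph": "X" and a "dur" field in microseconds, and in which the file is either a bare event list or an object with a "traceEvents" key. The function names and the sample event names are illustrative, not taken from a real trace.

```python
import json
from collections import defaultdict


def load_trace(path):
    """Load a trace_view.json file (the whole file is read into memory)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def top_events(trace, limit=10):
    """Sum durations (microseconds) of complete events by name, largest first."""
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for ev in events:
        if ev.get("ph") == "X" and "dur" in ev:
            totals[ev.get("name", "<unnamed>")] += float(ev["dur"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


# Illustrative in-memory trace; real data comes from load_trace(<path>).
sample = [
    {"ph": "X", "name": "MatMul", "dur": 10},
    {"ph": "X", "name": "Add", "dur": 5},
    {"ph": "X", "name": "MatMul", "dur": 20},
    {"ph": "M", "name": "process_name"},
]
summary = top_events(sample)
print(summary)
```

Because json.load reads the entire file, this approach suits the small traces recommended in Best Practices; very large traces are better inspected in a dedicated viewer.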

6. Re-parsing Raw Data (Optional)

If you need to re-parse existing data with different parameters, or if profiling was interrupted and ASCEND_PROFILER_OUTPUT was not auto-generated, use torch_npu’s analyse() tool:
from torch_npu.profiler.profiler import analyse
# Replace the placeholder with the actual *_ascend_pt directory shown in the server log
analyse("./sglang_profile/<hostname>_*_ascend_pt/")
Normally there is no need to run analyse() manually, because the server already parses the data automatically. Use it only for re-parsing or for recovering data from an interrupted run.
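To decide which runs need re-parsing, you can scan the output directory for *_ascend_pt folders that have no ASCEND_PROFILER_OUTPUT subdirectory. The sketch below does only the scan and does not import torch_npu. The function name unparsed_trace_dirs and the directory names in the demo are hypothetical; the search is recursive so that the timestamp subdirectory created by bench_serving --profile is also covered.

```python
import os
import tempfile
from pathlib import Path


def unparsed_trace_dirs(output_dir):
    """List *_ascend_pt directories that lack an ASCEND_PROFILER_OUTPUT folder."""
    root = Path(output_dir)
    return sorted(
        d for d in root.rglob("*_ascend_pt")
        if d.is_dir() and not (d / "ASCEND_PROFILER_OUTPUT").is_dir()
    )


# Demo on a throwaway directory tree with one parsed and one unparsed run.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "host_1_20260519_ascend_pt", "ASCEND_PROFILER_OUTPUT"))
    os.makedirs(os.path.join(tmp, "1779196995", "host_2_20260519_ascend_pt"))
    demo = [d.name for d in unparsed_trace_dirs(tmp)]
print(demo)
```

Each path returned for a real output directory is a candidate to pass to analyse() as shown above.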

Best Practices

Common Notes

  • Finding output: Check the server log for Profiling starts. Traces will be saved to: <path> and Profiling done. Traces are saved to: <path>, or sglang.profiler output for Dump profiling traces to <path>.
  • Control trace file size: Reduce the number of requests and the output length with --num-prompts and --random-output-len so that trace files stay small enough for browsers to load.
  • Warmup iterations: Set start_step to skip the first few warmup steps and capture performance data under steady state.
  • Profile step count: Large values for num_steps or --profile-steps can lead to lengthy profiling data parsing times. Reduce these values appropriately when you only need a quick overview.
  • CUDA Graph impact: To see the full Python call stack → operator mapping in traces, add --disable-cuda-graph when starting the server. Note that this reduces decode performance — only use during profiling. To analyze CUDA Graph capture specifically, use --enable-profile-cuda-graph — traces are saved to SGLANG_TORCH_PROFILER_DIR/graph_capture_profile/.
  • Multi-node deployment: In multi-node environments, performance data is distributed across nodes. On Ascend NPU, the merge_profiles feature has limited support — check *_ascend_pt/ASCEND_PROFILER_OUTPUT/trace_view.json on each node individually. In PD disaggregation mode, prefill and decode workers must be profiled separately — see Profile In PD Disaggregation Mode.

See Also