During inference serving, it is sometimes necessary to monitor the internal execution flow of the serving framework to identify performance issues. By collecting start/end timestamps of key flows, identifying critical functions or iterations, recording key events, and gathering relevant information, you can quickly locate performance bottlenecks. This guide walks you through the complete workflow of collecting performance data in an SGLang Ascend NPU inference service (preparation, collection, analysis, and visualization), helping you get started with performance profiling quickly. For more profiling scenarios (e.g., Nsight Systems, PD disaggregation, etc.), see SGLang Benchmark and Profiling.
Ascend PyTorch Profiler
SGLang has built-in PyTorch Profiler support. Through the Ascend torch_npu
backend, you can directly collect NPU operator-level performance data. No
additional packages are required; profiling start/stop is controlled via API
requests.
1. Environment Setup
Launch an SGLang online service and set the SGLANG_TORCH_PROFILER_DIR
environment variable to control where performance files are saved. Once the
service starts, the profiler is on standby and can be triggered at any time.
Command
On Ascend NPU, SGLang uses torch_npu._apply_patches() to automatically
redirect PyTorch Profiler’s CUDA activity to NPU, so
activities: ["CPU", "GPU"] actually captures NPU operator events.
| Variable | Description | Default |
|---|---|---|
| SGLANG_TORCH_PROFILER_DIR | Trace file output directory | /tmp |
| SGLANG_PROFILE_WITH_STACK | Record Python call stack (True / False) | True |
| SGLANG_PROFILE_RECORD_SHAPES | Record operator input shapes (True / False) | True |
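As a minimal sketch of this setup step, the Python snippet below assembles the environment and launch command for a server process. The model path, the port, and the --device npu flag are assumptions for illustration and do not come from this guide; adjust them to your deployment.

```python
import os


def build_launch(profile_dir, model_path, port=30000):
    """Return (env, argv) for launching an SGLang server with profiling enabled."""
    env = dict(os.environ)
    # Directory where the server writes trace files (/tmp if unset).
    env["SGLANG_TORCH_PROFILER_DIR"] = profile_dir
    # Optional switches from the variable table; both default to True.
    env["SGLANG_PROFILE_WITH_STACK"] = "True"
    env["SGLANG_PROFILE_RECORD_SHAPES"] = "True"
    argv = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,  # assumption: placeholder model path
        "--device", "npu",           # assumption: selects the Ascend backend
        "--port", str(port),         # assumption: port reused in later sketches
    ]
    return env, argv


env, argv = build_launch("/tmp/sglang_profile", "/path/to/model")
print(" ".join(argv))
# To actually start the server: subprocess.Popen(argv, env=env)
```

Building the command as a list keeps the sketch inspectable without starting a server; the process is only launched if you pass the result to subprocess yourself.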
2. Collection Methods
SGLang provides four collection methods. They differ mainly in whether you need to send /start_profile and /stop_profile manually. All four methods
produce identical results, so choose the most convenient one.
Method comparison:
| Method | Manual start_profile | Manual stop_profile | Notes |
|---|---|---|---|
| A: API manual start/stop | Yes | Yes | Maximum flexibility for precise control |
| B: API auto-stop | Yes | No | Set num_steps, auto-stops and generates output |
| C: bench_serving --profile | No | No | Benchmark + profiling in one command |
| D: sglang.profiler CLI | No | No | Standalone profiling CLI tool |
Method A: API Manual Start/Stop
Send /start_profile to start → send workload requests → send /stop_profile
to stop. After stopping, the server automatically parses the data, so there is no need to
manually call analyse().
Command
/stop_profile returns "Stop profiling. This will take some time." — the
server needs time to flush trace data to disk and parse it. Wait for the
response to complete.
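The start, workload, stop sequence described above can be sketched in Python with only the standard library. The server address, the use of POST with a JSON body, and the timeout value are assumptions for illustration; run_workload is a hypothetical callback standing in for your own inference requests.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # assumption: server address


def build_request(endpoint, payload=None):
    """Build a POST request with a JSON body for a profiling endpoint."""
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def profile_manually(run_workload, timeout_s=3600):
    """Start profiling, run the workload, then stop and wait for the reply."""
    urllib.request.urlopen(build_request("/start_profile")).close()
    run_workload()  # hypothetical callback: send inference requests here
    # Stopping can take a long time, so allow a generous timeout.
    stop = build_request("/stop_profile")
    with urllib.request.urlopen(stop, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")
```

The long timeout on the stop call reflects the note above: the response only completes once the server has finished its work, so a short client timeout would abort the wait prematurely.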
This method takes a significant amount of time to parse
profiling data; consider using Method B instead to avoid lengthy waits.
Method B: API Auto-Stop
Specify num_steps in the /start_profile request. Profiling stops
automatically after N steps and generates output, so there is no need to manually send
/stop_profile.
Command
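A minimal sketch of such a request in Python, assuming the server listens on http://localhost:30000 and accepts a POST with a JSON body (both are assumptions, not values stated in this guide):

```python
import json
import urllib.request


def start_profile_request(num_steps, output_dir=None,
                          base_url="http://localhost:30000"):
    """Build a /start_profile request that stops on its own after num_steps."""
    payload = {"num_steps": num_steps}
    if output_dir is not None:
        payload["output_dir"] = output_dir
    return urllib.request.Request(
        base_url + "/start_profile",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = start_profile_request(num_steps=5, output_dir="/tmp/sglang_profile")
print(req.data.decode("utf-8"))
# To send it: urllib.request.urlopen(req).close(), then issue inference requests.
```

Because the step count travels with the start request, no matching stop call appears anywhere in the sketch.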
Method C: bench_serving --profile
Use SGLang’s built-in bench_serving with the --profile flag.
It handles /start_profile and /stop_profile automatically, so no manual API
calls are needed.
Command
--profile-steps N sends "num_steps": N to the server’s /start_profile, so
the server auto-stops and parses data after N steps, and bench_serving skips
sending /stop_profile.
bench_serving --profile creates a timestamp subdirectory inside
--profile-output-dir (e.g. <output_dir>/<timestamp>/). The output path is
shown in the server log as Profiling done. Traces are saved to: <path>.
bench_serving --profile parameters:
| Parameter | Description |
|---|---|
| --profile | Enable auto profiling start/stop |
| --profile-steps N | Auto-stop after N steps (skips /stop_profile) |
| --profile-output-dir | Trace output directory |
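To show how these flags fit together, the sketch below assembles a bench_serving command line in Python. The --backend and --dataset-name values and the small request counts are assumptions chosen to keep the trace short; only the --profile flags come from the table above.

```python
import shlex


def bench_serving_argv(output_dir, profile_steps=None,
                       num_prompts=8, random_output_len=32):
    """Assemble a bench_serving command line with profiling enabled."""
    argv = [
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",        # assumption: target backend
        "--dataset-name", "random",   # assumption: synthetic prompts
        "--num-prompts", str(num_prompts),
        "--random-output-len", str(random_output_len),
        "--profile",
        "--profile-output-dir", output_dir,
    ]
    if profile_steps is not None:
        # With a step count the server auto-stops, so /stop_profile is skipped.
        argv += ["--profile-steps", str(profile_steps)]
    return argv


print(shlex.join(bench_serving_argv("/tmp/sglang_profile", profile_steps=5)))
```

Passing the list to subprocess.run would execute the benchmark; printing it first makes it easy to review the flags before committing to a run.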
Method D: sglang.profiler CLI
Use the sglang.profiler CLI module, which automatically sends
/start_profile and waits for completion. Start sglang.profiler first,
then send inference requests (otherwise there are no steps to capture and the
profiler will wait indefinitely).
Command
Command
Alternatively, use bench_serving --profile, which
handles both steps automatically:
Command
sglang.profiler is essentially a CLI wrapper around the /start_profile API.
Advanced options like --profile-by-stage are also supported. On Ascend NPU,
trace flushing is asynchronous and may take a while, so the CLI may occasionally
block waiting for the flush. If it times out, use Method B (API auto-stop) or
Method C (bench_serving --profile) instead.
sglang.profiler CLI parameters:
| Parameter | Description |
|---|---|
| --url | SGLang server address |
| --output-dir | Output directory (defaults to SGLANG_TORCH_PROFILER_DIR) |
| --num-steps | Number of steps to profile |
| --profile-by-stage | Profile prefill / decode stages separately |
| --profile-prefix | Trace filename prefix |
| --cpu / --gpu / --mem / --rpd | Activity types to collect |
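The options in this table can be combined as in the following sketch, which only builds the command line. The python -m invocation, the server URL, and the way the activity flags are passed are assumptions; check the CLI's own help output before relying on them.

```python
import shlex


def profiler_argv(url, num_steps, output_dir=None, by_stage=False):
    """Assemble an sglang.profiler command line from the listed options."""
    argv = [
        "python", "-m", "sglang.profiler",
        "--url", url,
        "--num-steps", str(num_steps),
        "--cpu", "--gpu",  # assumption: activity types passed as plain flags
    ]
    if output_dir is not None:
        argv += ["--output-dir", output_dir]
    if by_stage:
        argv.append("--profile-by-stage")
    return argv


print(shlex.join(profiler_argv("http://localhost:30000", num_steps=5)))
# Order matters: launch this command first (for example with subprocess.Popen),
# then send inference requests so the profiler has steps to capture.
```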
3. Full Parameter Reference
All methods ultimately send a /start_profile request to the server. The full
set of supported parameters:
| Parameter | Description | Default |
|---|---|---|
| output_dir | Output directory. Falls back to SGLANG_TORCH_PROFILER_DIR or /tmp | /tmp |
| num_steps | Number of steps. If set, profiling auto-stops and no /stop_profile is needed | None |
| start_step | Step index to start profiling (inclusive), for skipping warmup | 0 |
| activities | Activity types: CPU, GPU, MEM, RPD. On Ascend NPU, primarily CPU and GPU | ["CPU", "GPU"] |
| profile_by_stage | Profile prefill and decode stages separately | false |
| with_stack | Record Python call stack. Also controllable via SGLANG_PROFILE_WITH_STACK | true |
| record_shapes | Record operator input shapes. Also controllable via SGLANG_PROFILE_RECORD_SHAPES | true |
| profile_prefix | Prefix for trace filenames | None |
| profile_stages | Stages to profile, e.g. ["prefill", "decode"]. Requires profile_by_stage | None |
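One way to keep these parameters consistent on the client side is a small dataclass that mirrors the table and produces the JSON payload. This is an illustrative sketch: the class name and the choice to omit unset fields (leaving the server to apply its defaults) are assumptions, not part of SGLang.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class StartProfileParams:
    """Client-side mirror of the /start_profile parameters (hypothetical)."""
    output_dir: Optional[str] = None
    num_steps: Optional[int] = None
    start_step: Optional[int] = None  # table lists 0 as the effective default
    activities: List[str] = field(default_factory=lambda: ["CPU", "GPU"])
    profile_by_stage: bool = False
    with_stack: bool = True
    record_shapes: bool = True
    profile_prefix: Optional[str] = None
    profile_stages: Optional[List[str]] = None

    def to_payload(self):
        """Drop unset fields so the server applies its own defaults."""
        if self.profile_stages and not self.profile_by_stage:
            raise ValueError("profile_stages requires profile_by_stage")
        return {k: v for k, v in asdict(self).items() if v is not None}


params = StartProfileParams(num_steps=10, start_step=3)
print(json.dumps(params.to_payload(), indent=2))
```

The check in to_payload encodes the one dependency the table states explicitly, so an invalid combination fails before any request is sent.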
4. Finding Output Files
The server log explicitly indicates where traces are saved. You can find them via:
- When profiling starts: server log outputs "Profiling starts. Traces will be saved to: <path> (with profile id: <id>)"
- When profiling stops: server log outputs "Profiling done. Traces are saved to: <path>"
- CLI output: sglang.profiler outputs "Dump profiling traces to <path>"
Trace directories follow the pattern
<output_dir>/<hostname>_<pid>_<timestamp>_ascend_pt/. When using Method C
(bench_serving --profile), a timestamp subdirectory is added:
<output_dir>/<timestamp>/. Always check the server log for the exact path:
Profiling done. Traces are saved to: <path>.
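If the server log is captured to a file, the completion message quoted above can be extracted with a regular expression. The sketch assumes the path contains no whitespace and that the message appears verbatim in the log text; the sample log lines are illustrative, not real server output.

```python
import re

# Matches the completion message quoted in this section and captures <path>.
_DONE = re.compile(r"Profiling done\. Traces are saved to: (?P<path>\S+)")


def trace_paths(log_text):
    """Return the paths of completed profiling runs found in server log text."""
    return [m.group("path") for m in _DONE.finditer(log_text)]


# Illustrative log excerpt (made up for this example).
sample = (
    "Profiling starts. Traces will be saved to: /tmp/prof (with profile id: 1)\n"
    "Profiling done. Traces are saved to: /tmp/prof\n"
)
print(trace_paths(sample))
```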
5. Viewing Results
After profiling stops (either /stop_profile returns or num_steps
auto-triggers), the server automatically parses the raw data. The
ASCEND_PROFILER_OUTPUT directory directly contains the following visualization
files, with no need to manually call analyse():
| File | Description |
|---|---|
| trace_view.json | Chrome Tracing format. Open in MindStudio Insight |
| analysis.db | Database-format performance data |
| ascend_pytorch_profiler_0.db | Database-format performance data |
| kernel_details.csv | Kernel-level data |
| operator_details.csv | Operator-level data |
| step_trace_time.csv | Step trace timing data |
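For a quick look without opening a viewer, trace_view.json can be summarized directly. The sketch below assumes the file follows the Chrome Trace Event format, where complete events carry "ph": "X" and a "dur" field in microseconds, and that the file fits in memory; it handles both a bare event list and a {"traceEvents": [...]} wrapper.

```python
import json
from collections import defaultdict


def top_events(trace_path, limit=10):
    """Sum durations of complete ('X') events by name, largest first."""
    with open(trace_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Chrome traces are either a bare event list or {"traceEvents": [...]}.
    events = data.get("traceEvents", []) if isinstance(data, dict) else data
    totals = defaultdict(float)
    for ev in events:
        if ev.get("ph") == "X" and "dur" in ev:
            totals[ev.get("name", "<unnamed>")] += float(ev["dur"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


# Usage: for name, total_us in top_events("<path>/trace_view.json"): ...
```

Summed durations ignore overlap between events on different threads or streams, so treat the ranking as a rough guide to where time goes rather than a wall-clock breakdown.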
trace_view.json can also be opened using Chrome’s built-in
chrome://tracing or Perfetto UI.
If you need to merge distributed trace files in a multi-node deployment, set
"merge_profiles": true in the /start_profile request. Note: on Ascend NPU,
the merger has limited support for the *_ascend_pt format, so check
trace_view.json on each node individually. See Benchmark and Profiling
for details.
6. Re-parsing Raw Data (Optional)
If you need to re-parse existing data with different parameters, or if profiling was interrupted and ASCEND_PROFILER_OUTPUT was not auto-generated,
use torch_npu’s analyse() tool:
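A minimal sketch of this step, assuming analyse is importable from torch_npu.profiler.profiler and accepts a profiler_path argument pointing at the directory that holds the *_ascend_pt runs (verify both against your installed torch_npu version). The unparsed_runs helper is hypothetical and only checks whether re-parsing is needed.

```python
from pathlib import Path


def unparsed_runs(output_dir):
    """List *_ascend_pt run directories that lack ASCEND_PROFILER_OUTPUT."""
    return sorted(
        run for run in Path(output_dir).glob("*_ascend_pt")
        if run.is_dir() and not (run / "ASCEND_PROFILER_OUTPUT").is_dir()
    )


def reparse(output_dir):
    """Parse raw profiling data under output_dir with torch_npu's analyse()."""
    # Imported lazily so unparsed_runs() works without torch_npu installed.
    from torch_npu.profiler.profiler import analyse
    analyse(profiler_path=str(output_dir))


# Usage (run on a machine with torch_npu installed):
#   if unparsed_runs("/tmp/sglang_profile"):
#       reparse("/tmp/sglang_profile")
```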
Normally there is no need to run analyse() manually, because the server already parses
data automatically. Only use this for re-parsing or handling interrupted data.
Best Practices
Common Notes
- Finding output: Check the server log for "Profiling starts. Traces will be saved to: <path>" and "Profiling done. Traces are saved to: <path>", or the sglang.profiler output for "Dump profiling traces to <path>".
- Control trace file size: Reduce the number of requests and output length using --num-prompts and --random-output-len to avoid trace files too large for browsers.
- Warmup iterations: Set start_step to skip the first few warmup steps and capture performance data under steady state.
- Profile step count: Large values for num_steps or --profile-steps can lead to lengthy profiling data parsing times. Reduce these values appropriately when you only need a quick overview.
- CUDA Graph impact: To see the full Python call stack → operator mapping in traces, add --disable-cuda-graph when starting the server. Note that this reduces decode performance, so only use it during profiling. To analyze CUDA Graph capture specifically, use --enable-profile-cuda-graph; traces are saved to SGLANG_TORCH_PROFILER_DIR/graph_capture_profile/.
- Multi-node deployment: In multi-node environments, performance data is distributed across nodes. On Ascend NPU, the merge_profiles feature has limited support, so check *_ascend_pt/ASCEND_PROFILER_OUTPUT/trace_view.json on each node individually. In PD disaggregation mode, prefill and decode workers must be profiled separately; see Profile In PD Disaggregation Mode.
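The per-node check and the file-size advice can be combined in a small helper that lists the local trace files and flags large ones. The 500 MB threshold is an arbitrary assumption, not a documented browser limit, and the glob pattern only covers runs directly under the output directory.

```python
from pathlib import Path

PATTERN = "*_ascend_pt/ASCEND_PROFILER_OUTPUT/trace_view.json"


def local_traces(output_dir, warn_mb=500):
    """Return (path, size_mb, too_large) for each trace under output_dir."""
    rows = []
    for trace in sorted(Path(output_dir).glob(PATTERN)):
        size_mb = trace.stat().st_size / (1024 * 1024)
        rows.append((trace, size_mb, size_mb > warn_mb))
    return rows


# Usage on each node:
#   for path, size_mb, too_large in local_traces("/tmp/sglang_profile"):
#       print(path, round(size_mb, 1), "LARGE" if too_large else "ok")
```

For Method C output, which adds a timestamp subdirectory, point output_dir at that subdirectory instead.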
See Also
- SGLang Benchmark and Profiling — General SGLang profiling guide
- Ascend NPU Quickstart — Ascend NPU environment setup
- Ascend NPU Optimization — Ascend NPU optimization parameters
- Ascend NPU Performance Testing — Ascend NPU performance benchmarking
- Ascend NPU Environment Variables — Environment variable reference
