# Ascend NPU Performance Profiling Guide

During inference serving, it is sometimes necessary to monitor the internal
execution flow of the serving framework to identify performance issues. By
collecting start/end timestamps of key flows, identifying critical functions or
iterations, recording key events, and gathering relevant information, you can
quickly locate performance bottlenecks.

This guide walks you through the complete workflow of collecting performance
data in an SGLang Ascend NPU inference service — from preparation, collection,
and analysis to visualization — helping you get started with performance
profiling quickly.

For more profiling scenarios (e.g., Nsight Systems or PD disaggregation),
see [SGLang Benchmark and Profiling](/docs/developer_guide/benchmark_and_profiling).

## Ascend PyTorch Profiler

SGLang has built-in PyTorch Profiler support. Through the Ascend `torch_npu`
backend, you can directly collect NPU operator-level performance data. No
additional packages are required — profiling start/stop is controlled via API
requests.

### 1. Environment Setup

Set the `SGLANG_TORCH_PROFILER_DIR` environment variable to control where trace
files are saved, then launch the SGLang server. Once the server is up, the
profiler stays idle until it receives a `/start_profile` request.

```shell Command theme={null}
# Set the performance data output directory
export SGLANG_TORCH_PROFILER_DIR=./sglang_profile

# Start SGLang server (use local model path or HuggingFace model id)
sglang serve \
  --model-path /path/to/your/model \
  --attention-backend ascend \
  --host 0.0.0.0 --port 30000 \
  --tp-size 1 \
  --max-running-requests 128
```

<Note>
  On Ascend NPU, SGLang uses `torch_npu._apply_patches()` to automatically
  redirect PyTorch Profiler's CUDA activity to NPU, so
  `activities: ["CPU", "GPU"]` actually captures NPU operator events.
</Note>

**Profiling-related environment variables:**

<table>
  <thead>
    <tr>
      <th>Variable</th>
      <th>Description</th>
      <th>Default</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td><code>SGLANG\_TORCH\_PROFILER\_DIR</code></td>
      <td>Trace file output directory</td>
      <td><code>/tmp</code></td>
    </tr>

    <tr>
      <td><code>SGLANG\_PROFILE\_WITH\_STACK</code></td>
      <td>Record Python call stack (True / False)</td>
      <td><code>True</code></td>
    </tr>

    <tr>
      <td><code>SGLANG\_PROFILE\_RECORD\_SHAPES</code></td>
      <td>Record operator input shapes (True / False)</td>
      <td><code>True</code></td>
    </tr>
  </tbody>
</table>

### 2. Collection Methods

SGLang provides four collection methods. They differ mainly in **whether you
need to manually send `/start_profile` and `/stop_profile`**. All four methods
produce the same trace output, so choose whichever is most convenient.

**Method comparison:**

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Manual start\_profile</th>
      <th>Manual stop\_profile</th>
      <th>Notes</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>A: API manual start/stop</td>
      <td>Yes</td>
      <td>Yes</td>
      <td>Maximum flexibility for precise control</td>
    </tr>

    <tr>
      <td>B: API auto-stop</td>
      <td>Yes</td>
      <td>No</td>
      <td>Set <code>num\_steps</code>, auto-stops and generates output</td>
    </tr>

    <tr>
      <td>C: bench\_serving --profile</td>
      <td>No</td>
      <td>No</td>
      <td>Benchmark + profiling in one command</td>
    </tr>

    <tr>
      <td>D: sglang.profiler CLI</td>
      <td>No</td>
      <td>No</td>
      <td>Standalone profiling CLI tool</td>
    </tr>
  </tbody>
</table>

#### Method A: API Manual Start/Stop

Send `/start_profile` to start → send workload requests → send `/stop_profile`
to stop. After stopping, the server automatically parses the data — **no need to
manually call `analyse()`**.

```bash Command theme={null}
# Step 1: Start profiling (no num_steps, requires manual stop)
curl -X POST http://127.0.0.1:30000/start_profile \
  -H "Content-Type: application/json" \
  -d '{
    "output_dir": "./sglang_profile",
    "start_step": 1,
    "activities": ["CPU", "GPU"]
  }'

# Step 2: Send workload requests (using curl as example)
curl http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 10}}'

# Step 3: Stop profiling
curl -X POST http://127.0.0.1:30000/stop_profile
```

<Note>
  `/stop_profile` returns `"Stop profiling. This will take some time."` — the
  server needs time to flush trace data to disk and parse it. Wait for the
  response to complete.
  This method can take a significant amount of time to parse profiling data;
  consider using **Method B** instead to avoid lengthy waits.
</Note>
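The three steps above can also be scripted. The following is a minimal sketch that uses only the Python standard library. The helper names (`build_request`, `post`, `profile_manually`) are illustrative and not part of SGLang, the base URL assumes the launch command shown earlier, and the 600-second timeout is an arbitrary allowance for trace parsing.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:30000"  # assumption: matches the launch command


def build_request(path, payload=None, base_url=BASE_URL):
    """Build a POST request; a JSON body is attached only if payload is given."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if payload is not None else {}
    return urllib.request.Request(
        base_url + path, data=data, headers=headers, method="POST"
    )


def post(path, payload=None, timeout=600):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def profile_manually(prompt="Hello", max_new_tokens=10):
    """Hypothetical helper: start profiling, run one request, then stop."""
    post("/start_profile", {
        "output_dir": "./sglang_profile",
        "start_step": 1,
        "activities": ["CPU", "GPU"],
    })
    try:
        post("/generate", {
            "text": prompt,
            "sampling_params": {"max_new_tokens": max_new_tokens},
        })
    finally:
        # Stop even if the workload fails so the profiler is not left running.
        # This call blocks while the server flushes and parses the trace.
        stop_response = post("/stop_profile")
    return stop_response
```

Calling `profile_manually()` against a running server performs the same sequence as the `curl` commands. Nothing is sent at import time.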

#### Method B: API Auto-Stop

Specify `num_steps` in the `/start_profile` request. Profiling stops
automatically after N steps and generates output — **no need to manually send
`/stop_profile`**.

```bash Command theme={null}
# start_step=3 skips the warmup steps; num_steps=10 auto-stops after 10 profiled steps
curl -X POST http://127.0.0.1:30000/start_profile \
  -H "Content-Type: application/json" \
  -d '{
    "output_dir": "./sglang_profile",
    "start_step": 3,
    "num_steps": 10,
    "activities": ["CPU", "GPU"]
  }'

# Just send workload — no /stop_profile needed
curl http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 32}}'
```
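The difference between Method A and Method B comes down to whether `num_steps` is present in the `/start_profile` payload. A small sketch of that rule follows; `needs_manual_stop` is a hypothetical helper written for illustration, not an SGLang function.

```python
def needs_manual_stop(start_payload):
    """Return True if the caller must send /stop_profile itself.

    Profiling auto-stops only when `num_steps` is set in the
    /start_profile payload.
    """
    return start_payload.get("num_steps") is None


# Payloads copied from the Method A and Method B examples.
method_a = {"output_dir": "./sglang_profile", "start_step": 1,
            "activities": ["CPU", "GPU"]}
method_b = {"output_dir": "./sglang_profile", "start_step": 3,
            "num_steps": 10, "activities": ["CPU", "GPU"]}

print(needs_manual_stop(method_a))  # True
print(needs_manual_stop(method_b))  # False
```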

#### Method C: bench\_serving --profile

Use SGLang's built-in `bench_serving` with the `--profile` flag.
**Automatically handles `/start_profile` and `/stop_profile`** — no manual API
calls needed.

```bash Command theme={null}
# With --profile-steps: auto-stops after N steps and generates output
python -m sglang.bench_serving \
  --backend sglang \
  --base-url http://127.0.0.1:30000 \
  --model /path/to/your/model \
  --tokenizer /path/to/your/model \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 100 \
  --num-prompts 10 \
  --profile \
  --profile-steps 10 \
  --profile-output-dir ./sglang_profile

# Without --profile-steps: /stop_profile sent automatically after benchmark
python -m sglang.bench_serving \
  --backend sglang \
  --base-url http://127.0.0.1:30000 \
  --model /path/to/your/model \
  --tokenizer /path/to/your/model \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 100 \
  --num-prompts 10 \
  --profile \
  --profile-output-dir ./sglang_profile
```

<Note>
  `--profile-steps N` sends `"num_steps": N` to the server's `/start_profile`, so
  the server auto-stops and parses data after N steps — bench\_serving skips
  sending `/stop_profile`.
</Note>

<Note>
  `bench_serving --profile` creates a timestamp subdirectory inside
  `--profile-output-dir` (e.g. `<output_dir>/<timestamp>/`). The output path is
  shown in the server log as `Profiling done. Traces are saved to: <path>`.
</Note>

**`bench_serving --profile` parameters:**

<table>
  <thead>
    <tr><th>Parameter</th><th>Description</th></tr>
  </thead>

  <tbody>
    <tr><td><code>--profile</code></td><td>Enable auto profiling start/stop</td></tr>
    <tr><td><code>--profile-steps N</code></td><td>Auto-stop after N steps (skips /stop\_profile)</td></tr>
    <tr><td><code>--profile-output-dir</code></td><td>Trace output directory</td></tr>
  </tbody>
</table>
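When the benchmark is launched from a script, the two command variants above can be generated from one function. This is a sketch under the assumption that `sglang` is installed in the current Python environment; `bench_profile_cmd` is an illustrative name, and only flags already shown in this section are used.

```python
import shlex
import sys


def bench_profile_cmd(model_path, output_dir, profile_steps=None,
                      base_url="http://127.0.0.1:30000", num_prompts=10,
                      input_len=1024, output_len=100):
    """Build the bench_serving argument list used in the commands above."""
    cmd = [
        sys.executable, "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--base-url", base_url,
        "--model", model_path,
        "--tokenizer", model_path,
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--num-prompts", str(num_prompts),
        "--profile",
        "--profile-output-dir", output_dir,
    ]
    if profile_steps is not None:
        # With --profile-steps the server auto-stops; without it,
        # bench_serving sends /stop_profile after the benchmark.
        cmd += ["--profile-steps", str(profile_steps)]
    return cmd


cmd = bench_profile_cmd("/path/to/your/model", "./sglang_profile", profile_steps=10)
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```

Returning a list rather than a shell string avoids quoting problems when the model path contains spaces.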

#### Method D: sglang.profiler CLI

Use the `sglang.profiler` CLI module, which automatically sends
`/start_profile` and waits for completion. **Start `sglang.profiler` first,
then send inference requests** (otherwise there are no steps to capture and the
profiler will wait indefinitely).

```bash Command theme={null}
# Terminal 1: Start sglang.profiler first (sends /start_profile, then waits for completion)
python3 -m sglang.profiler \
  --url http://127.0.0.1:30000 \
  --output-dir ./my_profiles \
  --num-steps 3 \
  --cpu --gpu &
```

```bash Command theme={null}
# Terminal 2: Immediately send inference requests to provide steps for profiling
curl http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 32}}'
```

A simpler and more reliable approach is to use `bench_serving --profile`, which
handles both steps automatically:

```bash Command theme={null}
python3 -m sglang.bench_serving \
  --backend sglang \
  --base-url http://127.0.0.1:30000 \
  --model /path/to/your/model \
  --tokenizer /path/to/your/model \
  --dataset-name random \
  --random-input-len 128 \
  --random-output-len 32 \
  --num-prompts 10 \
  --profile \
  --profile-steps 3 \
  --profile-output-dir ./my_profiles
```

<Note>
  `sglang.profiler` is essentially a CLI wrapper around the `/start_profile` API.
  Advanced options like `--profile-by-stage` are also supported. On Ascend NPU,
  trace flushing is asynchronous and may take a while — the CLI may occasionally
  block waiting for flush. If it times out, use Method B (API auto-stop) or
  Method C (bench\_serving --profile) instead.
</Note>

**`sglang.profiler` CLI parameters:**

<table>
  <thead>
    <tr><th>Parameter</th><th>Description</th></tr>
  </thead>

  <tbody>
    <tr><td><code>--url</code></td><td>SGLang server address</td></tr>

    <tr>
      <td><code>--output-dir</code></td>
      <td>Output directory (defaults to <code>SGLANG\_TORCH\_PROFILER\_DIR</code>)</td>
    </tr>

    <tr><td><code>--num-steps</code></td><td>Number of steps to profile</td></tr>

    <tr>
      <td><code>--profile-by-stage</code></td>
      <td>Profile prefill / decode stages separately</td>
    </tr>

    <tr><td><code>--profile-prefix</code></td><td>Trace filename prefix</td></tr>

    <tr>
      <td><code>--cpu</code> / <code>--gpu</code> / <code>--mem</code> / <code>--rpd</code></td>
      <td>Activity types to collect</td>
    </tr>
  </tbody>
</table>
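The ordering requirement for Method D (start the CLI first, then send traffic) can be captured in a short script. This is a sketch, not a tested integration: `profiler_cmd` and `profile_with_cli` are hypothetical helpers, `send_workload` is a caller-supplied function that issues inference requests, and the `startup_delay` and `timeout` defaults are guesses that may need tuning.

```python
import subprocess
import sys
import time


def profiler_cmd(url, output_dir, num_steps, activities=("cpu", "gpu")):
    """Build the sglang.profiler argument list from the flags listed above."""
    cmd = [sys.executable, "-m", "sglang.profiler",
           "--url", url, "--output-dir", output_dir,
           "--num-steps", str(num_steps)]
    cmd += [f"--{name}" for name in activities]
    return cmd


def profile_with_cli(send_workload, url="http://127.0.0.1:30000",
                     output_dir="./my_profiles", num_steps=3,
                     startup_delay=2.0, timeout=600):
    """Start the CLI first, then drive traffic, then wait for the CLI to exit."""
    proc = subprocess.Popen(profiler_cmd(url, output_dir, num_steps))
    try:
        # Assumed grace period so the CLI can send /start_profile first.
        time.sleep(startup_delay)
        send_workload()
        # Raises subprocess.TimeoutExpired if the trace flush never finishes.
        return proc.wait(timeout=timeout)
    finally:
        if proc.poll() is None:
            proc.kill()
```

If the wait times out, the `finally` clause terminates the CLI process so the script does not hang.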

### 3. Full Parameter Reference

All methods ultimately send a `/start_profile` request to the server. The full
set of supported parameters:

<table>
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Description</th>
      <th>Default</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td><code>output\_dir</code></td>

      <td>
        Output directory. Falls back to
        <code>SGLANG\_TORCH\_PROFILER\_DIR</code> or <code>/tmp</code>
      </td>

      <td><code>/tmp</code></td>
    </tr>

    <tr>
      <td><code>num\_steps</code></td>

      <td>
        Number of steps. If set, profiling auto-stops — no /stop\_profile needed
      </td>

      <td>None</td>
    </tr>

    <tr>
      <td><code>start\_step</code></td>

      <td>
        Step index to start profiling (inclusive), for skipping warmup
      </td>

      <td>0</td>
    </tr>

    <tr>
      <td><code>activities</code></td>

      <td>
        Activity types: CPU, GPU, MEM, RPD. On Ascend NPU, primarily CPU and GPU
      </td>

      <td><code>\["CPU", "GPU"]</code></td>
    </tr>

    <tr>
      <td><code>profile\_by\_stage</code></td>
      <td>Profile prefill and decode stages separately</td>
      <td><code>false</code></td>
    </tr>

    <tr>
      <td><code>with\_stack</code></td>

      <td>
        Record Python call stack. Also controllable via
        <code>SGLANG\_PROFILE\_WITH\_STACK</code>
      </td>

      <td><code>true</code></td>
    </tr>

    <tr>
      <td><code>record\_shapes</code></td>

      <td>
        Record operator input shapes. Also controllable via
        <code>SGLANG\_PROFILE\_RECORD\_SHAPES</code>
      </td>

      <td><code>true</code></td>
    </tr>

    <tr>
      <td><code>profile\_prefix</code></td>
      <td>Prefix for trace filenames</td>
      <td>None</td>
    </tr>

    <tr>
      <td><code>profile\_stages</code></td>

      <td>
        Stages to profile, e.g. <code>\["prefill", "decode"]</code>.
        Requires <code>profile\_by\_stage</code>
      </td>

      <td>None</td>
    </tr>
  </tbody>
</table>
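To keep request payloads consistent across scripts, the parameters in the table can be wrapped in a small data class. The sketch below is illustrative only: `StartProfileRequest` is not an SGLang class, and it deliberately leaves unset fields out of the payload so the server applies the defaults listed above.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class StartProfileRequest:
    """Illustrative container; field names mirror the table above."""
    output_dir: Optional[str] = None         # unset: env var or /tmp
    num_steps: Optional[int] = None          # unset: manual /stop_profile
    start_step: Optional[int] = None         # unset: table default 0
    activities: Optional[List[str]] = None   # unset: ["CPU", "GPU"]
    profile_by_stage: Optional[bool] = None  # unset: false
    with_stack: Optional[bool] = None        # unset: true
    record_shapes: Optional[bool] = None     # unset: true
    profile_prefix: Optional[str] = None
    profile_stages: Optional[List[str]] = None

    def to_payload(self):
        """Validate and return a JSON-ready dict of the fields that were set."""
        if self.profile_stages and not self.profile_by_stage:
            raise ValueError("profile_stages requires profile_by_stage=True")
        if self.num_steps is not None and self.num_steps <= 0:
            raise ValueError("num_steps must be a positive integer")
        # Drop unset fields so the server applies its own defaults.
        return {k: v for k, v in asdict(self).items() if v is not None}


print(StartProfileRequest(start_step=3, num_steps=10).to_payload())
```

The two checks encode constraints stated in the table (`profile_stages` requires `profile_by_stage`) plus a sanity check on `num_steps` that is an assumption of this sketch.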

### 4. Finding Output Files

**The server log explicitly indicates where traces are saved.** You can find
them via:

* **When profiling starts**: server log outputs
  `Profiling starts. Traces will be saved to: <path> (with profile id: <id>)`

```
[2026-05-19 13:23:15] Profiling starts. Traces will be saved to: /tmp/1779196995.6948605 (with profile id: 1779196995.6979997)
[2026-05-19 13:23:15] [WARNING] [350443] profiler.py: Invalid parameter export_type: None, reset it to text.
[2026-05-19 13:23:15] [WARNING] [350443] profiler.py: Invalid parameter export_type: None, reset it to text.
[2026-05-19 13:23:15] INFO:     127.0.0.1:40714 - "POST /start_profile HTTP/1.1" 200 OK
```

* **When profiling stops**: server log outputs
  `Profiling done. Traces are saved to: <path>`

```
[2026-05-19 13:23:17] Stop profiling...
[2026-05-19 13:23:17] [WARNING] [350443] profiler.py: Incorrect schedule: Stop profiler while current state is RECORD which may result in incomplete parsed data.
[rank0]:[W519 13:23:17.084812760 compiler_depend.ts:3136] Warning: The indexFromRank 0is not equal indexFromCurDevice 4 , which might be normal if the number of devices on your collective communication server is inconsistent.Otherwise, you need to check if the current device is correct when calling the interface.If it's incorrect, it might have introduced an error. (function operator())
[2026-05-19 13:23:17] [INFO] [352725] profiler.py: Start parsing profiling data: /tmp/1779196995.6948605/localhost.localdomain_350443_20260519132315700_ascend_pt
[2026-05-19 13:23:22] [INFO] [352734] profiler.py: CANN profiling data parsed in a total time of 0:00:04.022310
[2026-05-19 13:23:32] [INFO] [352725] profiler.py: All profiling data parsed in a total time of 0:00:14.305669
[2026-05-19 13:23:32] Profiling done. Traces are saved to: /tmp/1779196995.6948605
```

* **CLI output**: `sglang.profiler` outputs `Dump profiling traces to <path>`

```
Dump profiling traces to /tmp/1779243331.3219
Waiting for 10 steps and the trace to be flushed.... (profile_by_stage=False)
```

The directory structure is
`<output_dir>/<hostname>_<pid>_<timestamp>_ascend_pt/`. When using Method C
(`bench_serving --profile`), a timestamp subdirectory is added:
`<output_dir>/<timestamp>/`. Always check the server log for the exact path:
`Profiling done. Traces are saved to: <path>`.
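Because the exact path only appears in the server log, it can be convenient to extract it programmatically. The sketch below matches the two log messages quoted above with regular expressions. `find_trace_dirs` is a hypothetical helper, and the patterns assume the path contains no whitespace.

```python
import re

# Patterns derived from the log messages quoted above.
_START = re.compile(r"Profiling starts\. Traces will be saved to: (\S+)")
_DONE = re.compile(r"Profiling done\. Traces are saved to: (\S+)")


def find_trace_dirs(log_lines):
    """Return (started, finished) lists of trace directories seen in the log."""
    started, finished = [], []
    for line in log_lines:
        if (m := _START.search(line)):
            started.append(m.group(1))
        elif (m := _DONE.search(line)):
            finished.append(m.group(1))
    return started, finished


sample = [
    "[2026-05-19 13:23:15] Profiling starts. Traces will be saved to: "
    "/tmp/1779196995.6948605 (with profile id: 1779196995.6979997)",
    "[2026-05-19 13:23:17] Stop profiling...",
    "[2026-05-19 13:23:32] Profiling done. Traces are saved to: "
    "/tmp/1779196995.6948605",
]
started, finished = find_trace_dirs(sample)
# Directories that started but have not finished are still being parsed.
pending = [d for d in started if d not in finished]
print(started, finished, pending)
```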

### 5. Viewing Results

After profiling stops (either `/stop_profile` returns or the `num_steps` limit
is reached), the server **automatically parses the raw data**. The
`ASCEND_PROFILER_OUTPUT` directory inside each `*_ascend_pt` directory then
contains the following files, with **no need to manually call `analyse()`**:

<table>
  <thead>
    <tr><th>File</th><th>Description</th></tr>
  </thead>

  <tbody>
    <tr>
      <td><code>trace\_view\.json</code></td>

      <td>
        Chrome Tracing format. Open in
        <a href="https://www.hiascend.com/document/detail/zh/mindstudio/81RC1/GUI_baseddevelopmenttool/msascendinsightug/Insight_userguide_0002.html">MindStudio Insight</a>
      </td>
    </tr>

    <tr><td><code>analysis.db</code></td><td>Database-format performance data</td></tr>

    <tr>
      <td><code>ascend\_pytorch\_profiler\_0.db</code></td>
      <td>Database-format performance data</td>
    </tr>

    <tr><td><code>kernel\_details.csv</code></td><td>Kernel-level data</td></tr>
    <tr><td><code>operator\_details.csv</code></td><td>Operator-level data</td></tr>
    <tr><td><code>step\_trace\_time.csv</code></td><td>Step trace timing data</td></tr>
  </tbody>
</table>

<Note>
  `trace_view.json` can also be opened using Chrome's built-in
  `chrome://tracing` or [Perfetto UI](https://ui.perfetto.dev/).
</Note>
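For a quick text summary without opening a viewer, `trace_view.json` can be aggregated directly. The sketch below assumes the file follows the Chrome Trace Event Format and that operator events are recorded as complete events (`"ph": "X"`) with a `dur` field in microseconds. `top_events` is an illustrative helper, and it loads the whole file into memory, so it suits small traces only.

```python
import json
from collections import defaultdict


def top_events(trace_path, n=10):
    """Sum the duration of complete ("ph": "X") events by name."""
    with open(trace_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # The Trace Event Format allows a bare list or {"traceEvents": [...]}.
    events = data["traceEvents"] if isinstance(data, dict) else data
    totals = defaultdict(float)
    for ev in events:
        if ev.get("ph") == "X" and "dur" in ev:
            totals[ev.get("name", "<unnamed>")] += float(ev["dur"])
    # Largest total duration first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

For example, `top_events("<trace_dir>/ASCEND_PROFILER_OUTPUT/trace_view.json")` returns a list of `(name, total_duration)` pairs, where `<trace_dir>` stands for the `*_ascend_pt` directory reported in the server log.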

<Note>
  If you need to merge distributed trace files in a multi-node deployment, set
  `"merge_profiles": true` in the `/start_profile` request. Note: on Ascend NPU,
  the merger has limited support for the `*_ascend_pt` format — check
  `trace_view.json` on each node individually. See
  [Benchmark and Profiling](/docs/developer_guide/benchmark_and_profiling#profiler-trace-merger-for-distributed-traces)
  for details.
</Note>

### 6. Re-parsing Raw Data (Optional)

If you need to **re-parse existing data** with different parameters, or if
profiling was interrupted and `ASCEND_PROFILER_OUTPUT` was not auto-generated,
use `torch_npu`'s `analyse()` tool:

```python theme={null}
from torch_npu.profiler.profiler import analyse

# Replace the placeholder with the actual *_ascend_pt directory from the server log.
analyse("./sglang_profile/<hostname>_*_ascend_pt/")
```

<Note>
  You normally do **not need** to run `analyse()` manually because the server
  already parses the data automatically. Use it only to re-parse existing data
  or to recover data from an interrupted run.
</Note>
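To decide which runs actually need re-parsing, you can look for `*_ascend_pt` directories that have no `ASCEND_PROFILER_OUTPUT` subdirectory. The following sketch only lists candidates and does not call `analyse()` itself. `unparsed_trace_dirs` is a hypothetical helper, and the recursive search is there to cover the timestamp subdirectory that `bench_serving --profile` adds.

```python
from pathlib import Path


def unparsed_trace_dirs(output_dir):
    """List *_ascend_pt directories that lack ASCEND_PROFILER_OUTPUT."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(
        d for d in root.rglob("*_ascend_pt")
        if d.is_dir() and not (d / "ASCEND_PROFILER_OUTPUT").is_dir()
    )


for trace_dir in unparsed_trace_dirs("./sglang_profile"):
    print(f"needs analyse(): {trace_dir}")
```

Each printed path can then be passed to `analyse()` as shown above.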

## Best Practices

### Common Notes

* **Finding output**: Check the server log for
  `Profiling starts. Traces will be saved to: <path>` and
  `Profiling done. Traces are saved to: <path>`, or `sglang.profiler` output for
  `Dump profiling traces to <path>`.
* **Control trace file size**: Reduce the number of requests and the output
  length with `--num-prompts` and `--random-output-len` so that trace files stay
  small enough for a browser to load.
* **Warmup iterations**: Set `start_step` to skip the first few warmup steps and
  capture performance data under steady state.
* **Profile step count**: Large values for `num_steps` or `--profile-steps` can
  lead to lengthy profiling data parsing times. Reduce these values
  appropriately when you only need a quick overview.
* **CUDA Graph impact**: To see the full Python call stack → operator mapping in
  traces, add `--disable-cuda-graph` when starting the server. Note that this
  reduces decode performance — only use during profiling. To analyze CUDA Graph
  capture specifically, use `--enable-profile-cuda-graph` — traces are saved to
  `SGLANG_TORCH_PROFILER_DIR/graph_capture_profile/`.
* **Multi-node deployment**: In multi-node environments, performance data is
  distributed across nodes. On Ascend NPU, the `merge_profiles` feature has
  limited support — check `*_ascend_pt/ASCEND_PROFILER_OUTPUT/trace_view.json`
  on each node individually. In PD disaggregation mode, prefill and decode
  workers must be profiled separately — see
  [Profile In PD Disaggregation Mode](/docs/developer_guide/benchmark_and_profiling#profile-in-pd-disaggregation-mode).

## See Also

* [SGLang Benchmark and Profiling](/docs/developer_guide/benchmark_and_profiling)
  — General SGLang profiling guide
* [Ascend NPU Quickstart](/docs/hardware-platforms/ascend-npus/ascend_npu_quick_start)
  — Ascend NPU environment setup
* [Ascend NPU Optimization](/docs/hardware-platforms/ascend-npus/ascend_npu_optimization)
  — Ascend NPU optimization parameters
* [Ascend NPU Performance Testing](/docs/hardware-platforms/ascend-npus/ascend_npu_performance_testing)
  — Ascend NPU performance benchmarking
* [Ascend NPU Environment Variables](/docs/hardware-platforms/ascend-npus/ascend_npu_environment_variables)
  — Environment variable reference
