
# XPU

This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on Intel GPUs. For more context on Intel GPU support in the PyTorch ecosystem, see ["Getting Started on Intel GPU"](https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html).

Specifically, SGLang is optimized for [Intel® Arc™ Pro B-Series Graphics](https://www.intel.com/content/www/us/en/ark/products/series/242616/intel-arc-pro-b-series-graphics.html)
and [Intel® Arc™ B-Series Graphics](https://www.intel.com/content/www/us/en/ark/products/series/240391/intel-arc-b-series-graphics.html).

## Optimized Model List

Several LLMs have been optimized for Intel GPUs, and more are on the way:

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}>
  <colgroup>
    <col style={{width: "50%"}} />

    <col style={{width: "50%"}} />
  </colgroup>

  <thead>
    <tr style={{borderBottom: "2px solid #d55816"}}>
      <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Model Name</th>
      <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>BF16</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Llama-3.2-3B</td>
      <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>[meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)</td>
    </tr>

    <tr>
      <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Llama-3.1-8B</td>
      <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>[meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)</td>
    </tr>

    <tr>
      <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Qwen2.5-1.5B</td>
      <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>[Qwen/Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B)</td>
    </tr>
  </tbody>
</table>

**Note:** The models listed in the table above
have been verified on [Intel® Arc™ B580 Graphics](https://www.intel.com/content/www/us/en/products/sku/241598/intel-arc-b580-graphics/specifications.html).

## Installation

### Install From Source

Currently, SGLang on XPU can only be installed from source. Refer to ["Getting Started on Intel GPU"](https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html) to install the XPU dependencies.

```bash Command theme={null}
# Create and activate a conda environment
conda create -n sgl-xpu python=3.12 -y
conda activate sgl-xpu

# Set PyTorch XPU as primary pip install channel to avoid installing the larger CUDA-enabled version and prevent potential runtime issues.
pip3 install torch==2.12.0+xpu torchao==0.17.0+xpu torchvision==0.27.0+xpu torchaudio==2.11.0+xpu --index-url https://download.pytorch.org/whl/xpu
pip3 install xgrammar --no-deps # xgrammar will introduce CUDA-enabled triton which might conflict with XPU
pip3 install apache-tvm-ffi # xgrammar requires apache-tvm-ffi

# Clone the SGLang code
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <YOUR-DESIRED-VERSION>

# Use dedicated toml file
cd python
cp pyproject_xpu.toml pyproject.toml
# Install SGLang dependent libs, and build SGLang main package
pip install --upgrade pip setuptools
pip install -v . --extra-index-url https://download.pytorch.org/whl/xpu
```
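
After the installation, it can help to confirm that PyTorch actually sees the Intel GPU before launching SGLang. The following is a minimal sketch, not part of the official setup; the helper name `xpu_status` is hypothetical, and it only relies on the `torch.xpu` device API.

```python
"""Report whether PyTorch can see an Intel GPU (XPU)."""


def xpu_status() -> str:
    """Return a one-line summary of XPU visibility in the current environment."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"

    # torch.xpu is only usable in XPU-enabled PyTorch builds.
    if not hasattr(torch, "xpu") or not torch.xpu.is_available():
        return f"torch {torch.__version__}: no XPU device visible"

    names = [torch.xpu.get_device_name(i) for i in range(torch.xpu.device_count())]
    return f"torch {torch.__version__}: {len(names)} XPU device(s): {', '.join(names)}"


if __name__ == "__main__":
    print(xpu_status())
```

If no device is reported, recheck the driver setup and make sure the `+xpu` wheels were installed rather than a CUDA-enabled build.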

### Install Using Docker

[The SGLang XPU Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/xpu.Dockerfile) is provided to facilitate the installation.
Replace `<secret>` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens).

```bash Command theme={null}
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker

# Build the docker image
docker build -t sglang-xpu:latest -f xpu.Dockerfile .

# Initiate a docker container
docker run \
    -it \
    --privileged \
    --ipc=host \
    --network=host \
    --user root \
    --group-add $(getent group video | cut -d: -f3) \
    --device /dev/dri \
    -v /dev/dri/by-path:/dev/dri/by-path \
    -v /dev/shm:/dev/shm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 30000:30000 \
    -e "HF_TOKEN=<secret>" \
    sglang-xpu:latest /bin/bash
```

## Launch of the Serving Engine

Example command to launch SGLang serving:

```bash theme={null}
# --tp 2: use multiple GPUs (here, two)
# --attention-backend intel_xpu: use the Intel-optimized XPU attention backend
# --page-size: the intel_xpu attention backend supports 32, 64, and 128
sglang serve                         \
    --model-path <MODEL_ID_OR_PATH>  \
    --trust-remote-code              \
    --disable-overlap-schedule       \
    --device xpu                     \
    --host 0.0.0.0                   \
    --tp 2                           \
    --attention-backend intel_xpu    \
    --page-size <PAGE_SIZE>
```
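
When the launch command is generated from a script, the page-size constraint can be checked before the server starts. The sketch below is one way to do that, using only the flags from the command above. The helper name `build_serve_command`, the default `page_size=64`, and the example model ID are illustrative assumptions, not SGLang defaults.

```python
import shlex

# Page sizes listed above for the intel_xpu attention backend.
SUPPORTED_PAGE_SIZES = (32, 64, 128)


def build_serve_command(model: str, tp: int = 1, page_size: int = 64) -> list[str]:
    """Assemble the `sglang serve` argv, rejecting unsupported page sizes."""
    if page_size not in SUPPORTED_PAGE_SIZES:
        raise ValueError(
            f"page_size={page_size} is not one of {SUPPORTED_PAGE_SIZES}"
        )
    if tp < 1:
        raise ValueError("tp must be a positive integer")
    return [
        "sglang", "serve",
        "--model-path", model,
        "--trust-remote-code",
        "--disable-overlap-schedule",
        "--device", "xpu",
        "--host", "0.0.0.0",
        "--tp", str(tp),
        "--attention-backend", "intel_xpu",
        "--page-size", str(page_size),
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line; the model ID is only an example.
    print(shlex.join(build_serve_command("Qwen/Qwen2.5-1.5B", tp=2, page_size=64)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues.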

## Benchmarking with Requests

You can benchmark the performance via the `bench_serving` script.
Run the command in another terminal.

```bash theme={null}
python -m sglang.bench_serving   \
    --dataset-name random        \
    --random-input-len 1024      \
    --random-output-len 1024     \
    --num-prompts 1              \
    --request-rate inf           \
    --random-range-ratio 1.0
```

Detailed explanations of the parameters are available via:

```bash theme={null}
python -m sglang.bench_serving -h
```

Requests can also be formed with the
[OpenAI Completions API](../basic_usage/openai_api_completions)
and sent via the command line (e.g. using `curl`) or via your own script.
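
As a sketch of the "own script" option, the following uses only the Python standard library to build and send a Completions request. It assumes the server is reachable at `http://127.0.0.1:30000` (adjust `BASE_URL` to match your `--host` and `--port`); the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the server is reachable locally on port 30000.
BASE_URL = "http://127.0.0.1:30000"


def build_completion_request(model: str, prompt: str, max_tokens: int = 32,
                             base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/completions."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model: str, prompt: str, max_tokens: int = 32) -> str:
    """Send the request and return the text of the first choice."""
    request = build_completion_request(model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running, `print(complete("<MODEL_ID_OR_PATH>", "The capital of France is"))` prints the generated continuation.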

## XPU Graph \[Experimental]

SGLang enables XPU graph capture to reduce per-step kernel-launch overhead.

| Phase   | Backend        | Mechanism                                                                                                            | Default          |
| ------- | -------------- | -------------------------------------------------------------------------------------------------------------------- | ---------------- |
| Decode  | `full`         | One `torch.xpu.XPUGraph` per batch size, captured on startup                                                         | **Off** (opt-in) |
| Prefill | `tc_piecewise` | `torch.compile` + XPU graph, one graph segment per token-length bucket                                               | **Off** (opt-in) |
| Prefill | `breakable`    | Segmented `torch.xpu.XPUGraph` capture/replay (no `torch.compile`); eager break points at attention / MoE boundaries | **Off** (opt-in) |

### Enable Decode Graph

Decode graph capture is **opt-in** on XPU. Enable it explicitly:

```bash theme={null}
python -m sglang.launch_server --model-path <MODEL> --device xpu \
    --cuda-graph-backend-decode full
```

### Enable Prefill Graph

Prefill graph capture is **opt-in** on XPU and must be enabled explicitly.
Two backends are available: `tc_piecewise` and `breakable`.

#### tc\_piecewise

Uses `torch.compile` plus an XPU graph, one graph segment per token-length
bucket:

```bash theme={null}
python -m sglang.launch_server --model-path <MODEL> --device xpu \
    --cuda-graph-backend-prefill tc_piecewise
```

By default, the prefill subgraphs are compiled in `eager` mode. Switch to
`inductor` for more optimized generated code at the cost of longer startup:

```bash theme={null}
python -m sglang.launch_server --model-path <MODEL> --device xpu \
    --cuda-graph-backend-prefill tc_piecewise \
    --cuda-graph-tc-compiler inductor
```

#### breakable

Captures the transformer stack as segmented `XPUGraph`s with eager break points
at attention / MoE boundaries, without `torch.compile`:

```bash theme={null}
python -m sglang.launch_server --model-path <MODEL> --device xpu \
    --cuda-graph-backend-prefill breakable
```

You can also configure both phases together with a single `--cuda-graph-config` JSON argument:

```bash theme={null}
python -m sglang.launch_server --model-path <MODEL> --device xpu \
    --cuda-graph-config '{"decode":{"backend":"full"},"prefill":{"backend":"tc_piecewise","tc_compiler":"eager"}}'
```
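
Hand-writing JSON inside shell quotes is easy to get wrong, so the value can be generated instead. The sketch below serializes only the keys that appear in the example above (`decode.backend`, `prefill.backend`, `prefill.tc_compiler`). The helper name `cuda_graph_config` is hypothetical, and omitting `tc_compiler` for non-`tc_piecewise` backends is an assumption of this sketch.

```python
import json
import shlex


def cuda_graph_config(decode_backend: str = "full",
                      prefill_backend: str = "tc_piecewise",
                      tc_compiler: str = "eager") -> str:
    """Serialize a graph config using the keys shown in the example above."""
    prefill = {"backend": prefill_backend}
    # Assumption: the compiler choice only matters for tc_piecewise.
    if prefill_backend == "tc_piecewise":
        prefill["tc_compiler"] = tc_compiler
    config = {"decode": {"backend": decode_backend}, "prefill": prefill}
    # Compact separators keep the argument short and free of spaces.
    return json.dumps(config, separators=(",", ":"))


if __name__ == "__main__":
    value = cuda_graph_config(tc_compiler="inductor")
    # shlex.quote wraps the JSON in single quotes for a POSIX shell.
    print(f"--cuda-graph-config {shlex.quote(value)}")
```

Called with its defaults, `cuda_graph_config()` reproduces the JSON string used in the command above.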

### Enable torch.compile for Decode

`--enable-torch-compile` adds a `torch.compile` pass on top of the decode
XPU graph: the model forward is compiled first, and the compiled forward is
then captured as an `XPUGraph`. This can reduce per-kernel overhead further
but increases startup time.

```bash theme={null}
python -m sglang.launch_server --model-path <MODEL> --device xpu \
    --enable-torch-compile
```

> **Note:** `--enable-torch-compile` is mutually exclusive with the prefill
> `tc_piecewise` graph (the compatibility rules auto-disable it). Use them
> separately or lock the prefill backend explicitly via `--cuda-graph-config`
> if you need both.

### Disable XPU Graph

Both phases are disabled by default, but you can also disable them explicitly:

```bash theme={null}
# Disable decode graph (already off by default; explicit form)
python -m sglang.launch_server --model-path <MODEL> --device xpu \
    --cuda-graph-backend-decode=disabled

# Disable prefill graph (already off by default; explicit form)
python -m sglang.launch_server --model-path <MODEL> --device xpu \
    --cuda-graph-backend-prefill=disabled

# Disable both phases
python -m sglang.launch_server --model-path <MODEL> --device xpu \
    --cuda-graph-backend-decode=disabled \
    --cuda-graph-backend-prefill=disabled
```

### Customize Capture Buckets

By default, prefill capture sizes are derived from `--chunked-prefill-size`.
To specify explicit token-length buckets:

```bash theme={null}
python -m sglang.launch_server \
    --model-path <MODEL> --device xpu \
    --cuda-graph-backend-prefill tc_piecewise \
    --cuda-graph-bs-prefill 64 128 256 512
```

To specify explicit decode graph batch sizes:

```bash theme={null}
python -m sglang.launch_server \
    --model-path <MODEL> --device xpu \
    --cuda-graph-bs-decode 1 2 4 8
```
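
To reason about which buckets to list, it can help to model how a captured size might be chosen for a given request. The sketch below illustrates the general bucketing idea only: it assumes a request is served by the smallest captured size that fits it and falls back to eager execution when none fits. This is not taken from SGLang's implementation, and `pick_bucket` is a hypothetical name.

```python
from bisect import bisect_left
from typing import Optional


def pick_bucket(size: int, buckets: list[int]) -> Optional[int]:
    """Return the smallest captured bucket that can hold `size`, else None."""
    ordered = sorted(buckets)
    index = bisect_left(ordered, size)
    # None models the assumed fallback to eager (non-graph) execution.
    return ordered[index] if index < len(ordered) else None


if __name__ == "__main__":
    prefill_buckets = [64, 128, 256, 512]
    for tokens in (50, 64, 100, 600):
        print(tokens, "->", pick_bucket(tokens, prefill_buckets))
```

Under this model, sizes that fall between buckets are padded up to the next bucket, so denser buckets trade capture time and memory for less padding.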

### Server Args

| Argument                       | XPU allowed values                      | Default      | Description                                                                                                                                                             |
| ------------------------------ | --------------------------------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--cuda-graph-backend-decode`  | `full`, `disabled`                      | `disabled`   | Backend for the decode phase. Only `full` is supported on XPU. Set to `full` to enable.                                                                                 |
| `--cuda-graph-backend-prefill` | `tc_piecewise`, `breakable`, `disabled` | `disabled`\* | Backend for the prefill phase. Set to `tc_piecewise` or `breakable` explicitly to enable.                                                                               |
| `--cuda-graph-tc-compiler`     | `eager`, `inductor`                     | `eager`      | Compiler for `tc_piecewise` prefill subgraphs. `inductor` produces more optimized code but has longer startup.                                                          |
| `--cuda-graph-bs-prefill`      | list of ints                            | auto         | Explicit token-length buckets to capture for prefill.                                                                                                                   |
| `--cuda-graph-bs-decode`       | list of ints                            | auto         | Explicit batch sizes to capture for decode.                                                                                                                             |
| `--cuda-graph-config`          | JSON string                             | —            | One-shot JSON config for both phases, e.g. `'{"decode":{"backend":"full"},"prefill":{"backend":"tc_piecewise","tc_compiler":"eager"}}'`. Overrides all per-phase flags. |
| `--disable-decode-cuda-graph`  | —                                       | `False`      | Shorthand for `--cuda-graph-backend-decode=disabled`.                                                                                                                   |
| `--disable-prefill-cuda-graph` | —                                       | `False`      | Shorthand for `--cuda-graph-backend-prefill=disabled`.                                                                                                                  |
| `--enable-torch-compile`       | —                                       | `False`      | Apply `torch.compile` on top of the decode XPU graph for further kernel optimization.                                                                                   |
| `--torch-compile-max-bs`       | int                                     | `32`         | Maximum batch size compiled by `torch.compile` when `--enable-torch-compile` is set.                                                                                    |

\* Prefill graph is auto-disabled on XPU unless you lock the backend explicitly
via `--cuda-graph-backend-prefill` or `--cuda-graph-config`.

### Limitations

| Feature                                          | Status              |
| ------------------------------------------------ | ------------------- |
| Memory saver (`--enable-memory-saver`)           | Not yet supported   |
| Two-batch overlap (`--enable-two-batch-overlap`) | Not yet supported   |
| Speculative decoding                             | Not yet implemented |

## Prefill-Decode (P/D) Disaggregation on Intel XPU \[Experimental]

SGLang supports prefill-decode disaggregation on Intel XPU using the [NIXL](https://github.com/ai-dynamo/nixl) KV-transfer backend.

**Tested models:**

|                                    Model                                    |                                               Notes                                              |
| :-------------------------------------------------------------------------: | :----------------------------------------------------------------------------------------------: |
|          [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)          | Used in integration tests; verified on Intel XPU with homogeneous P/D (XPU prefill + XPU decode) |
| [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |               Verified on Intel XPU with homogeneous P/D (XPU prefill + XPU decode)              |

**Prerequisites:** `pip install nixl sglang-router`

**Start the prefill server (GPU 0):**

```bash theme={null}
ZE_AFFINITY_MASK=0 UCX_POSIX_USE_PROC_LINK=n python -m sglang.launch_server \
    --model-path Qwen/Qwen3-0.6B --trust-remote-code --device xpu \
    --disaggregation-mode prefill --disaggregation-transfer-backend nixl \
    --disaggregation-bootstrap-port 12335 --host 0.0.0.0 --port 30000
```

**Start the decode server (GPU 1):**

```bash theme={null}
ZE_AFFINITY_MASK=1 UCX_POSIX_USE_PROC_LINK=n python -m sglang.launch_server \
    --model-path Qwen/Qwen3-0.6B --trust-remote-code --device xpu \
    --disaggregation-mode decode --disaggregation-transfer-backend nixl \
    --disaggregation-bootstrap-port 12335 --host 0.0.0.0 --port 30001
```

**Start the router:**

```bash theme={null}
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://127.0.0.1:30000 \
    --decode  http://127.0.0.1:30001 \
    --host 0.0.0.0 --port 8000
```

**Send a request:**

```bash theme={null}
curl http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen3-0.6B", "prompt": "The capital of France is", "max_tokens": 32}'
```
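
The prefill and decode launch commands above differ only in the GPU index, the disaggregation mode, and the port. The following sketch makes that explicit by generating both from one helper; `pd_server_launch` is a hypothetical name, and the flags and environment variables are copied from the commands above.

```python
import shlex


def pd_server_launch(mode: str, gpu_index: int, port: int,
                     model: str = "Qwen/Qwen3-0.6B",
                     bootstrap_port: int = 12335) -> tuple[dict[str, str], list[str]]:
    """Return (extra environment variables, argv) for one P/D server."""
    if mode not in ("prefill", "decode"):
        raise ValueError("mode must be 'prefill' or 'decode'")
    env = {
        "ZE_AFFINITY_MASK": str(gpu_index),  # GPU index, as in the commands above
        "UCX_POSIX_USE_PROC_LINK": "n",      # required on Intel XPU
    }
    argv = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model, "--trust-remote-code", "--device", "xpu",
        "--disaggregation-mode", mode,
        "--disaggregation-transfer-backend", "nixl",
        "--disaggregation-bootstrap-port", str(bootstrap_port),
        "--host", "0.0.0.0", "--port", str(port),
    ]
    return env, argv


if __name__ == "__main__":
    for mode, gpu, port in (("prefill", 0, 30000), ("decode", 1, 30001)):
        env, argv = pd_server_launch(mode, gpu, port)
        prefix = " ".join(f"{key}={value}" for key, value in env.items())
        print(prefix, shlex.join(argv))
```

To start the processes from Python instead of printing them, pass `argv` to `subprocess.Popen` with `env={**os.environ, **env}`.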

> **Note:** `UCX_POSIX_USE_PROC_LINK=n` is required on Intel XPU to avoid UCX shared-memory transport issues.