This document describes how to run SGLang on AMD GPUs. If you encounter issues or have questions, please open an issue.

System Configuration

When using AMD GPUs (such as the MI300X), certain system-level optimizations help ensure stable performance. Here we take the MI300X as an example. AMD provides official documentation for MI300X optimization and system tuning. NOTE: We strongly recommend reading these docs and guides in their entirety to fully utilize your system. Below are a few key settings to confirm or enable for SGLang:

Update GRUB Settings

In /etc/default/grub, append the following to GRUB_CMDLINE_LINUX:
GRUB Configuration
pci=realloc=off iommu=pt
Afterward, run sudo update-grub (or your distro’s equivalent) and reboot.
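If you are on Ubuntu or Debian, applying the change looks like the following (a sketch; other distros use grub2-mkconfig or an equivalent instead of update-grub):
Command
# Regenerate the GRUB configuration and reboot so the new kernel parameters take effect
sudo update-grub
sudo reboot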

Disable NUMA Auto-Balancing

Disable NUMA
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
You can automate or verify this change using this helpful script. Again, please go through the entire documentation to confirm your system is using the recommended configuration.
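To verify the change, read the value back; 0 means NUMA auto-balancing is disabled. Note that this setting does not persist across reboots unless you also add it to a startup script or sysctl configuration.
Command
# Should print 0 after the change above
cat /proc/sys/kernel/numa_balancing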

Install SGLang

You can install SGLang using one of the methods below.

Install from Source

Command
# Use the last release branch
git clone -b v0.5.9 https://github.com/sgl-project/sglang.git
cd sglang

# Compile sgl-kernel
pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install

# Install sglang python package along with diffusion support
cd ..
rm -rf python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml
pip install -e "python[all_hip]"
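As a quick sanity check after installing from source, you can import the package and confirm that PyTorch sees your ROCm GPUs (a minimal sketch; the exact output depends on your environment):
Command
python3 -c "import sglang; print(sglang.__version__)"
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"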

Install Using Docker

Pre-built Docker images are available on Docker Hub at lmsysorg/sglang, built from rocm.Dockerfile. The steps below show how to build and use an image.
  1. Build the docker image. If you use a pre-built image, you can skip this step and replace sglang_image with the pre-built image name in the steps below.
    Command
    docker build -t sglang_image -f rocm.Dockerfile .
    
  2. Create a convenient alias.
    Command
    alias drun='docker run -it --rm --network=host --privileged --device=/dev/kfd --device=/dev/dri \
        --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \
        --security-opt seccomp=unconfined \
        -v $HOME/dockerx:/dockerx \
        -v /data:/data'
    
    If you are using RDMA, please note that:
    • --network host and --privileged are required by RDMA. If you don’t need RDMA, you can remove them.
    • You may need to set NCCL_IB_GID_INDEX if you are using RoCE, for example: export NCCL_IB_GID_INDEX=3.
  3. Launch the server. NOTE: Replace <secret> below with your Hugging Face Hub token.
    Command
    drun -p 30000:30000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=<secret>" \
        sglang_image \
        python3 -m sglang.launch_server \
        --model-path NousResearch/Meta-Llama-3.1-8B \
        --host 0.0.0.0 \
        --port 30000
    
  4. To verify that the server is working, you can run a benchmark in another terminal, or refer to other docs to send requests to the engine (a sample request is sketched after this list).
    Command
    drun sglang_image \
        python3 -m sglang.bench_serving \
        --backend sglang \
        --dataset-name random \
        --num-prompts 4000 \
        --random-input 128 \
        --random-output 128
    
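Alternatively, you can send a single request to the server's native /generate endpoint to confirm it responds (a sketch, assuming the server from step 3 is still listening on port 30000; the prompt and sampling parameters are placeholders):
Command
curl -s http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 32, "temperature": 0}}'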
With your AMD system properly configured and SGLang installed, you can now fully leverage AMD hardware to power SGLang’s machine learning capabilities.

Quantization on AMD GPUs

The Quantization documentation has a full compatibility matrix. The short version: FP8, AWQ, MXFP4, W8A8, GPTQ, compressed-tensors, Quark, and petit_nvfp4 (NVFP4 on ROCm via Petit) all work on AMD. Methods that depend on Marlin or NVIDIA-specific kernels (awq_marlin, gptq_marlin, gguf, modelopt_fp8, modelopt_fp4) do not. A few things to keep in mind:
  • FP8 works via Aiter or Triton. Pre-quantized FP8 models like DeepSeek-V3/R1 work out of the box.
  • AWQ uses Triton dequantization kernels on AMD. The faster Marlin path is not available.
  • MXFP4 requires CDNA3/CDNA4 and SGLANG_USE_AITER=1.
  • petit_nvfp4 enables NVFP4 models (e.g., Llama 3.3 70B FP4) on MI250/MI300X via Petit. Install with pip install petit-kernel; no --quantization flag needed when loading pre-quantized NVFP4 models.
  • quark_int4fp8_moe is an AMD-only online quantization method for MoE models on CDNA3/CDNA4.
Several of these backends are accelerated by Aiter. Enable it with:
Command
export SGLANG_USE_AITER=1
Example — serving an AWQ model:
Command
python3 -m sglang.launch_server \
    --model-path hugging-quants/Mixtral-8x7B-Instruct-v0.1-AWQ-INT4 \
    --trust-remote-code \
    --port 30000 --host 0.0.0.0
Example — FP8 online quantization:
Command
python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantization fp8 \
    --port 30000 --host 0.0.0.0
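Example — serving a pre-quantized NVFP4 checkpoint via Petit (a sketch; the model path below is a placeholder for whichever NVFP4 model you use):
Command
pip install petit-kernel
python3 -m sglang.launch_server \
    --model-path <nvfp4-model-path> \
    --port 30000 --host 0.0.0.0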

Examples

Running DeepSeek-V3

The only difference when running DeepSeek-V3 is in how you start the server (note the --model-path and --tp values below). Here’s an example command:
Command
drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    --env "HF_TOKEN=<secret>" \
    sglang_image \
    python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --tp 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000
The guide on running DeepSeek-R1 on a single NDv5 MI300X VM could also be a good reference.

Running Llama3.1

Running Llama3.1 is nearly identical to running DeepSeek-V3. The only difference is the model passed to --model-path, as shown in the following example command:
Command
drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    --env "HF_TOKEN=<secret>" \
    sglang_image \
    python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tp 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000

Warmup Step

When the server displays "The server is fired up and ready to roll!", the startup has succeeded.
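Once that message appears, you can optionally confirm the engine responds end to end via its health endpoints (a sketch, assuming the server is listening on port 30000):
Command
# Basic liveness check
curl -s http://localhost:30000/health
# Runs a tiny generation through the engine
curl -s http://localhost:30000/health_generate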