System Configuration
When using AMD GPUs (such as MI300X), certain system-level optimizations help ensure stable performance. Here we take MI300X as an example. AMD provides official documentation for MI300X optimization and system tuning:

- AMD MI300X Tuning Guides
- LLM inference performance validation on AMD Instinct MI300X
- AMD Instinct MI300X System Optimization
- AMD Instinct MI300X Workload Optimization
- Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X
Update GRUB Settings
In /etc/default/grub, append the following to GRUB_CMDLINE_LINUX:
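As a sketch, the resulting line might look like the following; the exact kernel parameters come from AMD's system optimization guide, so verify them against the current revision before applying:

```shell
# Kernel parameters commonly recommended for MI300X (append to any values
# already present; verify against AMD's current system optimization guide)
GRUB_CMDLINE_LINUX="pci=realloc=off iommu=pt"
```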
Then run sudo update-grub (or your distro’s equivalent) and reboot.
Disable NUMA Auto-Balancing
Disable NUMA auto-balancing with the following command:
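A minimal sketch of the runtime toggle (note this setting does not persist across reboots):

```shell
# Turn off automatic NUMA balancing; resets to the default after a reboot
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
```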
Install SGLang
You can install SGLang using one of the methods below.

Install from Source
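A sketch of a source install, assuming the ROCm extras target is named all_hip (check the SGLang installation guide for the exact extras name in your release):

```shell
# Build and install SGLang from source with ROCm (HIP) support
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all_hip]"
```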
Install Using Docker (Recommended)
The docker images are available on Docker Hub at lmsysorg/sglang, built from rocm.Dockerfile. The steps below show how to build and use an image.
- Build the docker image. If you use pre-built images, you can skip this step and replace sglang_image with the pre-built image names in the steps below.
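A sketch of the build step, assuming rocm.Dockerfile sits at the repo root (the Dockerfile path and name may differ between releases):

```shell
# Build the ROCm image from the SGLang repo root; "sglang_image" is an
# arbitrary local tag
docker build -t sglang_image -f rocm.Dockerfile .
```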
- Create a convenient alias. If you are using RDMA, please note that:
  - --network host and --privileged are required by RDMA. If you don’t need RDMA, you can remove them.
  - You may need to set NCCL_IB_GID_INDEX if you are using RoCE, for example: export NCCL_IB_GID_INDEX=3.
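A sketch of such an alias; the name drun is hypothetical, and the device and volume mounts follow common ROCm container practice, so adjust them to your system:

```shell
# Hypothetical alias for running the ROCm container; drop --network host
# and --privileged if you do not need RDMA
alias drun='docker run -it --rm \
  --network host --privileged \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --ipc=host --shm-size 16G \
  --security-opt seccomp=unconfined \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface'
```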
- Launch the server. NOTE: Replace <secret> below with your huggingface hub token.
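A sketch of the launch, assuming an alias named drun from the previous step, an image tagged sglang_image, and an illustrative model choice:

```shell
# Replace <secret> with your Hugging Face hub token
drun -e HF_TOKEN=<secret> sglang_image \
  python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 30000
```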
- To verify the setup, you can run a benchmark in another terminal or refer to other docs to send requests to the engine.
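For example, a minimal request to the server's generate endpoint (the host and port assume the defaults used above):

```shell
# Send a single generation request to the running server
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16}}'
```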
Quantization on AMD GPUs
The Quantization documentation has a full compatibility matrix. The short version: FP8, AWQ, MXFP4, W8A8, GPTQ, compressed-tensors, Quark, and petit_nvfp4 (NVFP4 on ROCm via Petit) all work on AMD. Methods that depend on Marlin or NVIDIA-specific kernels (awq_marlin, gptq_marlin, gguf, modelopt_fp8, modelopt_fp4) do not.
A few things to keep in mind:
- FP8 works via Aiter or Triton. Pre-quantized FP8 models like DeepSeek-V3/R1 work out of the box.
- AWQ uses Triton dequantization kernels on AMD. The faster Marlin path is not available.
- MXFP4 requires CDNA3/CDNA4 and SGLANG_USE_AITER=1.
- petit_nvfp4 enables NVFP4 models (e.g., Llama 3.3 70B FP4) on MI250/MI300X via Petit. Install with pip install petit-kernel; no --quantization flag is needed when loading pre-quantized NVFP4 models.
- quark_int4fp8_moe is an AMD-only online quantization method for MoE models on CDNA3/CDNA4.
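As illustrative sketches only (the angle-bracket model paths and TP sizes are placeholders, not tested configurations; the method names come from the list above):

```shell
# FP8: pre-quantized checkpoints load without extra quantization flags
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code

# NVFP4 via Petit: install the kernel first (pip install petit-kernel);
# no --quantization flag is needed for pre-quantized NVFP4 checkpoints
python3 -m sglang.launch_server --model-path <nvfp4-model-path> --tp 2

# quark_int4fp8_moe: AMD-only online quantization for MoE models
python3 -m sglang.launch_server --model-path <moe-model-path> --quantization quark_int4fp8_moe --tp 2
```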
Examples
Running DeepSeek-V3
The only difference when running DeepSeek-V3 is in how you start the server. Here’s an example command:
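A sketch, assuming an 8-GPU MI300X node (the flags follow the standard sglang.launch_server CLI):

```shell
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
```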
Running Llama3.1
Running Llama3.1 is nearly identical to running DeepSeek-V3. The only difference is the model specified when starting the server, as shown in the following example command:
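A sketch with an illustrative checkpoint (any Llama 3.1 model you have access to works the same way):

```shell
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
```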
Warmup Step
When the server displays "The server is fired up and ready to roll!", it means the startup is successful.