Optimized Model List
A number of popular LLMs are optimized to run efficiently on CPU, including notable open-source models such as the Llama series, the Qwen series, and the DeepSeek series (for example, DeepSeek-R1 and DeepSeek-V3.1-Terminus).
| Model Name | BF16 | W8A8_INT8 | FP8 |
|---|---|---|---|
| DeepSeek-R1 | | meituan/DeepSeek-R1-Channel-INT8 | deepseek-ai/DeepSeek-R1 |
| DeepSeek-V3.1-Terminus | | IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8 | deepseek-ai/DeepSeek-V3.1-Terminus |
| Llama-3.2-3B | meta-llama/Llama-3.2-3B-Instruct | RedHatAI/Llama-3.2-3B-quantized.w8a8 | |
| Llama-3.1-8B | meta-llama/Llama-3.1-8B-Instruct | RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 | |
| QwQ-32B | | RedHatAI/QwQ-32B-quantized.w8a8 | |
| DeepSeek-Distilled-Llama | | RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8 | |
| Qwen3-235B | | | Qwen/Qwen3-235B-A22B-FP8 |
Installation
Install Using Docker
It is recommended to use Docker for setting up the SGLang environment. A Dockerfile is provided to facilitate the installation. Replace `<secret>` below with your HuggingFace access token.
Command
Install From Source
If you prefer to install SGLang in a bare-metal environment, follow the steps below. First install any required packages and libraries that are not already present on your system; you can refer to the Ubuntu-based installation commands in the Dockerfile for guidance.
- Install the `uv` package manager, then create and activate a virtual environment:
Command
- Create a config file to direct the installation channel (a.k.a. index-url) of `torch`-related packages:
Command
In `vim`, paste the following content into the created file.
Save the file (in `vim`, press Esc to exit insert mode, then type `:x` and press Enter), and set it as the default `uv` config.
Command
- Clone the `sglang` source code and build the packages:
Command
- Set the required environment variables
Command
- Note that the environment variable `SGLANG_USE_CPU_ENGINE=1` is required to enable the SGLang service with the CPU engine.
- If you encounter code compilation issues during the `sgl-kernel` building process, please check your `gcc` and `g++` versions and upgrade them if they are outdated. It is recommended to use `gcc-13` and `g++-13` as they have been verified in the official Docker container.
- The system library path is typically located in one of the following directories: `~/.local/lib/`, `/usr/local/lib/`, `/usr/local/lib64/`, `/usr/lib/`, `/usr/lib64/` and `/usr/lib/x86_64-linux-gnu/`. In the above example commands, `/usr/lib/x86_64-linux-gnu` is used. Please adjust the path according to your server configuration.
- It is recommended to add the following to your `~/.bashrc` file to avoid setting these variables every time you open a new terminal:
Command
Launch of the Serving Engine
Example command to launch SGLang serving:
Launch Server
- For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`.
- The flag `--tp 6` specifies that tensor parallelism will be applied using 6 ranks (TP6). The TP value is the number of TP ranks used during execution. On a CPU platform, a TP rank corresponds to a sub-NUMA cluster (SNC). The number of available SNCs can usually be obtained from the operating system, e.g. with the `lscpu` command. If the specified number of TP ranks differs from the total SNC count, the system automatically uses the first `n` SNCs. Note that `n` cannot exceed the total SNC count; doing so results in an error.
  `SGLANG_CPU_OMP_THREADS_BIND` allows explicit control of the CPU cores used by each tensor parallel (TP) rank.
  Example 1: To run the SGLang service with TP=6, using the first 40 cores of each SNC on a Xeon® 6980P server (which has 43-43-42 cores on the 3 SNCs of a socket), we should set:
  Command
  This configuration is equivalent to:
  - rank 0: `numactl -C 0-39 -m 0`
  - rank 1: `numactl -C 43-82 -m 1`
  - rank 2: `numactl -C 86-125 -m 2`
  - rank 3: `numactl -C 128-167 -m 3`
  - rank 4: `numactl -C 171-210 -m 4`
  - rank 5: `numactl -C 214-253 -m 5`
  Example 2: A setting with two ranks, each spanning the three SNCs of one socket:
  Command
  This configuration is equivalent to:
  - rank 0: `numactl -C 0-95 -m 0-2`
  - rank 1: `numactl -C 96-191 -m 3-5`
  You may need to set `--max-total-tokens` to avoid the out-of-memory error.
- To optimize decoding with `torch.compile`, add the flag `--enable-torch-compile`. To specify the maximum batch size when using `torch.compile`, set the flag `--torch-compile-max-bs`. For example, `--enable-torch-compile --torch-compile-max-bs 4` enables `torch.compile` with a maximum batch size of 4.
- A warmup step is automatically triggered when the service is started.
The server is ready when you see the log `The server is fired up and ready to roll!`.
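The per-rank core ranges in the first `numactl` listing above can be reproduced with a little arithmetic. The following is a minimal sketch, not SGLang's own logic: it parses `lscpu`-style NUMA lines, rejects a TP value larger than the SNC count, and assigns each rank the first 40 cores of one SNC. The sample `lscpu` text, the function names, and the choice to ignore hyper-threading sibling ranges are assumptions made for illustration.

```python
import re

# Hypothetical lscpu excerpt mirroring the 43-43-42 layout described above
# (hyper-threading sibling ranges are omitted for simplicity).
SAMPLE_LSCPU = """\
NUMA node(s):          6
NUMA node0 CPU(s):     0-42
NUMA node1 CPU(s):     43-85
NUMA node2 CPU(s):     86-127
NUMA node3 CPU(s):     128-170
NUMA node4 CPU(s):     171-213
NUMA node5 CPU(s):     214-255
"""


def parse_numa_nodes(lscpu_text):
    """Map each NUMA node id to the (first, last) CPU of its first listed range."""
    nodes = {}
    pattern = r"NUMA node(\d+) CPU\(s\):\s*(\d+)-(\d+)"
    for match in re.finditer(pattern, lscpu_text):
        node, first, last = (int(group) for group in match.groups())
        nodes[node] = (first, last)
    return nodes


def plan_rank_binding(nodes, tp, cores_per_rank):
    """Give each TP rank the first `cores_per_rank` cores of one SNC."""
    if tp > len(nodes):
        raise ValueError(f"tp={tp} exceeds the {len(nodes)} available SNCs")
    node_ids = sorted(nodes)
    plan = []
    for rank in range(tp):  # only the first `tp` SNCs are used
        node = node_ids[rank]
        first, last = nodes[node]
        end = first + cores_per_rank - 1
        if end > last:
            raise ValueError(f"SNC {node} has fewer than {cores_per_rank} cores")
        plan.append((rank, f"{first}-{end}", node))
    return plan


nodes = parse_numa_nodes(SAMPLE_LSCPU)
for rank, cores, mem_node in plan_rank_binding(nodes, tp=6, cores_per_rank=40):
    print(f"rank {rank}: numactl -C {cores} -m {mem_node}")
```

On a real machine, the sample text could be replaced with the output of `lscpu`; the exact string format expected by `SGLANG_CPU_OMP_THREADS_BIND` is not covered by this sketch.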
Benchmarking with Requests
You can benchmark the performance via the `bench_serving` script.
Run the command in another terminal. An example command would be:
Run Benchmark
Benchmark Help
You can also send requests from the command line (e.g. using `curl`) or through your own scripts.
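As an illustration of the "own script" route, here is a minimal sketch that builds a completion request using only the Python standard library. The address `localhost:30000`, the OpenAI-compatible `/v1/completions` path, and the `"default"` model name are assumptions, not values taken from this page; adjust them to match how your server was launched. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumptions (not stated in this section): the server listens on
# localhost:30000 and exposes an OpenAI-compatible /v1/completions route.
BASE_URL = "http://localhost:30000"


def build_completion_request(prompt, model="default", max_tokens=32, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for a text completion."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request to a running server and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request needs no server; call send(req) once the server is ready.
req = build_completion_request("The capital of France is")
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the payload before the server is up.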
Example Usage Commands
Large Language Models can range from fewer than 1 billion to several hundred billion parameters. Dense models larger than 20B are expected to run on flagship 6th Gen Intel® Xeon® processors with dual sockets and a total of 6 sub-NUMA clusters. Dense models of approximately 10B parameters or fewer, or MoE (Mixture of Experts) models with fewer than 10B activated parameters, can run on more common 4th generation or newer Intel® Xeon® processors, or utilize a single socket of the flagship 6th Gen Intel® Xeon® processors.
Example: Running DeepSeek-V3.1-Terminus
An example command to launch the W8A8_INT8 DeepSeek-V3.1-Terminus service on a Xeon® 6980P server:
W8A8_INT8
FP8
Set `--torch-compile-max-bs` to the maximum desired batch size for your deployment, which can be up to 16. The value 4 in the examples is illustrative.
Example: Running Llama-3.2-3B
An example command to launch the Llama-3.2-3B service with BF16 precision:
BF16
W8A8_INT8
The `--torch-compile-max-bs` and `--tp` settings are examples that should be adjusted for your setup.
For instance, use `--tp 3` to utilize 1 socket with 3 sub-NUMA clusters on an Intel® Xeon® 6980P server.
Once the server has been launched, you can test it using the `bench_serving` command or create
your own commands or scripts following the benchmarking example.
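To summarize how the flags discussed on this page combine, the sketch below assembles only the flags named here (`--tp`, `--quantization w8a8_int8`, `--enable-torch-compile`, `--torch-compile-max-bs`, `--max-total-tokens`). The helper `build_serving_flags` is hypothetical and deliberately leaves out the launch entry point and model arguments; the upper bound of 16 follows the note above about `--torch-compile-max-bs`.

```python
def build_serving_flags(tp, w8a8_int8=False, torch_compile_max_bs=None,
                        max_total_tokens=None):
    """Assemble the serving flags discussed on this page as a list of strings."""
    if tp < 1:
        raise ValueError("tp must be at least 1")
    flags = ["--tp", str(tp)]
    if w8a8_int8:
        # Needed when serving W8A8 quantized checkpoints.
        flags += ["--quantization", "w8a8_int8"]
    if torch_compile_max_bs is not None:
        # The notes above state the maximum batch size can be up to 16.
        if not 1 <= torch_compile_max_bs <= 16:
            raise ValueError("torch_compile_max_bs must be between 1 and 16")
        flags += ["--enable-torch-compile",
                  "--torch-compile-max-bs", str(torch_compile_max_bs)]
    if max_total_tokens is not None:
        flags += ["--max-total-tokens", str(max_total_tokens)]
    return flags


# One socket with 3 SNCs, W8A8 model, torch.compile with batch size up to 4.
print(" ".join(build_serving_flags(tp=3, w8a8_int8=True, torch_compile_max_bs=4)))
```

The resulting list can be appended to whatever launch command you use for your deployment.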