Installation - SGLang Documentation

You can install SGLang using one of the methods below. This page primarily applies to common NVIDIA GPU platforms. For other or newer platforms, please refer to the dedicated pages for AMD GPUs, Apple Metal, Intel Xeon CPUs, Google TPU, NVIDIA DGX Spark, NVIDIA Jetson, Ascend NPUs, and Intel XPU.

Prerequisites: Python 3.10 or higher.

Method 1: With pip or uv

It is recommended to use uv for faster installation:

Command

pip install --upgrade pip
pip install uv
uv pip install --prerelease=allow sglang

The major version of Cuda is 13 by default. To install sglang under Cuda 12 with pip or uv, please try the following commands:

Command

pip install --upgrade pip
pip install uv
uv pip install --prerelease=allow sglang
uv pip install --force-reinstall  torch==2.11.0 torchaudio==2.11.0 torchvision --index-url https://download.pytorch.org/whl/cu129
uv pip install --force-reinstall sglang-kernel --index-url https://docs.sglang.ai/whl/cu129/
uv pip install --force-reinstall sgl-deep-gemm --index-url https://docs.sglang.ai/whl/cu129/ --no-deps

Nightly builds

To pick up the latest features and fixes before the next stable release, install a nightly build. Nightly wheels are built from the latest main and published to the SGLang wheel index. Add that index with --extra-index-url, and combine --prerelease=allow with --index-strategy unsafe-best-match so uv considers the nightly (pre-release) version alongside PyPI:

Command

pip install --upgrade pip
pip install uv
uv pip install --prerelease=allow --index-strategy unsafe-best-match --extra-index-url https://docs.sglang.ai/whl/cu130/ sglang

To install a nightly build under Cuda 12, swap the index to cu129:

Command

pip install --upgrade pip
pip install uv
uv pip install --prerelease=allow --index-strategy unsafe-best-match --extra-index-url https://docs.sglang.ai/whl/cu129/ sglang

Quick fixes to common problems

If you encounter OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root with either of the following solutions:
1. Use export CUDA_HOME=/usr/local/cuda-<your-cuda-version> to set the CUDA_HOME environment variable.
2. Install FlashInfer first following FlashInfer installation doc, then install SGLang as described above.

Method 2: From source

Command

# Use the last release branch
git clone -b v0.5.15 https://github.com/sgl-project/sglang.git
cd sglang

# Install the python packages
pip install --upgrade pip
pip install -e "python"

Quick fixes to common problems

If you want to develop SGLang, you can try the dev docker image. Please refer to setup docker container. The docker image is lmsysorg/sglang:dev.

Method 3: Using docker

The docker images are available on Docker Hub at lmsysorg/sglang, built from Dockerfile. Replace <secret> below with your huggingface hub token.

latest and dev are mutable tags: latest always points at the newest stable release, while dev is rebuilt daily from the latest main and includes build/development tools. Because they are overwritten over time, pin an immutable version tag for reproducible deployments — e.g. lmsysorg/sglang:v0.5.15. Browse all released versions on Docker Hub.

Command

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000

For production deployments, use the runtime variant which is significantly smaller (~40% reduction) by excluding build tools and development dependencies:

Command

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest-runtime \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000

You can also find the nightly docker images here. Notes:

SGLang is shipped with CUDA 13 environment by default. To run SGLang on CUDA 12 environment, please use images with -cu12 or -cu129 suffix, such as lmsysorg/sglang:latest-cu129 or lmsysorg/sglang:dev-cu12.

Method 4: Using Kubernetes

Please check out OME, a Kubernetes operator for enterprise-grade management and serving of large language models (LLMs).

Option 1: For single node serving (typically when the model size fits into GPUs on one node) Execute command kubectl apply -f docker/k8s-sglang-service.yaml, to create k8s deployment and service, with llama-31-8b as example.
Option 2: For multi-node serving (usually when a large model requires more than one GPU node, such as DeepSeek-R1) Modify the LLM model path and arguments as necessary, then execute command kubectl apply -f docker/k8s-sglang-distributed-sts.yaml, to create two nodes k8s statefulset and serving service.

Method 5: Using docker compose

This method is recommended if you plan to serve it as a service. A better approach is to use the k8s-sglang-service.yaml.

Copy the compose.yml to your local machine
Execute the command docker compose up -d in your terminal.

Method 6: Run on Kubernetes or Clouds with SkyPilot

To deploy on Kubernetes or 12+ clouds, you can use SkyPilot.

Install SkyPilot and set up Kubernetes cluster or cloud access: see SkyPilot’s documentation.
Deploy on your own infra with a single command and get the HTTP API endpoint:

SkyPilot YAML: sglang.yaml

Config

# sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000

Command

# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml

# Get the HTTP API endpoint
sky status --endpoint 30000 sglang

To further scale up your deployment with autoscaling and failure recovery, check out the SkyServe + SGLang guide.

Method 7: Run on AWS SageMaker

To deploy on SGLang on AWS SageMaker, check out AWS SageMaker InferenceAmazon Web Services provide supports for SGLang containers along with routine security patching. For available SGLang containers, check out AWS SGLang DLCs.To deploy a pre-built SGLang Deep Learning Container without building your own image, see Amazon SageMaker AI.To host a model with your own container, follow the following steps:

Build a docker container with sagemaker.Dockerfile alongside the serve script.
Push your container onto AWS ECR.

Dockerfile Build Script: build-and-push.sh

Command

#!/bin/bash
AWS_ACCOUNT="<YOUR_AWS_ACCOUNT>"
AWS_REGION="<YOUR_AWS_REGION>"
REPOSITORY_NAME="<YOUR_REPOSITORY_NAME>"
IMAGE_TAG="<YOUR_IMAGE_TAG>"

ECR_REGISTRY="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com"
IMAGE_URI="${ECR_REGISTRY}/${REPOSITORY_NAME}:${IMAGE_TAG}"

echo "Starting build and push process..."

# Login to ECR
echo "Logging into ECR..."
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${ECR_REGISTRY}

# Build the image
echo "Building Docker image..."
docker build -t ${IMAGE_URI} -f sagemaker.Dockerfile .

echo "Pushing ${IMAGE_URI}"
docker push ${IMAGE_URI}

echo "Build and push completed successfully!"

Deploy a model for serving on AWS Sagemaker, refer to deploy_and_serve_endpoint.py. For more information, check out sagemaker-python-sdk.
1. By default, the model server on SageMaker will run with the following command: python3 -m sglang.launch_server --model-path opt/ml/model --host 0.0.0.0 --port 8080. This is optimal for hosting your own model with SageMaker.
2. To modify your model serving parameters, the serve script allows for all available options within python3 -m sglang.launch_server --help cli by specifying environment variables with prefix SM_SGLANG_.
3. The serve script will automatically convert all environment variables with prefix SM_SGLANG_ from SM_SGLANG_INPUT_ARGUMENT into --input-argument to be parsed into python3 -m sglang.launch_server cli.
4. For example, to run Qwen/Qwen3-0.6B with reasoning parser, simply add additional environment variables SM_SGLANG_MODEL_PATH=Qwen/Qwen3-0.6B and SM_SGLANG_REASONING_PARSER=qwen3.

Common Notes

FlashInfer is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding --attention-backend triton --sampling-backend pytorch and open an issue on GitHub.
To reinstall flashinfer locally, use the following command: pip3 install --upgrade flashinfer-python --force-reinstall --no-deps and then delete the cache with rm -rf ~/.cache/flashinfer.

​Method 1: With pip or uv

​Nightly builds

​Quick fixes to common problems

​Method 2: From source

​Method 3: Using docker

​Method 4: Using Kubernetes

​Method 5: Using docker compose

​Method 6: Run on Kubernetes or Clouds with SkyPilot

​Method 7: Run on AWS SageMaker

​Common Notes

Method 1: With pip or uv

Nightly builds

Quick fixes to common problems

Method 2: From source

Method 3: Using docker

Method 4: Using Kubernetes

Method 5: Using docker compose

Method 6: Run on Kubernetes or Clouds with SkyPilot

Method 7: Run on AWS SageMaker

Common Notes