Deploy SGLang on Amazon SageMaker AI endpoints using the AWS Deep Learning Container (DLC) for SGLang. The SageMaker image variant accepts model configuration via environment variables and serves on port 8080. This guide uses the pre-built DLC image. To build and deploy your own container instead, see Method 7: Run on AWS SageMaker in the installation guide.

Container image

AWS publishes pre-built, security-patched SGLang DLCs. The SageMaker GPU image is available from the Amazon ECR registry (account 763104351884) in each supported region. For example, in us-west-2:
763104351884.dkr.ecr.us-west-2.amazonaws.com/sglang:server-sagemaker-cuda-v1.0
For the full list of image tags, see the Available DLC Images reference, and for region-specific account IDs and supported regions, see Region Availability.
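To target another region, the image URI can be assembled from its parts. The following is a minimal sketch, not an official helper: the function name sglang_image_uri is hypothetical, the default tag and account ID are copied from the us-west-2 example above, and it assumes the standard amazonaws.com domain. Some regions use a different account ID, so check Region Availability before relying on the defaults.

```python
def sglang_image_uri(
    region: str,
    tag: str = "server-sagemaker-cuda-v1.0",
    account_id: str = "763104351884",
) -> str:
    """Compose the ECR image URI for the SGLang DLC in the given region.

    Hypothetical helper: the defaults mirror the us-west-2 example and
    may not hold for every region.
    """
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/sglang:{tag}"


print(sglang_image_uri("us-west-2"))
# 763104351884.dkr.ecr.us-west-2.amazonaws.com/sglang:server-sagemaker-cuda-v1.0
```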

Specifying the model

The SageMaker image resolves the model in this order:
  1. SM_SGLANG_MODEL_PATH environment variable — explicit Hugging Face ID or path.
  2. /opt/ml/model — when SageMaker mounts model artifacts via ModelDataUrl or ModelDataSource, the entrypoint uses this path by default.
For gated models, also pass HF_TOKEN. Any SM_SGLANG_* environment variable is converted to an SGLang server argument: the SM_SGLANG_ prefix is dropped, the remainder is lowercased, and underscores become hyphens (for example, SM_SGLANG_CONTEXT_LENGTH=4096 becomes --context-length 4096).
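The resolution order and the environment-variable mapping can be illustrated with a short sketch. This is not the container's actual entrypoint code: the function names are hypothetical, and the naming rule (strip the prefix, lowercase, replace underscores with hyphens) is inferred from the SM_SGLANG_CONTEXT_LENGTH example. How the real entrypoint handles boolean flags is not covered here.

```python
from typing import List, Mapping

PREFIX = "SM_SGLANG_"
DEFAULT_MODEL_DIR = "/opt/ml/model"  # where SageMaker mounts model artifacts


def resolve_model_path(env: Mapping[str, str]) -> str:
    """Prefer SM_SGLANG_MODEL_PATH, otherwise fall back to /opt/ml/model."""
    return env.get(PREFIX + "MODEL_PATH") or DEFAULT_MODEL_DIR


def env_to_server_args(env: Mapping[str, str]) -> List[str]:
    """Turn SM_SGLANG_* variables into --flag value pairs (assumed rule)."""
    args: List[str] = []
    for name in sorted(env):
        if not name.startswith(PREFIX):
            continue  # unrelated variables such as HF_TOKEN are not mapped
        flag = "--" + name[len(PREFIX):].lower().replace("_", "-")
        args += [flag, env[name]]
    return args


example_env = {"SM_SGLANG_CONTEXT_LENGTH": "4096", "HF_TOKEN": "<token>"}
print(resolve_model_path(example_env))   # /opt/ml/model
print(env_to_server_args(example_env))   # ['--context-length', '4096']
```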

Deploy with the SageMaker Python SDK

from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

model = Model(
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/sglang:server-sagemaker-cuda-v1.0",
    role="arn:aws:iam::<account_id>:role/<role_name>",
    predictor_cls=Predictor,
    env={"SM_SGLANG_MODEL_PATH": "openai/gpt-oss-20b"},
)

predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    inference_ami_version="al2023-ami-sagemaker-inference-gpu-4-1",
    serializer=JSONSerializer(),
)

response = predictor.predict({
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "What is deep learning?"}],
    "max_tokens": 256,
})
# No deserializer is set, so Predictor returns the raw JSON response as bytes.
print(response)

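# Cleanup: delete the model first, because delete_model() looks up
# the model name through the endpoint's configuration.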
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

Deploy with Boto3

import json
import boto3

sm = boto3.client("sagemaker")
smrt = boto3.client("sagemaker-runtime")

sm.create_model(
    ModelName="sglang-model",
    PrimaryContainer={
        "Image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/sglang:server-sagemaker-cuda-v1.0",
        "Environment": {"SM_SGLANG_MODEL_PATH": "openai/gpt-oss-20b"},
    },
    ExecutionRoleArn="arn:aws:iam::<account_id>:role/<role_name>",
)

sm.create_endpoint_config(
    EndpointConfigName="sglang-config",
    ProductionVariants=[{
        "VariantName": "default",
        "ModelName": "sglang-model",
        "InstanceType": "ml.g5.2xlarge",
        "InitialInstanceCount": 1,
        "InferenceAmiVersion": "al2023-ami-sagemaker-inference-gpu-4-1",
    }],
)

sm.create_endpoint(EndpointName="sglang-endpoint", EndpointConfigName="sglang-config")
# Block until the endpoint is InService; the waiter raises on failure or timeout.
sm.get_waiter("endpoint_in_service").wait(EndpointName="sglang-endpoint")

resp = smrt.invoke_endpoint(
    EndpointName="sglang-endpoint",
    ContentType="application/json",
    Body=json.dumps({
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "What is deep learning?"}],
        "max_tokens": 256,
    }),
)
print(json.loads(resp["Body"].read()))

# Cleanup
sm.delete_endpoint(EndpointName="sglang-endpoint")
sm.delete_endpoint_config(EndpointConfigName="sglang-config")
sm.delete_model(ModelName="sglang-model")
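The PrimaryContainer dictionary passed to create_model above can also carry HF_TOKEN for gated models, or ModelDataUrl when the weights come from S3. A minimal sketch of a helper that covers both model sources follows. The function name build_primary_container is hypothetical, and any S3 URL you pass is your own.

```python
from typing import Optional


def build_primary_container(
    image: str,
    model_id: Optional[str] = None,
    model_data_url: Optional[str] = None,
    hf_token: Optional[str] = None,
) -> dict:
    """Build a PrimaryContainer dict for sagemaker.create_model (hypothetical helper)."""
    if not model_id and not model_data_url:
        raise ValueError("Provide model_id, model_data_url, or both.")
    env = {}
    if model_id:
        # Explicit Hugging Face ID or path; takes precedence over /opt/ml/model.
        env["SM_SGLANG_MODEL_PATH"] = model_id
    if hf_token:
        # Needed only for gated models.
        env["HF_TOKEN"] = hf_token
    container = {"Image": image, "Environment": env}
    if model_data_url:
        # SageMaker mounts these artifacts at /opt/ml/model.
        container["ModelDataUrl"] = model_data_url
    return container
```

The result can be passed directly as PrimaryContainer=build_primary_container(...) in the create_model call.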

Model artifacts

When ModelDataUrl (or ModelDataSource) points to a tarball or S3 prefix, SageMaker mounts the contents at /opt/ml/model. The entrypoint defaults --model-path to that location, so SM_SGLANG_MODEL_PATH can be omitted:
model.tar.gz
├── config.json              # standard model files (Hugging Face layout)
├── tokenizer.json
└── *.safetensors
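One way to produce an archive with this layout is Python's tarfile module. The sketch below is illustrative only: package_model is a hypothetical helper, and it assumes the model files sit directly in model_dir so that they end up at the root of the archive rather than under a subdirectory.

```python
import tarfile
from pathlib import Path


def package_model(model_dir: str, output_path: str = "model.tar.gz") -> str:
    """Archive the contents of model_dir at the root of a gzip-compressed tarball.

    Write output_path outside model_dir so the archive does not include itself.
    """
    src = Path(model_dir)
    with tarfile.open(output_path, "w:gz") as tar:
        for item in sorted(src.iterdir()):
            # arcname drops the parent directory so files land at the archive root.
            tar.add(item, arcname=item.name)
    return output_path
```

After uploading the archive to S3, pass its URL as ModelDataUrl.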

Notes

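  • GPU deployments require an explicit inference AMI version (inference_ami_version in the SageMaker Python SDK, InferenceAmiVersion in Boto3), because the default SageMaker host AMI ships NVIDIA drivers that are incompatible with CUDA 13 images. See the ProductionVariant API reference for valid values.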
  • The endpoint exposes an OpenAI-compatible API, so the request body matches the SGLang server’s /v1/chat/completions schema.
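Because the request follows the chat completions schema, the request body and the reply can be handled with two small helpers. This is a sketch under the assumption that a non-streaming response follows the OpenAI chat completion format, with the text at choices[0].message.content. Both function names are hypothetical.

```python
import json


def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> bytes:
    """Serialize a single-turn chat completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")


def extract_reply(raw_body: bytes) -> str:
    """Return the assistant text from a non-streaming chat completion response."""
    payload = json.loads(raw_body)
    # Assumed OpenAI-style layout: the first choice holds the assistant message.
    return payload["choices"][0]["message"]["content"]
```

With Boto3, pass the first helper's result as Body to invoke_endpoint and call extract_reply(resp["Body"].read()) on the response.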