SGLang Frontend Language#

The SGLang frontend language lets you define prompts in a convenient, structured way.

Launch A Server#

Launch the server in your terminal and wait for it to initialize.

[1]:
from sglang import assistant_begin, assistant_end
from sglang import assistant, function, gen, system, user
from sglang import image
from sglang import RuntimeEndpoint
from sglang.lang.api import set_default_backend
from sglang.srt.utils import load_image
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import print_highlight, terminate_process, wait_for_server

server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --log-level warning"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
[2026-01-08 09:02:22] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-01-08 09:02:22] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-01-08 09:02:22] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2026-01-08 09:02:29] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-01-08 09:02:29] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-01-08 09:02:29] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2026-01-08 09:02:32] INFO server_args.py:1615: Attention backend not specified. Use fa3 backend by default.
[2026-01-08 09:02:32] INFO server_args.py:2512: Set soft_watchdog_timeout since in CI
[2026-01-08 09:02:38] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-01-08 09:02:38] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-01-08 09:02:38] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2026-01-08 09:02:39] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-01-08 09:02:39] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-01-08 09:02:39] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-01-08 09:02:44] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  1.53it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.41it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.42it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.48it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.46it/s]

Capturing batches (bs=1 avail_mem=62.73 GB): 100%|██████████| 3/3 [00:00<00:00,  9.47it/s]


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server's log level to warning; the default log level is info.
We are running these notebooks in a CI environment, so the throughput is not representative of actual performance.
Server started on http://localhost:30309

Set the default backend. Note: besides the local server, you may also use OpenAI or other API endpoints.

[2]:
set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))
[2026-01-08 09:02:55] Endpoint '/get_model_info' is deprecated and will be removed in a future version. Please use '/model_info' instead.
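
As a reference for the OpenAI alternative mentioned above, here is a minimal sketch of switching backends, assuming the OpenAI backend class is exported by your SGLang version and that OPENAI_API_KEY is set; the model name is only an example:

# Sketch: point the frontend language at an OpenAI API endpoint instead of a local server.
# Assumes OPENAI_API_KEY is set in the environment; "gpt-4o-mini" is an example model name.
from sglang import OpenAI

set_default_backend(OpenAI("gpt-4o-mini"))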

Basic Usage#

The simplest way to use the SGLang frontend language is a question-answer dialog between a user and an assistant.

[3]:
@function
def basic_qa(s, question):
    s += system(f"You are a helpful assistant than can answer questions.")
    s += user(question)
    s += assistant(gen("answer", max_tokens=512))
[4]:
state = basic_qa("List 3 countries and their capitals.")
print_highlight(state["answer"])
Sure! Here are three countries along with their capitals:

1. **France** - Paris
2. **Brazil** - Brasília
3. **Australia** - Canberra

Multi-turn Dialog#

The SGLang frontend language can also be used to define multi-turn dialogs.

[5]:
@function
def multi_turn_qa(s):
    s += system(f"You are a helpful assistant than can answer questions.")
    s += user("Please give me a list of 3 countries and their capitals.")
    s += assistant(gen("first_answer", max_tokens=512))
    s += user("Please give me another list of 3 countries and their capitals.")
    s += assistant(gen("second_answer", max_tokens=512))
    return s


state = multi_turn_qa()
print_highlight(state["first_answer"])
print_highlight(state["second_answer"])
Sure! Here is a list of three countries along with their respective capitals:

1. **France** - Paris
2. **Australia** - Canberra
3. **Canada** - Ottawa
Certainly! Here is another list of three countries along with their respective capitals:

1. **Spain** - Madrid
2. **Japan** - Tokyo
3. **India** - New Delhi
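
To inspect the accumulated dialog rather than individual variables, the state object can be queried for its role-tagged messages; a sketch, assuming state.messages() is available in your SGLang version:

# Sketch: print the role-tagged messages accumulated in the state.
# Assumes the program state exposes a messages() helper; check your SGLang version.
for m in state.messages():
    print(m["role"], ":", m["content"])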

Control flow#

You may use any Python code within the function to define more complex control flows.

[6]:
@function
def tool_use(s, question):
    s += assistant(
        "To answer this question: "
        + question
        + ". I need to use a "
        + gen("tool", choices=["calculator", "search engine"])
        + ". "
    )

    if s["tool"] == "calculator":
        s += assistant("The math expression is: " + gen("expression"))
    elif s["tool"] == "search engine":
        s += assistant("The key word to search is: " + gen("word"))


state = tool_use("What is 2 * 2?")
print_highlight(state["tool"])
print_highlight(state["expression"])
calculator
2 * 2.

When multiplied, it equals 4.

You didn't need a calculator for this one, but the answer is 4.
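
Note that gen("expression") above has no stop string, so the model keeps writing after the expression itself. A minimal variant of that line (a sketch, not part of the original example) that cuts generation at the first newline:

# Sketch: add a stop string so generation ends right after the math expression.
s += assistant("The math expression is: " + gen("expression", stop="\n"))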

Parallelism#

Use fork to launch parallel prompts. Because sgl.gen is non-blocking, the for loop below issues two generation calls in parallel.

[7]:
@function
def tip_suggestion(s):
    s += assistant(
        "Here are two tips for staying healthy: "
        "1. Balanced Diet. 2. Regular Exercise.\n\n"
    )

    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += assistant(
            f"Now, expand tip {i+1} into a paragraph:\n"
            + gen("detailed_tip", max_tokens=256, stop="\n\n")
        )

    s += assistant("Tip 1:" + forks[0]["detailed_tip"] + "\n")
    s += assistant("Tip 2:" + forks[1]["detailed_tip"] + "\n")
    s += assistant(
        "To summarize the above two tips, I can say:\n" + gen("summary", max_tokens=512)
    )


state = tip_suggestion()
print_highlight(state["summary"])
1. **Balanced Diet**: Eating a varied and nutritious diet is crucial for maintaining good health. Focus on incorporating a wide range of fruits, vegetables, lean proteins, whole grains, and healthy fats. Also, pay attention to portion sizes and stay hydrated.

2. **Regular Exercise**: Engage in physical activity regularly to enhance your overall fitness. Start with simple exercises like walking or jogging and gradually increase the intensity. Mix up your routine to keep it interesting, and ensure you maintain a consistent schedule to see the best results.

By combining these two practices, you can significantly improve your health and well-being!

Constrained Decoding#

Use regex to specify a regular expression as a decoding constraint. This is only supported for local models.

[8]:
@function
def regular_expression_gen(s):
    s += user("What is the IP address of the Google DNS servers?")
    s += assistant(
        gen(
            "answer",
            temperature=0,
            regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
        )
    )


state = regular_expression_gen()
print_highlight(state["answer"])
208.67.222.222

Use regex to define a JSON decoding schema.

[9]:
character_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
    + r"""    "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
    + r"""    "wand": \{\n"""
    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
    + r"""        "core": "[\w\d\s]{1,16}",\n"""
    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r"""    \},\n"""
    + r"""    "alive": "(Alive|Deceased)",\n"""
    + r"""    "patronus": "[\w\d\s]{1,16}",\n"""
    + r"""    "bogart": "[\w\d\s]{1,16}"\n"""
    + r"""\}"""
)


@function
def character_gen(s, name):
    s += user(
        f"{name} is a character in Harry Potter. Please fill in the following information about this character."
    )
    s += assistant(gen("json_output", max_tokens=256, regex=character_regex))


state = character_gen("Harry Potter")
print_highlight(state["json_output"])
{
"name": "Harry Potter",
"house": "Gryffindor",
"blood status": "Half-blood",
"occupation": "student",
"wand": {
"wood": "Hawthorn",
"core": "Horsehair",
"length": 10.25
},
"alive": "Alive",
"patronus": "Stag",
"bogart": "Rat"
}
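
Because the regex forces well-formed JSON, the generated string can be parsed directly. A quick check (not part of the original notebook):

import json

# The regex constrains the output to valid JSON, so json.loads succeeds.
character = json.loads(state["json_output"])
print_highlight(character["house"])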

Batching#

Use run_batch to run a batch of prompts.

[10]:
@function
def text_qa(s, question):
    s += user(question)
    s += assistant(gen("answer", stop="\n"))


states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
        {"question": "What is the capital of Japan?"},
    ],
    progress_bar=True,
)

for i, state in enumerate(states):
    print_highlight(f"Answer {i+1}: {states[i]['answer']}")
100%|██████████| 3/3 [00:00<00:00, 34.95it/s]
Answer 1: The capital of the United Kingdom is London.
Answer 2: The capital of France is Paris.
Answer 3: The capital of Japan is Tokyo.

Streaming#

Use stream=True to stream the output to the user.

[11]:
@function
def text_qa(s, question):
    s += user(question)
    s += assistant(gen("answer", stop="\n"))


state = text_qa.run(
    question="What is the capital of France?", temperature=0.1, stream=True
)

for out in state.text_iter():
    print(out, end="", flush=True)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>

Complex Prompts#

You may use {system|user|assistant}_{begin|end} to define complex prompts.

[12]:
@function
def chat_example(s):
    s += system("You are a helpful assistant.")
    # Same as: s += s.system("You are a helpful assistant.")

    with s.user():
        s += "Question: What is the capital of France?"

    s += assistant_begin()
    s += "Answer: " + gen("answer", max_tokens=100, stop="\n")
    s += assistant_end()


state = chat_example()
print_highlight(state["answer"])
The capital of France is Paris.
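
The with s.user(): block above is shorthand for wrapping the text between user_begin() and user_end(). A sketch of the same cell written with the explicit begin/end helpers, assuming user_begin and user_end are exported the same way as assistant_begin and assistant_end:

from sglang import user_begin, user_end


@function
def chat_example_explicit(s):
    s += system("You are a helpful assistant.")
    # Explicit begin/end markers instead of the context-manager form.
    s += user_begin()
    s += "Question: What is the capital of France?"
    s += user_end()
    s += assistant_begin()
    s += "Answer: " + gen("answer", max_tokens=100, stop="\n")
    s += assistant_end()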
[13]:
terminate_process(server_process)

Multi-modal Generation#

You may use the SGLang frontend language to define multi-modal prompts. See here for supported models.

[14]:
server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0 --log-level warning"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
[2026-01-08 09:03:06] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-01-08 09:03:06] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-01-08 09:03:06] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2026-01-08 09:03:08] INFO server_args.py:1615: Attention backend not specified. Use flashinfer backend by default.
[2026-01-08 09:03:08] INFO server_args.py:2512: Set soft_watchdog_timeout since in CI
[2026-01-08 09:03:11] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
[2026-01-08 09:03:14] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-01-08 09:03:14] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-01-08 09:03:14] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2026-01-08 09:03:14] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-01-08 09:03:14] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-01-08 09:03:14] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-01-08 09:03:21] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:03,  1.26it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02,  1.20it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.18it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.16it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.43it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.31it/s]

Capturing batches (bs=1 avail_mem=60.84 GB): 100%|██████████| 3/3 [00:00<00:00,  3.68it/s]


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server's log level to warning; the default log level is info.
We are running these notebooks in a CI environment, so the throughput is not representative of actual performance.
Server started on http://localhost:38993
[15]:
set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))
[2026-01-08 09:03:34] Endpoint '/get_model_info' is deprecated and will be removed in a future version. Please use '/model_info' instead.

Ask a question about an image.

[16]:
@function
def image_qa(s, image_file, question):
    s += user(image(image_file) + question)
    s += assistant(gen("answer", max_tokens=256))


image_url = "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
image_bytes, _ = load_image(image_url)
state = image_qa(image_bytes, "What is in the image?")
print_highlight(state["answer"])
The image shows a man attending to folding or setting out drying clothes from a mid-mounted rack attached to the back of a yellow SUV. This vehicle retains its purpose monocycle, tying place unit, incorporating adjustable mountable phone hooks gripping rear flat creating hurdles fold peg lager euthanasia agama beta nest components inferential units portable institution stride cruising barrage analogy beta kernal parked signify invention discipline grows easing channel reserve acceptable diy insider ceased scrap metal lance organise tower murdered simpler variant agenda mapping interface realize stoneolest essential motive fetching eccentric handling downloads engine portability applications overhang exceptional increase engine mounting from cosmetic coaching panic rng merchmgr punitive temples progress financing estates pouco portrait guardian porridge minimal clipboard soil project mplications prugnel monitorable accommodate reinforcements remorseless pterygium reflector escort approach crest substation remedies jeopardy amplify flaw agitation namespaces followers reap implemented frantic cryptocurrency pundit diversion unreachable conquers classifications wrangles resonate yiimp object constraints uninstall execrable little obedience misrule [("()",oj({})
[17]:
terminate_process(server_process)