# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
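
To make the "custom server on top of the engine" idea concrete, here is a minimal sketch that uses only the Python standard library. It is not the linked `custom_server.py` example. `EchoEngine`, `handle_generate`, `make_handler`, and `serve` are hypothetical names introduced for illustration, and `EchoEngine` is a stand-in so the sketch runs without a model; it assumes that a single-prompt `generate` call returns a dict with a `"text"` key, as the outputs later in this document suggest.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Assumption: a single prompt yields a dict with a "text" field.
        return {"text": f" (echo) {prompt}"}


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(raw_body)
    output = engine.generate(request["prompt"], request.get("sampling_params"))
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Block forever, answering POST requests with generated text."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()


# Exercise the request logic directly, without opening a socket.
response = handle_generate(EchoEngine(), b'{"prompt": "Hello"}')
print(response.decode("utf-8"))
```

Keeping the request logic in a plain function such as `handle_generate` makes it easy to test separately from the network layer; calling `serve(engine)` with a real engine object in place of `EchoEngine` would be the next step, under the same assumptions.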



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
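
The reason for this patch can be reproduced with the standard library alone. The sketch below (the names `inner` and `outer` are illustrative) calls `asyncio.run` from inside a coroutine, which is roughly the situation in an IPython or Jupyter cell where an event loop is already running. Without `nest_asyncio`, Python refuses to start a second loop and raises `RuntimeError`.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, much like inside an IPython cell.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches the event loop so that such nested calls are allowed, which is why it has to run before the engine is used in those environments.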

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Sarah and I'm from the United States. I'm here to share with you what my life has been like so far and what I have learned throughout this wonderful journey. As I've said before, I have had a great time in China. I have been here for over a year now. The school I'm studying is in Nantong. It's a beautiful city. It's home to some amazing places and people. On the first day of school, we were learning to get along with different cultures. We were learning to be more open to new ideas. I think that was very important. It's not easy to be open
Prompt: The president of the United States is
Generated text:  a title given to the highest official of the government in the executive branch of the federal government of the United States. It is the most senior of the two branches of the government. The vice president is the second most senior official of the government.

The president is elected to a two-year term, except when it is determined that the in
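
The `sampling_params` dictionary in the cell above sets `temperature` and `top_p` without explaining them. As a rough illustration of what these two settings conventionally mean, the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy list of logits. This is not SGLang's sampler; the function name `apply_temperature_and_top_p` and the toy logits are assumptions made for this example, and it assumes `temperature > 0`.

```python
import math


def apply_temperature_and_top_p(logits, temperature, top_p):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the most likely tokens until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four tokens
print(apply_temperature_and_top_p(toy_logits, temperature=1.0, top_p=0.9))
print(apply_temperature_and_top_p(toy_logits, temperature=0.2, top_p=0.9))
```

With the lower temperature the distribution concentrates on the top token, which is in line with the more repetitive, deterministic-looking outputs in the sections that use `temperature: 0.2`.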

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [age], [gender], [nationality], [occupation], and I have [number] years of experience in [field of work]. I'm always looking for new challenges and opportunities to grow and learn. What do you do for a living? I'm always looking for new challenges and opportunities to grow and learn. What do you enjoy doing? I enjoy [job title], and I'm always looking for new challenges and opportunities to grow and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French Quarter. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. It is also known for its cuisine, including its famous croissants and its famous French fries. The city is home to many famous French artists, including Picasso and Van Gogh, and is a major center for the arts. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences.

2. Enhanced machine learning: Machine learning algorithms will become even more sophisticated, allowing AI systems to learn from data and make more accurate predictions and decisions.

3. Improved natural language processing: Natural language processing will become even more advanced, allowing AI systems to understand and respond to human language in ways that are more intuitive and natural.

4. Increased use of AI in healthcare: AI will be used to improve the accuracy and
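
The cell above mentions "overlap removal" when merging streamed chunks. One way such merging could work is sketched below: before appending a new chunk, drop the longest prefix of the chunk that already appears at the end of the accumulated text. The helper names `trim_overlap_sketch` and `merge_chunks` and the sample chunks are assumptions for illustration; this is not presented as the implementation behind `stream_and_merge`.

```python
def trim_overlap_sketch(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous one.
chunks = [" Paris", " Paris, known", " known for its"]
print(repr(merge_chunks(chunks)))
```

A heuristic like this can also remove text that is legitimately repeated at a chunk boundary (for example a doubled word), so it is best treated as a display aid rather than an exact reconstruction.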



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert first name] and I'm a [insert profession or career] who have a strong passion for [insert something about your career or interests]. I have always been interested in learning more about the world and have always been drawn to [insert what you like to do]. I enjoy [insert why you enjoy doing what you do] and I strive to [insert what you plan to do next]. What’s your name? How do you get started? Here’s how I get started: [insert how you get started] I hope this short, neutral self-introduction is a good start. You can expand on your personality and interests by

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the world’s third largest city and the largest metropolitan area in the European Union. It is also the seat of government, of the French Government, of the European Parliament, of the French
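
A practical benefit of the asynchronous interface is that awaiting a generation call leaves the event loop free to run other coroutines. The sketch below shows that pattern with `asyncio.gather`, using a hypothetical `FakeAsyncEngine` whose `async_generate` only sleeps and echoes the prompt, so the example runs without a model. The class, the `generate_all` helper, and the placeholder completions are all assumptions for illustration.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in exposing an async_generate-like coroutine."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # pretend to wait for the model
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Start one request per prompt and wait for all of them together.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(
    generate_all(FakeAsyncEngine(), prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, output in zip(prompts, outputs):
    print(f"{prompt} ->{output['text']}")
```

Because `asyncio.gather` preserves input order, the results can be zipped with the prompts exactly as in the cell above.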

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  __________ and I'm a/an _____________________.

I'm excited to meet you!

As an AI language model, I'm here to provide information and assist you with any questions you may have. How can I help you today?

I'm happy to introduce myself as an AI language model. My name is Caffeinated AI and I'm here to help you with any questions or concerns you may have. How can I assist you today?

I'm confident that I can provide you with useful information and answer any questions you may have. Let me know if there's anything specific you'd like to know or if you have any other



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Its population is about 2.7 million, and it is the largest city in Europe. Paris is known for its rich history, famous landmarks, and annual Parisian festivals. It is also the seat of government and culture for much of France. The city is characterized by its stunning architecture, vibrant arts scene, and cultural diversity. It is often referred to as the "City of Light" due to its numerous cinemas, theaters, and nightclubs. Paris is a major international hub for fashion, entertainment, and technology, and has played a significant role in shaping French and global culture. It is also the birthplace of Napoleon



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several key trends, including:

1. Increased automation: AI is expected to become more integrated into our daily lives, and will likely automate many tasks that are currently done by humans. This may include tasks such as logistics, manufacturing, and healthcare, which are currently done by people. However, it's also possible that AI will also be used to automate certain jobs that are repetitive or can be done by machines, thus freeing up more human time for other tasks.

2. Improved privacy and security: As AI becomes more integrated into our daily lives, there is a risk that it may also be used to collect and analyze



```python
llm.shutdown()
```
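
In a script, it can be useful to guarantee that `shutdown()` runs even when generation raises an exception. A minimal sketch of that pattern with a context manager follows. `FakeEngine` and `engine_session` are hypothetical names introduced here so the example runs without a model; only the `shutdown()` method name comes from the cell above.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() on exit, even on errors."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with engine_session(engine):
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print("shut down:", engine.is_shut_down)
```

The same `try`/`finally` structure works without the helper; the context manager just keeps the cleanup next to the code that owns the engine.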