Query VLM with Offline Engine#

This tutorial shows how to use SGLang’s offline Engine API to query VLMs, using Qwen2.5-VL and Llama 4 as examples. It covers three calling approaches:

  1. Basic Call: Directly pass images and text.

  2. Processor Output: Use HuggingFace processor for data preprocessing.

  3. Precomputed Embeddings: Pre-calculate image features to improve inference efficiency.

Understanding the Three Input Formats#

SGLang supports three ways to pass visual data, each optimized for different scenarios:

1. Raw Images - Simplest approach#

  • Pass PIL Images, file paths, URLs, or base64 strings directly

  • SGLang handles all preprocessing automatically

  • Best for: Quick prototyping, simple applications

2. Processor Output - For custom preprocessing#

  • Pre-process images with HuggingFace processor

  • Pass the complete processor output dict with format: "processor_output"

  • Best for: Custom image transformations, integration with existing pipelines

  • Requirement: Must use input_ids instead of text prompt

3. Precomputed Embeddings - For maximum performance#

  • Pre-calculate visual embeddings using the vision encoder

  • Pass embeddings with format: "precomputed_embedding"

  • Best for: Repeated queries on same images, caching, high-throughput serving

  • Performance gain: Avoids redundant vision encoder computation (30-50% speedup)

Key Rule: Within a single request, use only one format for all images. Don’t mix formats.
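
At a glance, the three image_data shapes look like this (a minimal sketch; llm, prompt, input_ids, processor_output, and embeddings stand in for the objects built in the examples below):

# 1. Raw image: pass a PIL image, file path, URL, or base64 string directly.
out = llm.generate(prompt=prompt, image_data=[image])

# 2. Processor output: pass the full processor dict tagged with its format.
#    Requires input_ids instead of a text prompt.
out = llm.generate(
    input_ids=input_ids,
    image_data=[dict(processor_output, format="processor_output")],
)

# 3. Precomputed embeddings: add the embedding tensor as the "feature" field.
out = llm.generate(
    input_ids=input_ids,
    image_data=[
        dict(processor_output, format="precomputed_embedding", feature=embeddings)
    ],
)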

The examples below demonstrate all three approaches with both Qwen2.5-VL and Llama 4 models.

Querying Qwen2.5-VL Model#

[1]:
import nest_asyncio

nest_asyncio.apply()

import sglang.test.doc_patch  # noqa: F401

model_path = "Qwen/Qwen2.5-VL-3B-Instruct"
chat_template = "qwen2-vl"
example_image_url = "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png"
[2]:
from io import BytesIO
import requests
from PIL import Image

from sglang.srt.parser.conversation import chat_templates

image = Image.open(BytesIO(requests.get(example_image_url).content))

conv = chat_templates[chat_template].copy()
conv.append_message(conv.roles[0], f"What's shown here: {conv.image_token}?")
conv.append_message(conv.roles[1], "")
conv.image_data = [image]

print("Generated prompt text:")
print(conv.get_prompt())
print(f"\nImage size: {image.size}")
image
Generated prompt text:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What's shown here: <|vision_start|><|image_pad|><|vision_end|>?<|im_end|>
<|im_start|>assistant


Image size: (570, 380)
[2]:
(The example image is displayed here.)

Basic Offline Engine API Call#

[3]:
from sglang import Engine

llm = Engine(model_path=model_path, chat_template=chat_template, log_level="warning")
[2026-04-10 01:26:13] Ignore import error when loading sglang.srt.multimodal.processors.gemma4: cannot import name 'Gemma4AudioConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
[2026-04-10 01:26:16] Ignore import error when loading sglang.srt.models.gemma4_audio: cannot import name 'Gemma4AudioConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
[2026-04-10 01:26:16] Ignore import error when loading sglang.srt.models.gemma4_causal: cannot import name 'Gemma4TextConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
[2026-04-10 01:26:16] Ignore import error when loading sglang.srt.models.gemma4_mm: cannot import name 'Gemma4AudioConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
[2026-04-10 01:26:16] Ignore import error when loading sglang.srt.models.gemma4_vision: cannot import name 'Gemma4VisionConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Multi-thread loading shards: 100% Completed | 2/2 [00:01<00:00,  1.23it/s]
[4]:
out = llm.generate(prompt=conv.get_prompt(), image_data=[image])
print("Model response:")
print(out["text"])
[2026-04-10 01:26:30] Unexpected error during package walk: cutlass.cute.experimental
Model response:
The image shows a fire endurance test for an Uber car. A man is donning a Strawberry Letter 80 graphic shirt, helmet, and knee pads, and he is tentatively putting on his knee pads as he gets ready for the test. A yellow taxi cab and a grey brick building surround him. Several red flags and yellow bumper stickers are on the taxi's windows and doors in front of the tall brick building. A striped trash can with red, yellow, blue, and grey stripes is situated next to the street.
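
Engine.generate also accepts batched inputs. A hedged sketch, assuming that list-typed prompt and image_data mirror the single-request form, with one image list per prompt:

# Batched call: one prompt and one image list per request (assumed batching layout)
prompts = [conv.get_prompt(), conv.get_prompt()]
outs = llm.generate(prompt=prompts, image_data=[[image], [image]])
for o in outs:
    print(o["text"])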

Call with Processor Output#

Use a HuggingFace processor to preprocess the text and image, then pass the processor output directly to Engine.generate. Note that in this mode you pass input_ids instead of a text prompt.

[5]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
processor_output = processor(
    images=[image], text=conv.get_prompt(), return_tensors="pt"
)

out = llm.generate(
    input_ids=processor_output["input_ids"][0].detach().cpu().tolist(),
    image_data=[dict(processor_output, format="processor_output")],
)
print("Response using processor output:")
print(out["text"])
Response using processor output:
The image shows a yellow taxi in New York City, an iconic vehicle known for its presence on the streets of Manhattan. It appears to be involved in an unusual situation, as the taxi is equipped with two large signs, one on the top and another on the side, seemingly to convey an advertisement or message.

Call with Precomputed Embeddings#

You can precompute image features once and reuse them, avoiding repeated runs of the vision encoder.

[6]:
from transformers import AutoProcessor
from transformers import Qwen2_5_VLForConditionalGeneration

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path).eval()
vision = model.model.visual.cuda()
[7]:
processor_output = processor(
    images=[image], text=conv.get_prompt(), return_tensors="pt"
)

input_ids = processor_output["input_ids"][0].detach().cpu().tolist()

# The Qwen2.5-VL vision tower returns the image embeddings directly as a tensor
precomputed_embeddings = vision(
    processor_output["pixel_values"].cuda(), processor_output["image_grid_thw"].cuda()
)

multi_modal_item = dict(
    processor_output,
    format="precomputed_embedding",
    feature=precomputed_embeddings,
)

out = llm.generate(input_ids=input_ids, image_data=[multi_modal_item])
print("Response using precomputed embeddings:")
print(out["text"])

llm.shutdown()
Response using precomputed embeddings:
This image shows a scene of a street in what appears to be an urban area, potentially a city like New York. A yellow car is stopped on the road with its trunk open, and a man is handing bags to a cab driver. The car is a typical yellow taxi used for public transportation in many cities. This type of street scene is common in places where taxis are used economically and efficiently to transporte people, cargo, and sometimes baggage.
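
Because the embedding is an ordinary tensor, it can be computed once and reused across many requests, which is where the efficiency gain comes from. A minimal illustrative sketch (run before llm.shutdown(); the cache key and helper function are hypothetical):

# Illustrative cache: build each image's embedding item once, reuse it afterwards
embedding_cache = {}

def get_mm_item(image_key):
    # Construct the precomputed-embedding item only on the first request for this image
    if image_key not in embedding_cache:
        embedding_cache[image_key] = dict(
            processor_output,
            format="precomputed_embedding",
            feature=precomputed_embeddings,
        )
    return embedding_cache[image_key]

out = llm.generate(input_ids=input_ids, image_data=[get_mm_item("example_image")])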

Querying Llama 4 Vision Model#

model_path = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
chat_template = "llama-4"

from io import BytesIO
import requests
from PIL import Image

from sglang.srt.parser.conversation import chat_templates

# Download the same example image
image = Image.open(BytesIO(requests.get(example_image_url).content))

conv = chat_templates[chat_template].copy()
conv.append_message(conv.roles[0], f"What's shown here: {conv.image_token}?")
conv.append_message(conv.roles[1], "")
conv.image_data = [image]

print("Llama 4 generated prompt text:")
print(conv.get_prompt())
print(f"Image size: {image.size}")

image

Llama 4 Basic Call#

Llama 4 requires more computational resources, so the engine is configured with multi-GPU tensor parallelism (tp_size=4) and a larger context length.

llm = Engine(
    model_path=model_path,
    enable_multimodal=True,
    attention_backend="fa3",
    tp_size=4,
    context_length=65536,
)

out = llm.generate(prompt=conv.get_prompt(), image_data=[image])
print("Llama 4 response:")
print(out["text"])

Call with Processor Output#

Preprocessing the text and image with the HuggingFace processor ahead of time lets you reuse the processed inputs across requests and reduces preprocessing overhead at inference time.

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
processor_output = processor(
    images=[image], text=conv.get_prompt(), return_tensors="pt"
)

out = llm.generate(
    input_ids=processor_output["input_ids"][0].detach().cpu().tolist(),
    image_data=[dict(processor_output, format="processor_output")],
)
print("Response using processor output:")
print(out)

Call with Precomputed Embeddings#

from transformers import AutoProcessor
from transformers import Llama4ForConditionalGeneration

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto"
).eval()

vision = model.vision_model.cuda()
multi_modal_projector = model.multi_modal_projector.cuda()

print(f'Image pixel values shape: {processor_output["pixel_values"].shape}')
input_ids = processor_output["input_ids"][0].detach().cpu().tolist()

# Process image through vision encoder
image_outputs = vision(
    processor_output["pixel_values"].to("cuda"),
    output_hidden_states=False,
)
image_features = image_outputs.last_hidden_state

# Flatten image features and pass through multimodal projector
vision_flat = image_features.view(-1, image_features.size(-1))
precomputed_embeddings = multi_modal_projector(vision_flat)

# Build precomputed embedding data item
mm_item = dict(
    processor_output,
    format="precomputed_embedding",
    feature=precomputed_embeddings
)

# Use precomputed embeddings for efficient inference
out = llm.generate(input_ids=input_ids, image_data=[mm_item])
print("Llama 4 precomputed embedding response:")
print(out["text"])