Query VLM with Offline Engine#
This tutorial demonstrates how to use SGLang’s offline Engine API to query VLMs, using Qwen2.5-VL and Llama 4 as examples. It covers three calling approaches:
Basic Call: Directly pass images and text.
Processor Output: Use HuggingFace processor for data preprocessing.
Precomputed Embeddings: Pre-calculate image features to improve inference efficiency.
Understanding the Three Input Formats#
SGLang supports three ways to pass visual data, each optimized for different scenarios:
1. Raw Images - Simplest approach#
Pass PIL Images, file paths, URLs, or base64 strings directly
SGLang handles all preprocessing automatically
Best for: Quick prototyping, simple applications
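For example, each of the following is a valid image_data entry for Engine.generate. This is a minimal sketch: local_copy.png is a hypothetical file, and the exact accepted encodings may vary across SGLang versions.

import base64
from io import BytesIO

import requests
from PIL import Image

url = "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png"

image_data = [Image.open(BytesIO(requests.get(url).content))]  # a PIL image
image_data = ["local_copy.png"]  # a local file path (hypothetical file)
image_data = [url]  # a URL
image_data = [base64.b64encode(requests.get(url).content).decode()]  # a base64 string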
2. Processor Output - For custom preprocessing#
Pre-process images with HuggingFace processor
Pass the complete processor output dict with format: "processor_output"
Best for: Custom image transformations, integration with existing pipelines
Requirement: Must use input_ids instead of a text prompt
3. Precomputed Embeddings - For maximum performance#
Pre-calculate visual embeddings using the vision encoder
Pass embeddings with format: "precomputed_embedding"
Best for: Repeated queries on the same images, caching, high-throughput serving
Performance gain: avoids redundant vision encoder computation (roughly a 30-50% speedup, depending on workload)
Key Rule: Within a single request, use only one format for all images. Don’t mix formats.
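Concretely, the three formats differ only in how each image_data entry is built. The calls below are a condensed sketch of the full examples that follow; prompt, pil_image, input_ids, processor_output, and precomputed_embeddings are all defined there.

# 1. Raw image: the entry is the image itself (PIL image, path, URL, or base64 string).
out = llm.generate(prompt=prompt, image_data=[pil_image])

# 2. Processor output: pass the processor's output dict tagged with a format key;
#    the request must use input_ids instead of a text prompt.
out = llm.generate(
    input_ids=input_ids,
    image_data=[dict(processor_output, format="processor_output")],
)

# 3. Precomputed embedding: like (2), plus the encoder features under "feature".
out = llm.generate(
    input_ids=input_ids,
    image_data=[
        dict(processor_output, format="precomputed_embedding", feature=precomputed_embeddings)
    ],
)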
The examples below demonstrate all three approaches with both Qwen2.5-VL and Llama 4 models.
Querying Qwen2.5-VL Model#
[1]:
import nest_asyncio
nest_asyncio.apply()
import sglang.test.doc_patch # noqa: F401
model_path = "Qwen/Qwen2.5-VL-3B-Instruct"
chat_template = "qwen2-vl"
example_image_url = "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png"
[2]:
from io import BytesIO
import requests
from PIL import Image
from sglang.srt.parser.conversation import chat_templates
image = Image.open(BytesIO(requests.get(example_image_url).content))
conv = chat_templates[chat_template].copy()
conv.append_message(conv.roles[0], f"What's shown here: {conv.image_token}?")
conv.append_message(conv.roles[1], "")
conv.image_data = [image]
print("Generated prompt text:")
print(conv.get_prompt())
print(f"\nImage size: {image.size}")
image
Generated prompt text:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What's shown here: <|vision_start|><|image_pad|><|vision_end|>?<|im_end|>
<|im_start|>assistant
Image size: (570, 380)
[2]:
Basic Offline Engine API Call#
[3]:
from sglang import Engine
llm = Engine(model_path=model_path, chat_template=chat_template, log_level="warning")
[2026-04-09 23:25:45] Ignore import error when loading sglang.srt.multimodal.processors.gemma4: cannot import name 'Gemma4AudioConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
[2026-04-09 23:25:47] Ignore import error when loading sglang.srt.models.gemma4_audio: cannot import name 'Gemma4AudioConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
[2026-04-09 23:25:47] Ignore import error when loading sglang.srt.models.gemma4_causal: cannot import name 'Gemma4TextConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
[2026-04-09 23:25:48] Ignore import error when loading sglang.srt.models.gemma4_mm: cannot import name 'Gemma4AudioConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
[2026-04-09 23:25:48] Ignore import error when loading sglang.srt.models.gemma4_vision: cannot import name 'Gemma4VisionConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Multi-thread loading shards: 100% Completed | 2/2 [00:01<00:00, 1.38it/s]
[4]:
out = llm.generate(prompt=conv.get_prompt(), image_data=[image])
print("Model response:")
print(out["text"])
[2026-04-09 23:26:01] Unexpected error during package walk: cutlass.cute.experimental
Model response:
The image shows two yellow taxis parked on a city street, with a person drying clothes outside one of the cabs using a clothesline. The scene likely takes place in an urban area in the United States, as indicated by the presence of an American flag. The taxicabs, known as "slim sonics" or "thick sonics" in the context, can be identified by their distinctive rectangular shape and the paint patterns on their rear.
Call with Processor Output#
Preprocess the text and image with a HuggingFace processor, then pass the processor output directly to Engine.generate.
[5]:
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
processor_output = processor(
images=[image], text=conv.get_prompt(), return_tensors="pt"
)
out = llm.generate(
input_ids=processor_output["input_ids"][0].detach().cpu().tolist(),
image_data=[dict(processor_output, format="processor_output")],
)
print("Response using processor output:")
print(out["text"])
Response using processor output:
I'm not sure what you're referring to. Can you clarify what the image shows or provide more context?
Call with Precomputed Embeddings#
You can precompute the image features once to avoid re-running the vision encoder on every request.
[6]:
from transformers import AutoProcessor
from transformers import Qwen2_5_VLForConditionalGeneration
processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path).eval()
vision = model.model.visual.cuda()
[7]:
processor_output = processor(
images=[image], text=conv.get_prompt(), return_tensors="pt"
)
input_ids = processor_output["input_ids"][0].detach().cpu().tolist()
precomputed_embeddings = vision(
processor_output["pixel_values"].cuda(), processor_output["image_grid_thw"].cuda()
)
precomputed_embeddings = precomputed_embeddings.pooler_output
multi_modal_item = dict(
processor_output,
format="precomputed_embedding",
feature=precomputed_embeddings,
)
out = llm.generate(input_ids=input_ids, image_data=[multi_modal_item])
print("Response using precomputed embeddings:")
print(out["text"])
llm.shutdown()
Response using precomputed embeddings:
The image you've provided shows a street scene in what appears to be a city, with several notable details:
1. **Taxi**: There is a yellow taxi, which is commonly associated with New York City. The typical yellow taxi cabs are found in Manhattan and other parts of the city.
2. **Clothing**: There are several pieces of clothing displayed in the foreground, with objects that resemble clothes hangers, indicating that someone might be inspecting or handling these items.
3. **Person**: There is a person wearing a yellow shirt. It's unclear whether the person is standing or moving, but attention has been drawn to
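When many prompts ask about the same image, the encoder output can be computed once and reused across requests. The ask_about_image helper below is a hypothetical sketch of this pattern, reusing llm, processor, vision, and image from the cells above (it would have to run before llm.shutdown()):

cached_feature = None

def ask_about_image(prompt_text):
    # prompt_text must contain the model's image placeholder tokens,
    # e.g. built with the conversation template as above.
    global cached_feature
    processor_output = processor(images=[image], text=prompt_text, return_tensors="pt")
    if cached_feature is None:
        # Run the vision encoder only for the first prompt; later calls reuse the features.
        cached_feature = vision(
            processor_output["pixel_values"].cuda(),
            processor_output["image_grid_thw"].cuda(),
        ).pooler_output
    item = dict(processor_output, format="precomputed_embedding", feature=cached_feature)
    out = llm.generate(
        input_ids=processor_output["input_ids"][0].detach().cpu().tolist(),
        image_data=[item],
    )
    return out["text"]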
Querying Llama 4 Vision Model#
model_path = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
chat_template = "llama-4"
from io import BytesIO
import requests
from PIL import Image
from sglang.srt.parser.conversation import chat_templates
# Download the same example image
image = Image.open(BytesIO(requests.get(example_image_url).content))
conv = chat_templates[chat_template].copy()
conv.append_message(conv.roles[0], f"What's shown here: {conv.image_token}?")
conv.append_message(conv.roles[1], "")
conv.image_data = [image]
print("Llama 4 generated prompt text:")
print(conv.get_prompt())
print(f"Image size: {image.size}")
image
Llama 4 Basic Call#
Llama 4 requires more computational resources, so the engine is configured with multi-GPU tensor parallelism (tp_size=4) and a larger context length.
llm = Engine(
model_path=model_path,
enable_multimodal=True,
attention_backend="fa3",
tp_size=4,
context_length=65536,
)
out = llm.generate(prompt=conv.get_prompt(), image_data=[image])
print("Llama 4 response:")
print(out["text"])
Call with Processor Output#
As with Qwen2.5-VL, preprocess the text and image with a HuggingFace processor and pass the processor output directly to Engine.generate.
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
processor_output = processor(
images=[image], text=conv.get_prompt(), return_tensors="pt"
)
out = llm.generate(
input_ids=processor_output["input_ids"][0].detach().cpu().tolist(),
image_data=[dict(processor_output, format="processor_output")],
)
print("Response using processor output:")
print(out["text"])
Call with Precomputed Embeddings#
from transformers import AutoProcessor
from transformers import Llama4ForConditionalGeneration
processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
model = Llama4ForConditionalGeneration.from_pretrained(
model_path, torch_dtype="auto"
).eval()
vision = model.vision_model.cuda()
multi_modal_projector = model.multi_modal_projector.cuda()
print(f'Image pixel values shape: {processor_output["pixel_values"].shape}')
input_ids = processor_output["input_ids"][0].detach().cpu().tolist()
# Process image through vision encoder
image_outputs = vision(
processor_output["pixel_values"].to("cuda"),
aspect_ratio_ids=processor_output["aspect_ratio_ids"].to("cuda"),
aspect_ratio_mask=processor_output["aspect_ratio_mask"].to("cuda"),
output_hidden_states=False
)
image_features = image_outputs.last_hidden_state
# Flatten image features and pass through multimodal projector
vision_flat = image_features.view(-1, image_features.size(-1))
precomputed_embeddings = multi_modal_projector(vision_flat)
# Build precomputed embedding data item
mm_item = dict(
processor_output,
format="precomputed_embedding",
feature=precomputed_embeddings
)
# Use precomputed embeddings for efficient inference
out = llm.generate(input_ids=input_ids, image_data=[mm_item])
print("Llama 4 precomputed embedding response:")
print(out["text"])