1. Model Introduction
Qwen2.5-VL is a vision-language model series from the Qwen team, offering significant improvements over its predecessor in understanding, reasoning, and multi-modal processing.
Key Features:
- Visual understanding: Proficient at recognizing common objects such as flowers, birds, fish, and insects, and highly capable of analyzing text, charts, icons, graphics, and layouts within images.
- More Agentic: Acts as a visual agent that can reason and dynamically direct tools, enabling computer use and phone use.
- Understanding long videos and capturing events: Comprehends videos over one hour long, and can now capture events by pinpointing the relevant video segments.
- Visual localization in different formats: Accurately localizes objects in an image by generating bounding boxes or points, and provides stable JSON outputs for coordinates and attributes.
- Generating structured outputs: Produces structured outputs from content such as scans of invoices, forms, and tables, benefiting use cases in finance, commerce, and other domains.
- Dynamic resolution and frame-rate training for video understanding: Extends dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, mRoPE is updated in the time dimension with IDs and absolute time alignment, allowing the model to learn temporal sequence and speed, and ultimately to pinpoint specific moments.
- Multiple Sizes: Available in 3B, 7B, 32B, and 72B variants to suit different deployment needs.
- ROCm Support: Compatible with AMD MI300X, MI325X and MI355X GPUs via SGLang (verified).
2. SGLang Installation
SGLang offers multiple installation methods. Choose the one that best fits your hardware platform and requirements; please refer to the official SGLang installation guide for instructions.

3. Model Deployment
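As a minimal sketch, a generic pip-based install looks like the following; the package extras and any ROCm-specific wheels or Docker images vary by SGLang version, so treat this as an assumption and defer to the official installation guide for AMD platforms:

```shell
# Generic pip install (assumed; on AMD ROCm platforms the official guide
# may recommend version-specific wheels or Docker images instead).
pip install --upgrade pip
pip install "sglang[all]"
```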
This section provides deployment configurations optimized for AMD MI300X, MI325X, and MI355X hardware platforms and for different use cases.

3.1 Basic Configuration
The Qwen2.5-VL series offers models in various sizes. The following configurations have been verified on AMD MI300X, MI325X, and MI355X GPUs.
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and model size.

3.2 Configuration Tips
- Memory Management: For the 72B model on MI300X/MI325X/MI355X, we have verified successful deployment with `--context-length 128000`. Smaller context lengths can be used to reduce memory usage if needed.
- Multi-GPU Deployment: Use tensor parallelism (`--tp`) to scale across multiple GPUs. For example, use `--tp 8` for the 72B model and `--tp 2` for the 32B model on MI300X/MI325X/MI355X.
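Putting these tips together, a launch command for the 72B model on an 8-GPU node might look like the following sketch; the `--tp` and `--context-length` values come from the tips above, while the port is an arbitrary placeholder:

```shell
# Sketch of a multi-GPU deployment command for Qwen2.5-VL-72B-Instruct.
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-VL-72B-Instruct \
  --tp 8 \
  --context-length 128000 \
  --port 30000
```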
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:

4.2 Advanced Usage
4.2.1 Multi-Modal Inputs
Qwen2.5-VL supports image inputs. Here's a basic example with a single image input:
Example
Output
Example
Output
- You can also provide local file paths using the `file://` protocol.
- Larger images may require more memory; adjust `--mem-fraction-static` accordingly.
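To make the notes above concrete, here is a minimal sketch of building an OpenAI-compatible chat request with a single image input for an SGLang server. The endpoint URL, model name, and image URL are placeholder assumptions, not values from this document:

```python
import json

# Placeholder values -- substitute your own deployment details.
BASE_URL = "http://localhost:30000/v1/chat/completions"  # assumed default SGLang port
MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"

def build_image_request(image_url: str, question: str) -> dict:
    """Build an OpenAI-compatible chat payload with one image input.

    A local file can be referenced with the file:// protocol instead of https://.
    """
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 128,
    }

payload = build_image_request("https://example.com/demo.jpg", "Describe this image.")
print(json.dumps(payload, indent=2))
```

Send the payload as a POST request to the server's `/v1/chat/completions` endpoint with any HTTP client (e.g. `curl` or `requests`).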
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: AMD MI300X GPU (8x)
- Model: Qwen2.5-VL-72B-Instruct
- Tensor Parallelism: 8
- SGLang Version: 0.5.6
5.1.1 Latency-Sensitive Benchmark
- Model Deployment Command:
Command
- Benchmark Command:
Command
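As an illustrative sketch, a latency-oriented run against the deployed server can use SGLang's serving benchmark at a low request rate; exact flag names vary by SGLang version and are an assumption here:

```shell
# Assumed invocation of SGLang's serving benchmark; verify flag names
# against your installed version (python -m sglang.bench_serving --help).
python -m sglang.bench_serving \
  --backend sglang \
  --num-prompts 32 \
  --request-rate 1
```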
5.1.2 Throughput-Sensitive Benchmark
- Model Deployment Command:
Command
- Result:
Output
- Benchmark Command:
Command
Output
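For a throughput-oriented measurement, the same serving benchmark can be driven without a request-rate cap so that requests arrive as fast as possible; again, the flags are assumptions to verify against your SGLang version:

```shell
# Assumed throughput-style invocation; 'inf' submits all requests at once.
python -m sglang.bench_serving \
  --backend sglang \
  --num-prompts 512 \
  --request-rate inf
```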
5.2 Accuracy Benchmark
5.2.1 MMMU Benchmark
You can evaluate the model's accuracy using the MMMU dataset:
- Benchmark Command:
Command
Output
