1. Model Introduction
Step3-VL-10B is a lightweight open-source multimodal model developed by StepFun, designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. Despite its 10B-parameter footprint, Step3-VL-10B excels in visual perception, complex reasoning, and human-centric alignment. Key highlights of Step3-VL-10B include:
- STEM Reasoning: Achieves 94.43% on AIME 2025 and 75.95% on MathVision (with PaCoRe), demonstrating exceptional complex reasoning capabilities that outperform models 10×–20× larger.
- Visual Perception: Records 92.05% on MMBench and 80.11% on MMMU, establishing strong general visual understanding and multimodal reasoning.
- GUI & OCR: Delivers state-of-the-art performance on ScreenSpot-V2 (92.61%), ScreenSpot-Pro (51.55%), and OCRBench (86.75%), optimized for agentic and document understanding tasks.
- Spatial Understanding: Demonstrates emergent spatial awareness with 66.79% on BLINK and 57.21% on All-Angles-Bench, establishing strong potential for embodied intelligence applications.
2. SGLang Installation
SGLang offers multiple installation methods; you can choose the one that best suits your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions.
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
Step3-VL-10B is a compact 10B dense model that can run on a single GPU. Recommended starting configurations vary depending on hardware.
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and quantization method. SGLang supports serving Step3-VL-10B on NVIDIA B200, H200, H100, and AMD MI355X, MI325X, MI300X GPUs.
3.2 Configuration Tips
- Single GPU Deployment: Step3-VL-10B fits comfortably on a single GPU with BF16 precision, no tensor parallelism required.
- Memory Management: Set a lower `--context-length` to conserve memory if needed. A value of `32768` is sufficient for most scenarios.
- FP8 Quantization: Use FP8 quantization to further reduce memory usage while maintaining quality.
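As an illustration, the tips above can be combined into a single-GPU launch invocation. The sketch below only assembles and prints the command line; the flag names follow `sglang.launch_server` conventions and the port is SGLang's default, but both should be verified against your installed SGLang version.

```python
import shlex

# Minimal single-GPU launch sketch for Step3-VL-10B (BF16, no tensor
# parallelism). Flag names are assumptions based on sglang.launch_server
# conventions; verify them against your SGLang version.
cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "stepfun-ai/Step3-VL-10B",
    "--context-length", "32768",  # lower context length to conserve memory
    "--port", "30000",            # SGLang's default serving port
]
print(shlex.join(cmd))
```

For FP8 deployments, the same command would additionally select an FP8 quantization option as described in the tip above.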
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:
4.2 Advanced Usage
4.2.1 Multi-Modal Inputs
Step3-VL-10B supports image inputs. Here’s a basic example with image input:
Example
Output
Example
Output
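To make the request shape concrete, an image-plus-text chat request can be expressed through SGLang's OpenAI-compatible API. The sketch below only builds and prints the payload; the endpoint URL (assuming SGLang's default port 30000) and the image URL are placeholders, and the actual send is left as a comment.

```python
import json

# Assumed endpoint: SGLang exposes an OpenAI-compatible API; adjust
# host/port to match your deployment.
URL = "http://localhost:30000/v1/chat/completions"

# Chat request mixing an image and text, using the OpenAI-style
# "image_url" content-part convention.
payload = {
    "model": "stepfun-ai/Step3-VL-10B",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.jpg"}},  # placeholder image
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 256,
}

print(json.dumps(payload, indent=2))
# To send: requests.post(URL, json=payload).json()
```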
4.2.2 Reasoning Parser
Step3-VL-10B supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
Command
Example
Output
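Once the reasoning parser is enabled, the thinking and the final answer arrive as separate fields in the response. The sketch below assumes the OpenAI-style `reasoning_content` field that SGLang's reasoning parsers emit (verify the field name for your version) and uses a hard-coded illustrative response rather than a live server.

```python
# Illustrative chat-completion response with the reasoning parser enabled:
# the model's thinking lands in "reasoning_content", the answer in "content".
sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "The user asks for 2+2; that is 4.",
                "content": "2 + 2 = 4.",
            }
        }
    ]
}

message = sample_response["choices"][0]["message"]
thinking = message.get("reasoning_content", "")  # empty if parser disabled
answer = message["content"]
print("THINKING:", thinking)
print("ANSWER:", answer)
```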
4.2.3 Tool Calling
Step3-VL-10B supports tool calling capabilities. Enable the tool call parser:
Command
Example
Output
Example
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
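The loop described in the bullets above can be sketched end to end. The response shape below follows the OpenAI tool-call convention that SGLang's tool-call parser produces, and `get_weather` is a hypothetical local tool used only for illustration.

```python
import json

# Illustrative assistant message containing a tool call (OpenAI-style
# shape, as emitted when a tool-call parser is enabled at launch).
sample_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": "{\"city\": \"Beijing\"}",
            },
        }
    ],
}

# Hypothetical local implementation of the tool.
def get_weather(city):
    return f"Sunny in {city}"

# Extract the function name and arguments, then execute locally.
call = sample_message["tool_calls"][0]["function"]
args = json.loads(call["arguments"])
result = get_weather(**args)

# Send the result back as a "tool" role message to continue the chat.
followup = {"role": "tool", "tool_call_id": "call_0", "content": result}
print(followup)
```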
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: NVIDIA B200 GPU (1x)
- Model: stepfun-ai/Step3-VL-10B
- Tensor Parallelism: 1
- sglang version: 0.5.8+
5.1.1 Latency-Sensitive Benchmark
- Model Deployment Command:
Command
- Benchmark Command:
Command
- Result:
Output
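For reference, latency-oriented runs are commonly driven with SGLang's `bench_serving` client. The sketch below just assembles and prints an illustrative command line; the flag names are typical `bench_serving` options and should be checked against your installed version.

```python
import shlex

# Illustrative latency-oriented benchmark invocation. Flag names are
# assumptions based on common sglang.bench_serving options; verify
# against your SGLang version.
cmd = [
    "python", "-m", "sglang.bench_serving",
    "--backend", "sglang",
    "--host", "127.0.0.1", "--port", "30000",
    "--dataset-name", "random",
    "--random-input-len", "1024",
    "--random-output-len", "512",
    "--num-prompts", "32",
    "--max-concurrency", "1",  # low concurrency isolates per-request latency
]
print(shlex.join(cmd))
```

A throughput-sensitive run would typically reuse the same client with a much higher `--num-prompts` and `--max-concurrency`.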
5.1.2 Throughput-Sensitive Benchmark
- Benchmark Command:
Command
- Result:
Output
5.2 Accuracy Benchmark
5.2.1 MMMU Benchmark
You can evaluate the model’s accuracy using the MMMU dataset:
- Model Deployment Command:
Command
- Benchmark Command:
Command
- Result:
Output
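Whatever harness runs the MMMU evaluation, the reported score reduces to simple choice accuracy: the fraction of predicted answer letters matching the gold answers. A minimal, self-contained sketch with made-up predictions:

```python
# MMMU-style multiple-choice scoring: accuracy is the fraction of
# predicted letters that match the gold letters. Data is illustrative.
predictions = {"q1": "A", "q2": "C", "q3": "B"}  # model outputs
gold = {"q1": "A", "q2": "B", "q3": "B"}         # reference answers

correct = sum(predictions[q] == gold[q] for q in gold)
accuracy = correct / len(gold)
print(f"accuracy = {accuracy:.2%}")  # prints: accuracy = 66.67%
```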
