1. Model Introduction
Step-3.7-Flash is a 198B-parameter Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and supports a 256k context window with three selectable reasoning levels (low, medium, and high). The model is available in multiple quantization formats (BF16, FP8, NVFP4). Step-3.7-Flash is built for developers who need to scale agentic workflows that combine perception, search, and reasoning — from parsing massive financial reports in one pass, to running multi-step search loops with cross-source verification, to operating concurrent coding agents in high-throughput pipelines.2. SGLang Installation
Step-3.7-Flash is currently available in SGLang via Docker image install.Docker (NVIDIA)
Command
3. Model Deployment
This section provides deployment configurations optimized for different use cases.3.1 Basic Configuration
The Step-3.7-Flash series comes in one size with multiple quantization options. Recommended starting configurations vary depending on hardware. Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, and capabilities.3.2 Configuration Tips
- Memory: Requires GPUs with high VRAM capacity. Supported platforms: H200 (4x, TP=4), B200/B300 (4x, TP=4), GB200/GB300 (4x, TP=4).
- NVFP4 Quantization: NVFP4 provides the smallest memory footprint. Requires
--quantization modelopt_fp4 --kv-cache-dtype fp8_e4m3 --moe-runner-backend flashinfer_trtllm. - Trust Remote Code: All Step-3.7-Flash variants require
--trust-remote-codedue to the custom model architecture.
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:4.2 Advanced Usage
4.2.1 Multi-Modal Inputs
Step-3.7-Flash supports image inputs alongside text. Here’s a basic example:Example
Example
4.2.2 Reasoning Parser
Step-3.7-Flash supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:Command
Example
4.2.3 Tool Calling
Step-3.7-Flash supports tool calling capabilities. Enable the tool call parser: Start sglang server:Command
Example
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
