1. Model Introduction
Step-3.5-Flash is StepFun’s production-grade reasoning model, built to decouple elite intelligence from heavy compute: it cuts attention cost for low-latency, cost-effective long-context inference and is purpose-built for autonomous agents in real-world workflows. The model is available in multiple quantization formats optimized for different hardware platforms. This generation delivers comprehensive upgrades across the board:
- Hybrid Attention Architecture: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) at a 3:1 ratio with an aggressive 128-token window. This hybrid approach maintains consistent performance across massive datasets and long codebases while significantly reducing the computational overhead typical of standard long-context models.
- Sparse Mixture-of-Experts: Only 11B active parameters out of 196B parameters.
- Multi-Layer Multi-Token Prediction (MTP): Equipped with a 3-way Multi-Token Prediction (MTP-3). This allows for complex, multi-step reasoning chains with immediate responsiveness.
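To make the hybrid attention scheme above concrete, here is a minimal sketch of how a 3:1 SWA/GA layer stack constrains which tokens each query can attend to. The function name, the exact layer ordering (three SWA layers then one GA layer), and the mask construction are illustrative assumptions, not the model's actual implementation.

```python
import numpy as np

def attention_mask(seq_len, layer_idx, window=128, swa_per_ga=3):
    """Build a causal attention mask for one layer of a hybrid stack.

    Layers cycle through `swa_per_ga` sliding-window (SWA) layers
    followed by one global-attention (GA) layer, matching the 3:1
    ratio described above; `window` is the 128-token SWA lookback.
    Ordering and names are assumptions for illustration only.
    """
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    if layer_idx % (swa_per_ga + 1) < swa_per_ga:
        # SWA layer: each query sees only the last `window` tokens
        return causal & (i - j < window)
    # GA layer: full causal attention over the whole context
    return causal

# A query at position 200 in an SWA layer cannot see token 0,
# but the GA layer can; this is where long-range recall lives.
swa = attention_mask(256, layer_idx=0)
ga = attention_mask(256, layer_idx=3)
print(swa[200, 0], ga[200, 0])  # False True
```

The compute saving comes from the SWA layers: their key-value range is capped at 128 tokens regardless of context length, so only every fourth layer pays the full quadratic attention cost.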
2. SGLang Installation
Step-3.5-Flash is currently available in SGLang via Docker image install.
Docker (NVIDIA)
Command
Docker (AMD ROCm)
Command
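As a reference point, the Docker-based install typically looks like the following. The AMD image tags are the ones given in the configuration tips in section 3.2; the NVIDIA tag is an assumption — check the SGLang releases for the current image.

```shell
# NVIDIA: pull the SGLang image (tag is illustrative; use the current release)
docker pull lmsysorg/sglang:latest

# AMD ROCm: tags as listed in section 3.2
docker pull lmsysorg/sglang:v0.5.9-rocm700-mi30x   # MI300X / MI325X
docker pull lmsysorg/sglang:v0.5.9-rocm700-mi35x   # MI350X / MI355X
```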
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
The Step-3.5-Flash series comes in only one size. Recommended starting configurations vary depending on hardware.
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model size, quantization method, and thinking capabilities.
3.2 Configuration Tips
- Memory: Requires GPUs with high VRAM capacity. Supported platforms: H200 (4×, TP=4), MI300X/MI325X/MI350X/MI355X (4×, TP=4 EP=4).
- AMD Docker Image: Use lmsysorg/sglang:v0.5.9-rocm700-mi30x for MI300X/MI325X and lmsysorg/sglang:v0.5.9-rocm700-mi35x for MI350X/MI355X.
- AMD Expert Parallelism Required: On AMD GPUs, always use --ep 4 with --tp 4. Both BF16 and FP8 models require expert parallelism. Without EP, the MoE intermediate dimension is split across GPUs (N=320), which triggers an AITER CK GEMM incompatibility. With EP=4, each GPU handles 72 full experts (N=1280), which works correctly with CUDA graph enabled.
- AITER JIT Compilation: The first inference on AMD may take 30-40 seconds for AITER kernel JIT compilation. Subsequent requests use cached kernels.
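Putting the tips above together, a launch command on 4 AMD GPUs might look like the following sketch. The model path is a placeholder assumption; --tp 4 and --ep 4 are the flags named in the tips above.

```shell
# 4-GPU AMD deployment with the required expert parallelism (EP=4)
# Model ID below is a placeholder; substitute your local path or hub ID.
python -m sglang.launch_server \
  --model-path stepfun-ai/Step-3.5-Flash \
  --tp 4 \
  --ep 4 \
  --host 0.0.0.0 --port 30000
```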
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:
4.2 Advanced Usage
4.2.1 Reasoning Parser
Step-3.5-Flash only supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
Command
Example
Output
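To illustrate what the reasoning parser does server-side, here is a minimal local stand-in that splits a raw completion into a (reasoning, content) pair. The think-tag format is an assumption for illustration; in production the server performs this split for you and returns the reasoning separately from the final content.

```python
def split_reasoning(raw: str, open_tag="<think>", close_tag="</think>"):
    """Separate the thinking section from the final answer.

    Mimics what a server-side reasoning parser produces: a
    (reasoning, content) pair. Tag names are assumptions.
    """
    if open_tag in raw and close_tag in raw:
        start = raw.index(open_tag) + len(open_tag)
        end = raw.index(close_tag)
        reasoning = raw[start:end].strip()
        content = raw[end + len(close_tag):].strip()
        return reasoning, content
    # No thinking section found: everything is content
    return "", raw.strip()

reasoning, content = split_reasoning(
    "<think>2+2 is 4 because of basic addition.</think>The answer is 4."
)
print(reasoning)  # 2+2 is 4 because of basic addition.
print(content)    # The answer is 4.
```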
4.2.2 Tool Calling
Step-3.5 supports tool calling capabilities. Enable the tool call parser.
Python Example: Start the sglang server:
Command
Example
Output
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
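The execute-and-return loop described in the notes above can be sketched locally. The tool name get_weather, its behavior, and the OpenAI-style name/arguments dict shape are illustrative assumptions.

```python
import json

# Hypothetical local tool the model may ask us to call
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def run_tool_call(tool_call: dict) -> str:
    """Dispatch one parsed tool call (OpenAI-style dict with a
    function name and JSON-encoded arguments) and return its result,
    ready to be sent back to the model as a tool message."""
    fn = TOOLS[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return fn(**args)

# A tool call in the shape a tool-call parser typically emits
call = {"name": "get_weather", "arguments": '{"city": "Paris"}'}
result = run_tool_call(call)
print(result)  # Sunny in Paris
# Next step: append the result as a tool-role message and re-invoke
# the model so it can continue the conversation with the answer.
```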
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: NVIDIA H200 GPU (4x)
- Model: Step-3.5-Flash
- Tensor Parallelism: 4
- Expert Parallelism: 4
- sglang version: 0.5.8
5.1.1 Standard Scenario Benchmark
- Model Deployment Command:
Command
5.1.1.1 Low Concurrency
- Benchmark Command:
Command
- Test Results:
Output
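SGLang ships a serving benchmark module that is commonly used for runs like the one above. The following is a sketch only: the dataset choice, prompt counts, lengths, and concurrency value are illustrative assumptions, not the command used to produce the results here.

```shell
# Low-concurrency serving benchmark against a running server
# (all numeric values are illustrative)
python -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --num-prompts 32 \
  --max-concurrency 1 \
  --random-input-len 1024 --random-output-len 512
```

The medium- and high-concurrency runs below typically vary only --num-prompts and --max-concurrency.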
5.1.1.2 Medium Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.1.1.3 High Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Benchmark Command:
Command
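SGLang includes a few-shot GSM8K accuracy test that can be pointed at a running server. A hedged sketch of such a run follows; the question count and endpoint are illustrative assumptions.

```shell
# GSM8K few-shot accuracy check against a running server (values illustrative)
python -m sglang.test.few_shot_gsm8k \
  --num-questions 200 \
  --host http://127.0.0.1 --port 30000
```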
- Results:
- Step-3.5-Flash
