1. Model Introduction
Step-3.7-Flash is a 198B-parameter Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and supports a 256k context window with three selectable reasoning levels (low, medium, and high). The model is available in multiple quantization formats (BF16, FP8, NVFP4).
Step-3.7-Flash is built for developers who need to scale agentic workflows that combine perception, search, and reasoning — from parsing massive financial reports in one pass, to running multi-step search loops with cross-source verification, to operating concurrent coding agents in high-throughput pipelines.
2. SGLang Installation
Step-3.7-Flash is currently available in SGLang via a Docker image.
Docker (NVIDIA)
# Pull the Docker image
docker pull lmsysorg/sglang:dev-step-3.7-flash

# Launch the container
docker run -it --gpus all \
  --shm-size=32g \
  --ipc=host \
  --network=host \
  lmsysorg/sglang:dev-step-3.7-flash bash
3. Model Deployment
This section provides deployment configurations optimized for different use cases.
3.1 Basic Configuration
The Step-3.7-Flash series comes in one size with multiple quantization options. Recommended starting configurations vary depending on hardware.
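As a rough guide to how the quantization options differ in memory footprint, the sketch below estimates weight-only memory from the 198B parameter count given in the introduction. It assumes nominal sizes of 2 bytes per parameter for BF16, 1 for FP8, and 0.5 for NVFP4, ignores quantization scale factors, the KV cache, and activations, and assumes weights shard evenly across tensor-parallel ranks. `estimate_weight_gb` is a hypothetical helper, not part of SGLang.

```python
# Rough weight-only memory estimate per quantization format.
# Assumptions (not taken from the model card): nominal bytes per parameter,
# no scale-factor overhead, no KV cache or activations, even TP sharding.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

TOTAL_PARAMS = 198e9  # total parameter count from the model introduction


def estimate_weight_gb(fmt: str, tp: int = 1, params: float = TOTAL_PARAMS) -> float:
    """Return approximate weight memory per GPU in GB (1 GB = 1e9 bytes)."""
    if tp < 1:
        raise ValueError("tp must be >= 1")
    return params * BYTES_PER_PARAM[fmt] / tp / 1e9


if __name__ == "__main__":
    for fmt in BYTES_PER_PARAM:
        total = estimate_weight_gb(fmt)
        per_gpu = estimate_weight_gb(fmt, tp=4)
        print(f"{fmt}: ~{total:.0f} GB total, ~{per_gpu:.0f} GB per GPU at TP=4")
```

Actual requirements will be higher once the KV cache and runtime buffers are included, so treat these figures as a lower bound when choosing hardware.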
3.2 Configuration Tips
- Memory: Requires GPUs with high VRAM capacity. Supported platforms: H200 (4x, TP=4), B200/B300 (4x, TP=4), GB200/GB300 (4x, TP=4).
- NVFP4 Quantization: NVFP4 provides the smallest memory footprint. Requires --quantization modelopt_fp4 --kv-cache-dtype fp8_e4m3 --moe-runner-backend flashinfer_trtllm.
- Trust Remote Code: All Step-3.7-Flash variants require --trust-remote-code due to the custom model architecture.
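The tips above can be combined into a launch command. The following is a minimal sketch that assembles `sglang serve` arguments using only flags that appear in this guide; `build_serve_command` is a hypothetical helper, and the NVFP4 checkpoint path in the usage line is a placeholder rather than a real repository name.

```python
import shlex


def build_serve_command(
    model_path: str,
    tp: int = 4,
    nvfp4: bool = False,
    reasoning_parser: bool = False,
    tool_call_parser: bool = False,
) -> list[str]:
    """Assemble `sglang serve` arguments from the flags listed in this guide."""
    cmd = [
        "sglang", "serve",
        "--model-path", model_path,
        "--tp", str(tp),
        "--trust-remote-code",  # required by all Step-3.7-Flash variants
    ]
    if nvfp4:
        # Flags the guide lists as required for NVFP4 checkpoints.
        cmd += [
            "--quantization", "modelopt_fp4",
            "--kv-cache-dtype", "fp8_e4m3",
            "--moe-runner-backend", "flashinfer_trtllm",
        ]
    if reasoning_parser:
        cmd += ["--reasoning-parser", "step3p5"]
    if tool_call_parser:
        cmd += ["--tool-call-parser", "step3p5"]
    return cmd


if __name__ == "__main__":
    # "path/to/nvfp4-checkpoint" is a placeholder, not a real repository name.
    print(shlex.join(build_serve_command("path/to/nvfp4-checkpoint", nvfp4=True)))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.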
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to the SGLang documentation.
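As a quick illustration, a text-only request might look like the sketch below. It assumes a server is already running at `http://localhost:30000/v1`, as in the other examples in this guide; `build_text_request` and `ask` are hypothetical helper names.

```python
def build_text_request(prompt: str, max_tokens: int = 512) -> dict:
    """Build keyword arguments for a text-only chat completion."""
    return {
        "model": "stepfun-ai/Step-3.7-Flash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def ask(prompt: str, base_url: str = "http://localhost:30000/v1") -> str:
    """Send the prompt to a running server and return the reply text."""
    # Imported here so build_text_request works without the client installed.
    from openai import OpenAI

    client = OpenAI(api_key="EMPTY", base_url=base_url)
    response = client.chat.completions.create(**build_text_request(prompt))
    return response.choices[0].message.content
```

With the server up, calling `ask("Summarize what a Mixture-of-Experts model is in two sentences.")` returns the generated reply as a string.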
4.2 Advanced Usage
4.2.1 Image Input
Step-3.7-Flash supports image inputs alongside text. Here’s a basic example:
import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:30000/v1",
    timeout=3600
)

# One user message: the image first, then the text instruction
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                }
            },
            {
                "type": "text",
                "text": "Read all the text in the image."
            }
        ]
    }
]

start = time.time()
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.7-Flash",
    messages=messages,
    max_tokens=2048,
)
print(f"Response time: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
Multi-Image Input Example:
Step-3.7-Flash can process multiple images in a single request for comparison or analysis:
import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:30000/v1",
    timeout=3600
)

# One user message: both images first, then the text instruction
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://www.civitatis.com/f/china/hong-kong/guia/taxi.jpg"
                }
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://cdn.cheapoguides.com/wp-content/uploads/sites/7/2025/05/GettyImages-509614603-1280x600.jpg"
                }
            },
            {
                "type": "text",
                "text": "Compare these two images and describe the differences in 100 words or less."
            }
        ]
    }
]

start = time.time()
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.7-Flash",
    messages=messages,
    max_tokens=2048,
)
print(f"Response time: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
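The two examples above build the same message shape by hand. The sketch below generalizes it to any number of images and also handles local files by encoding them as base64 data URLs. This assumes the server accepts data URLs in the `image_url` field, as the OpenAI vision message format does; `to_image_url` and `build_image_messages` are hypothetical helper names.

```python
import base64
import mimetypes
from pathlib import Path


def to_image_url(source: str) -> str:
    """Return http(s) and data URLs unchanged; encode local files as data URLs."""
    if source.startswith(("http://", "https://", "data:")):
        return source
    path = Path(source)
    mime, _ = mimetypes.guess_type(path.name)
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    # Fall back to a generic MIME type when the extension is unknown.
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"


def build_image_messages(sources: list[str], prompt: str) -> list[dict]:
    """Build a single user message with the images first, then the text prompt."""
    content = [
        {"type": "image_url", "image_url": {"url": to_image_url(src)}}
        for src in sources
    ]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]
```

The returned list can be passed directly as the `messages` argument of `client.chat.completions.create`.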
4.2.2 Reasoning Parser
Step-3.7-Flash supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
sglang serve \
  --model-path stepfun-ai/Step-3.7-Flash \
  --tp 4 \
  --trust-remote-code \
  --reasoning-parser step3p5
Then send a streaming request and print the thinking and content sections separately:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.7-Flash",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.7,
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            # Close thinking section and add content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()
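When the thinking and answer text are needed as strings rather than printed output, the same loop can accumulate them instead. The sketch below is one way to do that; `collect_stream` and `fake_chunk` are hypothetical names, and `fake_chunk` only imitates the attributes the streaming loop reads so the helper can be tried without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> tuple[str, str]:
    """Accumulate streamed deltas into (reasoning_text, answer_text)."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(reasoning=None, content=None):
    """Hypothetical stand-in for a streamed chunk, for offline testing only."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    demo = [
        fake_chunk(reasoning="15% is 0.15; "),
        fake_chunk(reasoning="0.15 * 240 = 36"),
        fake_chunk(content="The answer is 36."),
    ]
    reasoning, answer = collect_stream(demo)
    print("Thinking:", reasoning)
    print("Answer:", answer)
```

Against a live server, pass the `response` object from a `stream=True` request to `collect_stream` in place of the fake chunks.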
4.2.3 Tool Calling
Step-3.7-Flash supports tool calling. Enable the tool call parser when starting the SGLang server:
sglang serve \
  --model-path stepfun-ai/Step-3.7-Flash \
  --tp 4 \
  --trust-remote-code \
  --reasoning-parser step3p5 \
  --tool-call-parser step3p5
Then send a request with tool definitions, run the requested tool locally, and return its result to the model:
from openai import OpenAI
import json

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# 1. Define the tools exposed to the model
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature unit"}
                },
                "required": ["location"]
            }
        }
    }
]

# 2. Local implementation of the tool (returns a fixed dummy value)
def get_weather(location, unit="celsius"):
    return f"The weather in {location} is 22 {unit[0].upper()} and sunny."

# 3. Send the first request
print("--- Sending first request ---")
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.7-Flash",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    temperature=1.0,
    stream=False
)
message = response.choices[0].message

# 4. Handle reasoning content
reasoning = getattr(message, 'reasoning_content', None)
if reasoning:
    print("=============== Thinking =================")
    print(reasoning)
    print("==========================================")

# 5. Handle tool calls
if message.tool_calls:
    print("\nTool Calls detected:")
    # Conversation so far: the user question plus the assistant's tool-call message
    history_messages = [
        {"role": "user", "content": "What's the weather in Beijing?"},
        message
    ]
    for tool_call in message.tool_calls:
        print(f"  Tool: {tool_call.function.name}")
        print(f"  Args: {tool_call.function.arguments}")
        args = json.loads(tool_call.function.arguments)
        tool_result = get_weather(args.get("location"), args.get("unit", "celsius"))
        history_messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": tool_result
        })

    # 6. Send the tool results back for the final answer
    print("\n--- Sending tool results ---")
    final_response = client.chat.completions.create(
        model="stepfun-ai/Step-3.7-Flash",
        messages=history_messages,
        temperature=1.0,
        stream=False
    )
    print("=============== Final Content =================")
    print(final_response.choices[0].message.content)
else:
    if message.content:
        print("=============== Content =================")
        print(message.content)
Note:
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
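When more than one tool is exposed, the execution step is often written as a lookup from tool name to function. The following is a minimal sketch of that pattern, assuming tool calls have the `id`, `function.name`, and `function.arguments` attributes used in the example above; `TOOL_REGISTRY` and `run_tool_call` are hypothetical names, and reporting errors back as the tool result is one possible design choice rather than required behavior.

```python
import json
from types import SimpleNamespace


def get_weather(location, unit="celsius"):
    return f"The weather in {location} is 22 {unit[0].upper()} and sunny."


# Map tool names exposed to the model onto local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call) -> dict:
    """Execute one tool call and return a `tool` role message for the history."""
    name = tool_call.function.name
    try:
        args = json.loads(tool_call.function.arguments or "{}")
        if name not in TOOL_REGISTRY:
            raise ValueError(f"unknown tool: {name}")
        result = TOOL_REGISTRY[name](**args)
    except (ValueError, TypeError) as exc:
        # Malformed JSON, unknown tool, or wrong arguments: report the
        # failure back to the model instead of crashing the loop.
        result = f"Tool error: {exc}"
    return {"role": "tool", "tool_call_id": tool_call.id, "content": str(result)}


if __name__ == "__main__":
    # Hypothetical stand-in for a tool call returned by the server.
    call = SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(name="get_weather", arguments='{"location": "Beijing"}'),
    )
    print(run_tool_call(call))
```

Each returned dictionary can be appended to the message history before sending the follow-up request.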
5. Benchmark
Benchmark results will be added soon.