Step-3.7-Flash (new) - SGLang Documentation

1. Model Introduction

Step-3.7-Flash is a 198B-parameter Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and supports a 256k context window with three selectable reasoning levels (low, medium, and high). The model is available in multiple precision and quantization formats (BF16, FP8, NVFP4). Step-3.7-Flash is built for developers who need to scale agentic workflows that combine perception, search, and reasoning. Typical uses range from parsing massive financial reports in one pass, to running multi-step search loops with cross-source verification, to operating concurrent coding agents in high-throughput pipelines.

2. SGLang Installation

Step-3.7-Flash is currently available in SGLang through a Docker image.

Docker (NVIDIA)

Command

# Pull the docker image
docker pull lmsysorg/sglang:dev-step-3.7-flash

# Launch the container
docker run -it --gpus all \
  --shm-size=32g \
  --ipc=host \
  --network=host \
  lmsysorg/sglang:dev-step-3.7-flash bash

3. Model Deployment

This section provides deployment configurations optimized for different use cases.

3.1 Basic Configuration

The Step-3.7-Flash series comes in one size with multiple quantization options. Recommended starting configurations vary depending on hardware.
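As a rough illustration of how the options fit together, the sketch below assembles an `sglang serve` command from the flags that appear elsewhere in this document. The helper `build_serve_command` and its parameter names are hypothetical, not part of SGLang. It also assumes that quantized checkpoints are selected through `model_path`; this document does not list the repository names for the FP8 or NVFP4 variants, so pass the correct path yourself.

```python
import shlex


def build_serve_command(
    model_path="stepfun-ai/Step-3.7-Flash",
    tp=4,
    nvfp4=False,
    reasoning_parser=False,
    tool_call_parser=False,
):
    """Assemble an `sglang serve` argument list (hypothetical helper).

    Only flags that appear in this document are emitted.
    """
    args = [
        "sglang", "serve",
        "--model-path", model_path,
        "--tp", str(tp),
        "--trust-remote-code",  # required for all Step-3.7-Flash variants
    ]
    if nvfp4:
        # Flags listed under "NVFP4 Quantization" in the configuration tips.
        args += [
            "--quantization", "modelopt_fp4",
            "--kv-cache-dtype", "fp8_e4m3",
            "--moe-runner-backend", "flashinfer_trtllm",
        ]
    if reasoning_parser:
        args += ["--reasoning-parser", "step3p5"]
    if tool_call_parser:
        args += ["--tool-call-parser", "step3p5"]
    return args


if __name__ == "__main__":
    # Print a shell-ready command with both parsers enabled.
    print(shlex.join(build_serve_command(reasoning_parser=True, tool_call_parser=True)))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting; `shlex.join` is used only for display.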

3.2 Configuration Tips

Memory: The model requires GPUs with high VRAM capacity. Supported platforms: H200 (4x, TP=4), B200/B300 (4x, TP=4), GB200/GB300 (4x, TP=4).
NVFP4 Quantization: NVFP4 provides the smallest memory footprint of the available formats. It requires --quantization modelopt_fp4 --kv-cache-dtype fp8_e4m3 --moe-runner-backend flashinfer_trtllm.
Trust Remote Code: All Step-3.7-Flash variants require --trust-remote-code due to the custom model architecture.
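To make the memory tip more concrete, here is a back-of-the-envelope estimate of weight memory per format. It is a sketch under simplifying assumptions, not a sizing guide: it assumes every one of the 198B parameters is stored at the nominal width of the format (2 bytes for BF16, 1 byte for FP8, 0.5 bytes for NVFP4), ignores quantization scale factors, and counts weights only, so the KV cache, activations, and runtime overhead come on top. The function name `weight_memory_gb` is illustrative.

```python
# Nominal storage width per parameter, in bytes (assumption: no scale-factor overhead).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}


def weight_memory_gb(num_params, fmt, tp=1):
    """Approximate weight memory per GPU in GB (10^9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[fmt] / tp / 1e9


if __name__ == "__main__":
    total_params = 198e9  # total parameter count stated in the model introduction
    for fmt in BYTES_PER_PARAM:
        total = weight_memory_gb(total_params, fmt)
        per_gpu = weight_memory_gb(total_params, fmt, tp=4)
        print(f"{fmt:>6}: ~{total:.0f} GB total, ~{per_gpu:.1f} GB per GPU at TP=4")
```

Under these assumptions, BF16 weights come to roughly 396 GB in total, or about 99 GB per GPU at TP=4, which is why high-capacity GPUs are required and why the lower-precision formats leave more room for the KV cache.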

4. Model Invocation

4.1 Basic Usage

For basic API usage and request examples, please refer to:

4.2 Advanced Usage

4.2.1 Multi-Modal Inputs

Step-3.7-Flash supports image inputs alongside text. Here's a basic example:

Example

import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:30000/v1",
    timeout=3600
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                }
            },
            {
                "type": "text",
                "text": "Read all the text in the image."
            }
        ]
    }
]

start = time.time()
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.7-Flash",
    messages=messages,
    max_tokens=2048,
)
print(f"Response time: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")

Multi-Image Input Example: Step-3.7-Flash can process multiple images in a single request for comparison or analysis:

Example

import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:30000/v1",
    timeout=3600
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://www.civitatis.com/f/china/hong-kong/guia/taxi.jpg"
                }
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://cdn.cheapoguides.com/wp-content/uploads/sites/7/2025/05/GettyImages-509614603-1280x600.jpg"
                }
            },
            {
                "type": "text",
                "text": "Compare these two images and describe the differences in 100 words or less."
            }
        ]
    }
]

start = time.time()
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.7-Flash",
    messages=messages,
    max_tokens=2048,
)
print(f"Response time: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
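The examples above reference images by public URL. For local files, one common approach with OpenAI-compatible endpoints is to embed the image as a base64 data URL. The sketch below assumes the server accepts `data:` URLs in the `image_url.url` field, which this document does not confirm; the helper names `image_to_data_url` and `build_image_message` are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_image_message(image_urls, prompt):
    """Build one user message: any number of images followed by a text prompt."""
    content = [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}
```

With these helpers, `messages = [build_image_message([image_to_data_url("receipt.png")], "Read all the text in the image.")]` yields the same message structure as the examples above, and remote URLs and data URLs can be mixed freely in the list. The file name `receipt.png` is a placeholder.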

4.2.2 Reasoning Parser

Step-3.7-Flash supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:

Command

sglang serve \
  --model-path stepfun-ai/Step-3.7-Flash \
  --tp 4 \
  --trust-remote-code \
  --reasoning-parser step3p5

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.7-Flash",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.7,
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            # Close thinking section and add content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()
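When the reasoning and the answer are needed as strings rather than printed output, the same stream can be accumulated into two buffers. The following is a minimal sketch; `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only imitate the attributes used in the example above (`choices[0].delta.reasoning_content` and `choices[0].delta.content`) so that the logic can be exercised without a running server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate streamed deltas into (reasoning_text, answer_text)."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(reasoning=None, content=None):
    """Build a stand-in for a streamed chunk (for local testing only)."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_chunks = [
    fake_chunk(reasoning="15% is 0.15. "),
    fake_chunk(reasoning="0.15 * 240 = 36."),
    SimpleNamespace(choices=[]),
    fake_chunk(content="36"),
]
reasoning_text, answer_text = collect_stream(demo_chunks)
```

In real use, pass the `response` object returned by `client.chat.completions.create(..., stream=True)` in place of `demo_chunks`.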

4.2.3 Tool Calling

Step-3.7-Flash supports tool calling. To enable it, start the SGLang server with the tool call parser:

Command

sglang serve \
  --model-path stepfun-ai/Step-3.7-Flash \
  --tp 4 \
  --trust-remote-code \
  --reasoning-parser step3p5 \
  --tool-call-parser step3p5

Example

from openai import OpenAI
import json

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# 1. Define the tools the model is allowed to call
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature unit"}
                },
                "required": ["location"]
            }
        }
    }
]

# 2. Local implementation of the tool (a stub that returns a fixed reading)
def get_weather(location, unit="celsius"):
    return f"The weather in {location} is 22 {unit[0].upper()} and sunny."

# 3. Send the first request with the tool definitions attached
print("--- Sending first request ---")
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.7-Flash",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    temperature=1.0,
    stream=False
)

message = response.choices[0].message

# 4. Handle Reasoning Content
reasoning = getattr(message, 'reasoning_content', None)
if reasoning:
    print("=============== Thinking =================")
    print(reasoning)
    print("==========================================")

# 5. Handle Tool Calls
if message.tool_calls:
    print("\nTool Calls detected:")
    history_messages = [
        {"role": "user", "content": "What's the weather in Beijing?"},
        message
    ]

    for tool_call in message.tool_calls:
        print(f"   Tool: {tool_call.function.name}")
        print(f"   Args: {tool_call.function.arguments}")

        args = json.loads(tool_call.function.arguments)
        tool_result = get_weather(args.get("location"), args.get("unit", "celsius"))

        history_messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": tool_result
        })

    print("\n--- Sending tool results ---")
    final_response = client.chat.completions.create(
        model="stepfun-ai/Step-3.7-Flash",
        messages=history_messages,
        temperature=1.0,
        stream=False
    )

    print("=============== Final Content =================")
    print(final_response.choices[0].message.content)

else:
    if message.content:
        print("=============== Content =================")
        print(message.content)

Note:

The reasoning parser shows how the model decides to use a tool.
Tool calls are clearly marked with the function name and arguments.
You can then execute the function and send the result back to continue the conversation.
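The example above calls `get_weather` directly, which works only while a single tool exists. With several tools, one possible pattern is a registry that maps tool names to Python callables and returns errors to the model instead of raising. The sketch below is illustrative only: `TOOL_REGISTRY` and `run_tool_call` are hypothetical names, and the stub `get_weather` is repeated so that the block runs on its own.

```python
import json
from types import SimpleNamespace


def get_weather(location, unit="celsius"):
    # Stub tool with the same behavior as in the example above.
    return f"The weather in {location} is 22 {unit[0].upper()} and sunny."


# Maps the tool names declared in `tools` to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call, registry=TOOL_REGISTRY):
    """Execute one tool call and return a `tool` role message for the follow-up request."""
    name = tool_call.function.name
    try:
        args = json.loads(tool_call.function.arguments or "{}")
        if name not in registry:
            raise ValueError(f"unknown tool: {name}")
        result = registry[name](**args)
    except (ValueError, TypeError) as exc:
        # Malformed JSON, unknown tool names, and bad arguments are reported
        # back to the model as text rather than aborting the conversation.
        result = f"Tool error: {exc}"
    return {"role": "tool", "tool_call_id": tool_call.id, "content": str(result)}


# Stand-in for an entry of `message.tool_calls`, for local testing only.
demo_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Beijing"}'),
)
demo_message = run_tool_call(demo_call)
```

In the loop over `message.tool_calls`, `history_messages.append(run_tool_call(tool_call))` would then replace the hard-coded call.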

5. Benchmark

Benchmark results will be added soon.
