1. Model Introduction

Step-3.7-Flash is a 198B-parameter Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and supports a 256k context window with three selectable reasoning levels (low, medium, and high). The model is available in multiple quantization formats (BF16, FP8, NVFP4). Step-3.7-Flash is built for developers who need to scale agentic workflows that combine perception, search, and reasoning — from parsing massive financial reports in one pass, to running multi-step search loops with cross-source verification, to operating concurrent coding agents in high-throughput pipelines.
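
As a rough illustration of why the quantization format matters for hardware sizing, the sketch below estimates the memory needed for the weights alone from the 198B parameter count. The bytes-per-parameter values are nominal assumptions (2 for BF16, 1 for FP8, 0.5 for NVFP4). The estimate ignores quantization scale factors, any layers kept in higher precision, the KV cache, and activation memory, so real usage will be higher.

```python
# Nominal storage cost per parameter; these are assumptions, not measured values.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

TOTAL_PARAMS = 198e9  # total parameter count stated above


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Return an approximate weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: ~{weight_memory_gb(TOTAL_PARAMS, fmt):.0f} GB of weights")
```

Divide the result by the tensor-parallel size to get a per-GPU lower bound, then leave headroom for the KV cache at your target context length.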

2. SGLang Installation

Step-3.7-Flash is currently supported in SGLang only through a dedicated Docker image.

Docker (NVIDIA)

Command
# Pull the docker image
docker pull lmsysorg/sglang:dev-step-3.7-flash

# Launch the container
docker run -it --gpus all \
  --shm-size=32g \
  --ipc=host \
  --network=host \
  lmsysorg/sglang:dev-step-3.7-flash bash

3. Model Deployment

This section provides deployment configurations optimized for different use cases.

3.1 Basic Configuration

The Step-3.7-Flash series comes in one size with multiple quantization options. Recommended starting configurations vary by hardware.

Interactive Command Generator: The configuration selector on the documentation page generates the appropriate deployment command for your hardware platform, quantization method, and capabilities.

3.2 Configuration Tips

  • Memory: Requires GPUs with high VRAM capacity. Supported platforms: H200 (4x, TP=4), B200/B300 (4x, TP=4), GB200/GB300 (4x, TP=4).
  • NVFP4 Quantization: NVFP4 provides the smallest memory footprint. Requires --quantization modelopt_fp4 --kv-cache-dtype fp8_e4m3 --moe-runner-backend flashinfer_trtllm.
  • Trust Remote Code: All Step-3.7-Flash variants require --trust-remote-code due to the custom model architecture.
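
The tips above can be combined into a single launch command. The following is a minimal sketch that assembles the argument list from the flags named in this guide; the helper name `build_serve_command` and the NVFP4 checkpoint path are hypothetical placeholders, so substitute the checkpoint you actually use.

```python
import shlex


def build_serve_command(model_path: str, tp: int = 4, nvfp4: bool = False) -> list[str]:
    """Assemble an `sglang serve` argument list from the flags listed in this guide."""
    cmd = [
        "sglang", "serve",
        "--model-path", model_path,
        "--tp", str(tp),
        "--trust-remote-code",  # required by all Step-3.7-Flash variants
    ]
    if nvfp4:
        # Extra flags this guide lists as required for NVFP4 checkpoints.
        cmd += [
            "--quantization", "modelopt_fp4",
            "--kv-cache-dtype", "fp8_e4m3",
            "--moe-runner-backend", "flashinfer_trtllm",
        ]
    return cmd


# BF16 checkpoint with the default tensor-parallel size of 4.
print(shlex.join(build_serve_command("stepfun-ai/Step-3.7-Flash")))

# NVFP4 checkpoint; the path below is a placeholder, not a published model ID.
print(shlex.join(build_serve_command("path/to/Step-3.7-Flash-NVFP4", nvfp4=True)))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.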

4. Model Invocation

4.1 Basic Usage

For basic API usage and request examples, please refer to:

4.2 Advanced Usage

4.2.1 Multi-Modal Inputs

Step-3.7-Flash supports image inputs alongside text. Here’s a basic example:
Example
import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:30000/v1",
    timeout=3600
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                }
            },
            {
                "type": "text",
                "text": "Read all the text in the image."
            }
        ]
    }
]

start = time.time()
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.7-Flash",
    messages=messages,
    max_tokens=2048,
)
print(f"Response time: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
Multi-Image Input Example: Step-3.7-Flash can process multiple images in a single request for comparison or analysis:
Example
import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:30000/v1",
    timeout=3600
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://www.civitatis.com/f/china/hong-kong/guia/taxi.jpg"
                }
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://cdn.cheapoguides.com/wp-content/uploads/sites/7/2025/05/GettyImages-509614603-1280x600.jpg"
                }
            },
            {
                "type": "text",
                "text": "Compare these two images and describe the differences in 100 words or less."
            }
        ]
    }
]

start = time.time()
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.7-Flash",
    messages=messages,
    max_tokens=2048,
)
print(f"Response time: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
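
Both examples above reference images by public URL. For images stored locally, a common approach with OpenAI-compatible APIs is to embed the file as a base64 data URL in the same `image_url` field. The sketch below assumes the server accepts data URLs, which this guide does not state explicitly; the helper names are illustrative only.

```python
import base64
import mimetypes
from pathlib import Path


def bytes_to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def local_image_part(path: str) -> dict:
    """Build an image_url content part from a local image file."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path!r}")
    url = bytes_to_data_url(Path(path).read_bytes(), mime_type)
    return {"type": "image_url", "image_url": {"url": url}}


def build_messages(image_paths: list[str], prompt: str) -> list[dict]:
    """Build a single user message with the images first and the text prompt last."""
    content = [local_image_part(p) for p in image_paths]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]
```

The returned list can be passed as `messages` to `client.chat.completions.create(...)` exactly as in the examples above, for instance `build_messages(["receipt.png"], "Read all the text in the image.")`.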

4.2.2 Reasoning Parser

Step-3.7-Flash supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
Command
sglang serve \
  --model-path stepfun-ai/Step-3.7-Flash \
  --tp 4 \
  --trust-remote-code \
  --reasoning-parser step3p5
Example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.7-Flash",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.7,
    max_tokens=2048,
    stream=True
)

# Process the stream, printing a header before each section
has_thinking = False
has_answer = False

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    # Print the thinking process (present only when the reasoning parser is enabled)
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        if not has_thinking:
            print("=============== Thinking =================", flush=True)
            has_thinking = True
        print(reasoning, end="", flush=True)

    # Print the answer content
    if delta.content:
        # Close the thinking section and add a content header
        if has_thinking and not has_answer:
            print("\n=============== Content =================", flush=True)
            has_answer = True
        print(delta.content, end="", flush=True)

print()
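
When the reasoning and the answer are needed as separate strings rather than printed output, the same stream can be accumulated instead. The following is a minimal sketch: `collect_stream` is a hypothetical helper, and the fake chunks built with `SimpleNamespace` only mimic the shape of the streamed objects so that the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate streamed deltas into (reasoning_text, answer_text)."""
    reasoning_parts, content_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            content_parts.append(content)
    return "".join(reasoning_parts), "".join(content_parts)


def _fake_chunk(reasoning=None, content=None):
    """Build a stand-in for one streamed chunk (for demonstration only)."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_chunks = [
    _fake_chunk(reasoning="15% is 0.15. "),
    _fake_chunk(reasoning="0.15 * 240 = 36."),
    SimpleNamespace(choices=[]),
    _fake_chunk(content="36"),
]
reasoning, answer = collect_stream(demo_chunks)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

With a live server, pass the `response` object returned by `client.chat.completions.create(..., stream=True)` in place of `demo_chunks`.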

4.2.3 Tool Calling

Step-3.7-Flash supports tool calling. Enable the tool call parser when starting the SGLang server:
Command
sglang serve \
  --model-path stepfun-ai/Step-3.7-Flash \
  --tp 4 \
  --trust-remote-code \
  --reasoning-parser step3p5 \
  --tool-call-parser step3p5
Example
from openai import OpenAI
import json

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# 1. Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature unit"}
                },
                "required": ["location"]
            }
        }
    }
]

# 2. Implement the tool locally (returns a mock result for demonstration)
def get_weather(location, unit="celsius"):
    return f"The weather in {location} is 22 {unit[0].upper()} and sunny."

# 3. Send first request
print("--- Sending first request ---")
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.7-Flash",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    temperature=1.0,
    stream=False
)

message = response.choices[0].message

# 4. Handle Reasoning Content
reasoning = getattr(message, 'reasoning_content', None)
if reasoning:
    print("=============== Thinking =================")
    print(reasoning)
    print("==========================================")

# 5. Handle Tool Calls
if message.tool_calls:
    print("\nTool Calls detected:")
    history_messages = [
        {"role": "user", "content": "What's the weather in Beijing?"},
        message
    ]

    for tool_call in message.tool_calls:
        print(f"   Tool: {tool_call.function.name}")
        print(f"   Args: {tool_call.function.arguments}")

        args = json.loads(tool_call.function.arguments)
        tool_result = get_weather(args.get("location"), args.get("unit", "celsius"))

        history_messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": tool_result
        })

    print("\n--- Sending tool results ---")
    final_response = client.chat.completions.create(
        model="stepfun-ai/Step-3.7-Flash",
        messages=history_messages,
        temperature=1.0,
        stream=False
    )

    print("=============== Final Content =================")
    print(final_response.choices[0].message.content)

else:
    if message.content:
        print("=============== Content =================")
        print(message.content)
Note:
  • With the reasoning parser enabled, reasoning_content shows how the model decided to call a tool.
  • Each tool call carries the function name and its arguments as a JSON string.
  • Execute the function yourself and send the result back as a "tool" message to continue the conversation.
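
The example above calls `get_weather` directly. With more than one tool, a name-to-function registry keeps the dispatch step generic and lets errors be reported back to the model instead of raising. The following is a minimal sketch; `TOOL_REGISTRY` and `run_tool_call` are hypothetical names, and the stub tool call only mimics the shape of the objects in `message.tool_calls`.

```python
import json
from types import SimpleNamespace


def get_weather(location, unit="celsius"):
    """Mock tool implementation, same as in the example above."""
    return f"The weather in {location} is 22 {unit[0].upper()} and sunny."


# Map each tool name exposed to the model to its local implementation.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call, registry=TOOL_REGISTRY):
    """Execute one tool call and return the matching "tool" message."""
    name = tool_call.function.name
    func = registry.get(name)
    if func is None:
        content = f"Error: unknown tool '{name}'"
    else:
        try:
            args = json.loads(tool_call.function.arguments or "{}")
            content = func(**args)
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON or arguments that do not match the signature.
            content = f"Error: invalid arguments for '{name}': {exc}"
    return {"role": "tool", "tool_call_id": tool_call.id, "content": content}


# Stand-in for one entry of message.tool_calls (for demonstration only).
stub_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Beijing"}'),
)
print(run_tool_call(stub_call))
```

In the loop over `message.tool_calls`, appending `run_tool_call(tool_call)` to `history_messages` would replace the manual `json.loads` and `get_weather` calls.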

5. Benchmark

Benchmark results will be added soon.