
1. Model Introduction

Step-3.5-Flash is StepFun’s production-grade reasoning model, built to decouple elite intelligence from heavy compute: it cuts attention cost for low-latency, cost-effective long-context inference and is purpose-built for autonomous agents in real-world workflows. The model is available in multiple quantization formats optimized for different hardware platforms. This generation delivers comprehensive upgrades across the board:
  • Hybrid Attention Architecture: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) at a 3:1 ratio with an aggressive 128-token window. This hybrid keeps performance consistent on long documents and large codebases while significantly reducing the computational overhead of standard long-context models.
  • Sparse Mixture-of-Experts: Only 11B of its 196B parameters are active per token.
  • Multi-Layer Multi-Token Prediction (MTP): Equipped with 3-way multi-token prediction (MTP-3), enabling complex, multi-step reasoning chains with immediate responsiveness.
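The figures above can be sanity-checked with a short sketch: a 3:1 SWA:GA ratio means one layer in every block of four uses global attention, and 11B active out of 196B total means roughly 5.6% of the weights fire per token. Note that the 48-layer depth and the exact layer indexing below are illustrative assumptions, not the model's actual configuration; only the 3:1 ratio and the parameter counts come from the description above.

```python
# Illustrative sketch of a 3:1 SWA:GA interleave and MoE sparsity.
# NUM_LAYERS and the "every 4th layer is global" indexing are assumptions
# for illustration; the 3:1 ratio and 11B/196B figures are from the text.

NUM_LAYERS = 48          # assumed depth, for illustration only
SWA_WINDOW = 128         # sliding-window size in tokens (from the text)

# With a 3:1 ratio, one layer in every block of four is global attention.
layer_types = [
    "GA" if (i + 1) % 4 == 0 else "SWA"
    for i in range(NUM_LAYERS)
]
print(layer_types[:8])
print("SWA:GA =", layer_types.count("SWA"), ":", layer_types.count("GA"))

# Sparse MoE: fraction of parameters active per forward pass.
active_frac = 11 / 196
print(f"Active parameters: {active_frac:.1%} of the total")  # ~5.6%
```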

2. SGLang Installation

Step-3.5-Flash is currently available in SGLang via Docker image install.

Docker (NVIDIA)

Command
# Pull the docker image
docker pull lmsysorg/sglang:dev-pr-18084

# Launch the container
docker run -it --gpus all \
  --shm-size=32g \
  --ipc=host \
  --network=host \
  lmsysorg/sglang:dev-pr-18084 bash

Docker (AMD ROCm)

Command
# For MI300X/MI325X
docker pull lmsysorg/sglang:v0.5.9-rocm700-mi30x

# For MI350X/MI355X
docker pull lmsysorg/sglang:v0.5.9-rocm700-mi35x

docker run -it \
  --device=/dev/kfd --device=/dev/dri \
  --shm-size=32g \
  --ipc=host \
  --network=host \
  --group-add video --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  lmsysorg/sglang:v0.5.9-rocm700-mi30x bash  # or mi35x for MI350X/MI355X

3. Model Deployment

This section provides deployment configurations optimized for different hardware platforms and use cases.

3.1 Basic Configuration

The Step-3.5-Flash series comes in a single size. Recommended starting configurations vary by hardware platform; use the deployment command below that matches your hardware, quantization method, and thinking capabilities.

3.2 Configuration Tips

  • Memory: Requires GPUs with high VRAM capacity. Supported platforms: H200 (4×, TP=4), MI300X/MI325X/MI350X/MI355X (4×, TP=4 EP=4).
  • AMD Docker Image: Use lmsysorg/sglang:v0.5.9-rocm700-mi30x for MI300X/MI325X and lmsysorg/sglang:v0.5.9-rocm700-mi35x for MI350X/MI355X.
  • AMD Expert Parallelism Required: On AMD GPUs, always use --ep 4 with --tp 4. Both BF16 and FP8 models require expert parallelism. Without EP, the MoE intermediate dimension is split across GPUs (N=320), which triggers an AITER CK GEMM incompatibility. With EP=4, each GPU handles 72 full experts (N=1280), which works correctly with cuda graph enabled.
  • AITER JIT Compilation: First inference on AMD may take 30-40 seconds for AITER kernel JIT compilation. Subsequent requests use cached kernels.
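The dimension arithmetic behind the EP requirement can be checked directly. The expert count of 288 below is implied by "72 full experts per GPU" at EP=4; the per-expert intermediate dimension of 1280 comes from the tip above.

```python
# Sanity check of the AMD tip above: why EP=4 avoids the N=320 GEMM shape.
NUM_EXPERTS = 288          # implied: 72 experts/GPU x EP=4
MOE_INTERMEDIATE = 1280    # per-expert intermediate dimension (from the tip)
TP = 4
EP = 4

# Without EP: every GPU holds all experts, but each expert's intermediate
# dimension is sharded across the TP group, yielding the problematic shape.
n_without_ep = MOE_INTERMEDIATE // TP
print("Without EP: N =", n_without_ep)        # 320 -> AITER CK GEMM incompat.

# With EP=4: experts are partitioned across GPUs instead, so each GPU
# computes full-width experts.
experts_per_gpu = NUM_EXPERTS // EP
print("With EP=4:", experts_per_gpu, "full experts/GPU, N =", MOE_INTERMEDIATE)
```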

4. Model Invocation

4.1 Basic Usage

For basic API usage and request examples, refer to the SGLang OpenAI-compatible API documentation.
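As a quick reference, the sketch below builds a minimal chat-completion request body for the server's OpenAI-compatible endpoint. The address `localhost:30000` matches the deployment examples elsewhere in this guide; adjust it if you launched the server with a different `--port`.

```python
# Minimal chat-completion request body for the OpenAI-compatible endpoint.
import json

payload = {
    "model": "stepfun-ai/Step-3.5-Flash",
    "messages": [
        {"role": "user", "content": "Hello! What can you do?"}
    ],
    "temperature": 0.7,
    "max_tokens": 512,
}

# POST this to http://localhost:30000/v1/chat/completions, e.g.:
#   curl http://localhost:30000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @body.json
print(json.dumps(payload, indent=2))
```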

4.2 Advanced Usage

4.2.1 Reasoning Parser

Step-3.5-Flash only supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
Command
sglang serve \
  --model-path stepfun-ai/Step-3.5-Flash \
  --tp 4 \
  --ep 4 \
  --reasoning-parser step3p5
Example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.5-Flash",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.7,
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            # Close thinking section and add content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()
Output Example:
Output
=============== Thinking =================
We are asked: "What is 15% of 240?" We need to solve step by step.

Step 1: Understand that "15% of 240" means we need to calculate 15 percent of 240. In mathematical terms, it is (15/100) * 240.

Step 2: Simplify the calculation. We can compute 15% of 240 by first finding 10% of 240 and then 5% of 240, and adding them. Alternatively, we can multiply directly.

Method 1:
10% of 240 = 240 * 0.10 = 24.
5% is half of 10%, so 5% of 240 = 24 / 2 = 12.
Then 15% = 10% + 5% = 24 + 12 = 36.

Method 2: Direct multiplication: 15% = 15/100 = 0.15, so 0.15 * 240 = 36.

We can also compute fractionally: (15/100)*240 = (15*240)/100. 15*240 = 3600, divided by 100 gives 36.

Thus, the answer is 36.

We'll present the solution step by step.

=============== Content =================

To find 15% of 240, follow these steps:

1. **Convert the percentage to a decimal**:
   \( 15\% = \frac{15}{100} = 0.15 \)

2. **Multiply by the number**:
   \( 0.15 \times 240 = 36 \)

Alternatively, break it down:
- \( 10\% \text{ of } 240 = 240 \times 0.10 = 24 \)
- \( 5\% \text{ of } 240 = \frac{24}{2} = 12 \) (since 5% is half of 10%)
- \( 15\% = 10\% + 5\% = 24 + 12 = 36 \)

**Answer:** 36

4.2.2 Tool Calling

Step-3.5-Flash supports tool calling. Enable the tool call parser when launching the server:
Command
sglang serve \
  --model-path stepfun-ai/Step-3.5-Flash \
  --tp 4 \
  --ep 4 \
  --reasoning-parser step3p5 \
  --tool-call-parser step3p5
Example
from openai import OpenAI
import json

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# 1. define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature unit"}
                },
                "required": ["location"]
            }
        }
    }
]

# 2. tool run
def get_weather(location, unit="celsius"):
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# 3. send first request
print("--- Sending first request ---")
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.5-Flash",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    temperature=1.0,
    stream=False
)

message = response.choices[0].message

# 4. Handle Reasoning Content
reasoning = getattr(message, 'reasoning_content', None)
if reasoning:
    print("=============== Thinking =================")
    print(reasoning)
    print("==========================================")

# 5. Handle Tool Calls
if message.tool_calls:
    print("\n🔧 Tool Calls detected:")
    history_messages = [
        {"role": "user", "content": "What's the weather in Beijing?"},
        message
    ]

    for tool_call in message.tool_calls:
        print(f"   Tool: {tool_call.function.name}")
        print(f"   Args: {tool_call.function.arguments}")

        args = json.loads(tool_call.function.arguments)
        tool_result = get_weather(args.get("location"), args.get("unit", "celsius"))

        history_messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": tool_result
        })

    print("\n--- Sending tool results ---")
    final_response = client.chat.completions.create(
        model="stepfun-ai/Step-3.5-Flash",
        messages=history_messages,
        temperature=1.0,
        stream=False
    )

    print("=============== Final Content =================")
    print(final_response.choices[0].message.content)

else:
    if message.content:
        print("=============== Content =================")
        print(message.content)
Output Example:
Output
--- Sending first request ---
=============== Thinking =================
The user is asking for the weather in Beijing. I should use the get_weather function with location="Beijing". The unit parameter is optional and the user didn't specify a preference, so I'll leave it out (the default should be fine).

==========================================

🔧 Tool Calls detected:
   Tool: get_weather
   Args: {"location": "Beijing"}

--- Sending tool results ---
=============== Final Content =================
The weather in Beijing is 22°C and sunny.
Note:
  • The reasoning parser shows how the model decides to use a tool
  • Tool calls are clearly marked with the function name and arguments
  • You can then execute the function and send the result back to continue the conversation

5. Benchmark

5.1 Speed Benchmark

Test Environment:
  • Hardware: NVIDIA H200 GPU (4x)
  • Model: Step-3.5-Flash
  • Tensor Parallelism: 4
  • Expert Parallelism: 4
  • sglang version: 0.5.8
We use SGLang’s built-in benchmarking tool (sglang.bench_serving) to evaluate serving performance. The commands below use the built-in random dataset with roughly 1000-token inputs and outputs, which provides a controlled, reproducible load at each concurrency level.

5.1.1 Standard Scenario Benchmark

  • Model Deployment Command:
Command
sglang serve \
  --model-path stepfun-ai/Step-3.5-Flash \
  --tp 4 \
  --ep 4
5.1.1.1 Low Concurrency
  • Benchmark Command:
Command
python3 -m sglang.bench_serving \
  --backend sglang \
  --model stepfun-ai/Step-3.5-Flash \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 10 \
  --max-concurrency 1
  • Test Results:
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  35.30
Total input tokens:                      6091
Total input text tokens:                 6091
Total generated tokens:                  4220
Total generated tokens (retokenized):    4212
Request throughput (req/s):              0.28
Input token throughput (tok/s):          172.57
Output token throughput (tok/s):         119.56
Peak output token throughput (tok/s):    124.00
Peak concurrent requests:                2
Total token throughput (tok/s):          292.14
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   3527.94
Median E2E Latency (ms):                 2884.72
P90 E2E Latency (ms):                    6350.38
P99 E2E Latency (ms):                    7858.53
---------------Time to First Token----------------
Mean TTFT (ms):                          107.53
Median TTFT (ms):                        80.93
P99 TTFT (ms):                           269.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.12
Median TPOT (ms):                        8.13
P99 TPOT (ms):                           8.14
---------------Inter-Token Latency----------------
Mean ITL (ms):                           8.12
Median ITL (ms):                         8.11
P95 ITL (ms):                            8.61
P99 ITL (ms):                            8.91
Max ITL (ms):                            20.77
==================================================
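As a sanity check on the single-stream numbers above: at concurrency 1, steady-state decode speed is simply the reciprocal of the time per output token, which lines up with the reported peak output throughput.

```python
# Single-stream decode speed implied by the mean TPOT above.
mean_tpot_ms = 8.12                      # from the low-concurrency run
tokens_per_sec = 1000 / mean_tpot_ms
print(f"{tokens_per_sec:.1f} tok/s")     # ~123 tok/s, consistent with the
                                         # reported peak of 124 tok/s
```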
5.1.1.2 Medium Concurrency
  • Benchmark Command:
Command
python3 -m sglang.bench_serving \
  --backend sglang \
  --model stepfun-ai/Step-3.5-Flash \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 80 \
  --max-concurrency 16
  • Test Results:
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     80
Benchmark duration (s):                  54.06
Total input tokens:                      39588
Total input text tokens:                 39588
Total generated tokens:                  40805
Total generated tokens (retokenized):    40479
Request throughput (req/s):              1.48
Input token throughput (tok/s):          732.33
Output token throughput (tok/s):         754.84
Peak output token throughput (tok/s):    928.00
Peak concurrent requests:                21
Total token throughput (tok/s):          1487.17
Concurrency:                             14.06
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   9501.23
Median E2E Latency (ms):                 10010.71
P90 E2E Latency (ms):                    15655.09
P99 E2E Latency (ms):                    18803.63
---------------Time to First Token----------------
Mean TTFT (ms):                          198.34
Median TTFT (ms):                        89.50
P99 TTFT (ms):                           984.66
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.97
Median TPOT (ms):                        18.80
P99 TPOT (ms):                           35.67
---------------Inter-Token Latency----------------
Mean ITL (ms):                           18.27
Median ITL (ms):                         17.48
P95 ITL (ms):                            18.44
P99 ITL (ms):                            62.47
Max ITL (ms):                            460.85
==================================================
5.1.1.3 High Concurrency
  • Benchmark Command:
Command
python3 -m sglang.bench_serving \
  --backend sglang \
  --model stepfun-ai/Step-3.5-Flash \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 500 \
  --max-concurrency 100
  • Test Results:
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     500
Benchmark duration (s):                  125.88
Total input tokens:                      249331
Total input text tokens:                 249331
Total generated tokens:                  252662
Total generated tokens (retokenized):    251323
Request throughput (req/s):              3.97
Input token throughput (tok/s):          1980.77
Output token throughput (tok/s):         2007.23
Peak output token throughput (tok/s):    2500.00
Peak concurrent requests:                109
Total token throughput (tok/s):          3987.99
Concurrency:                             92.25
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   23223.31
Median E2E Latency (ms):                 22631.90
P90 E2E Latency (ms):                    42269.38
P99 E2E Latency (ms):                    47637.53
---------------Time to First Token----------------
Mean TTFT (ms):                          372.13
Median TTFT (ms):                        127.26
P99 TTFT (ms):                           1880.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          46.06
Median TPOT (ms):                        47.61
P99 TPOT (ms):                           51.34
---------------Inter-Token Latency----------------
Mean ITL (ms):                           45.31
Median ITL (ms):                         39.86
P95 ITL (ms):                            72.49
P99 ITL (ms):                            117.05
Max ITL (ms):                            1359.81
==================================================
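Comparing the three runs above, output throughput scales sub-linearly with concurrency, which is the expected trade against per-token latency. A short summary of the reported numbers:

```python
# Output token throughput vs. concurrency, from the three runs above.
runs = {1: 119.56, 16: 754.84, 100: 2007.23}  # concurrency -> tok/s

base_tps = runs[1]
for conc, tps in runs.items():
    speedup = tps / base_tps
    efficiency = speedup / conc            # throughput gain per added stream
    print(f"concurrency {conc:>3}: {tps:>8.2f} tok/s, "
          f"{speedup:5.2f}x, scaling efficiency {efficiency:.0%}")
```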

5.2 Accuracy Benchmark

5.2.1 GSM8K Benchmark

  • Benchmark Command:
Command
python3 -m sglang.test.few_shot_gsm8k --num-questions 200
  • Results:
    • Step-3.5-Flash
      Accuracy: 0.885
      Invalid: 0.005
      Latency: 9.986 s
      Output throughput: 1972.911 token/s
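For context, GSM8K accuracy is exact match on the final numeric answer, and "Invalid" counts completions from which no answer could be parsed. The sketch below is a hedged illustration of that scoring convention; the extraction regex here is an assumption, and the actual logic lives in sglang.test.few_shot_gsm8k.

```python
# Illustrative GSM8K-style scoring: take the last number in each completion
# as the predicted answer. The regex is an assumption for illustration; the
# real implementation is in sglang.test.few_shot_gsm8k.
import re

def extract_answer(text):
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

# Toy predictions and gold answers, for illustration only.
preds = ["The answer is 36.", "So we get 72", "no number here"]
golds = [36.0, 72.0, 5.0]

correct = sum(extract_answer(p) == g for p, g in zip(preds, golds))
invalid = sum(extract_answer(p) is None for p in preds)
print(f"Accuracy: {correct / len(golds):.3f}, Invalid: {invalid / len(golds):.3f}")
```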