1. Model Introduction
Step-3.5-Flash is StepFun’s production-grade reasoning model, built to decouple elite intelligence from heavy compute: it cuts attention cost for low-latency, cost-effective long-context inference and is purpose-built for autonomous agents in real-world workflows. The model is available in multiple quantization formats optimized for different hardware platforms. This generation delivers comprehensive upgrades across the board:
- Hybrid Attention Architecture: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) at a 3:1 ratio with an aggressive 128-token window. This hybrid approach maintains consistent performance across massive datasets and long codebases while significantly reducing the computational overhead typical of standard long-context models.
- Sparse Mixture-of-Experts: Only 11B active parameters out of 196B parameters.
- Multi-Layer Multi-Token Prediction (MTP): Equipped with a 3-way Multi-Token Prediction (MTP-3). This allows for complex, multi-step reasoning chains with immediate responsiveness.
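To make the hybrid attention scheme above concrete, here is a minimal sketch of how a 3:1 SWA/GA layer stack constrains which tokens each query can attend to. The function name, the exact layer ordering (three SWA layers then one GA layer), and the mask construction are illustrative assumptions, not the model's actual implementation.

```python
import numpy as np

def attention_mask(seq_len, layer_idx, window=128, swa_per_ga=3):
    """Build a causal attention mask for one layer of a hybrid stack.

    Layers cycle through `swa_per_ga` sliding-window (SWA) layers
    followed by one global-attention (GA) layer, matching the 3:1
    ratio described above; `window` is the 128-token SWA lookback.
    Ordering and names are assumptions for illustration only.
    """
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    if layer_idx % (swa_per_ga + 1) < swa_per_ga:
        # SWA layer: each query sees only the last `window` tokens
        return causal & (i - j < window)
    # GA layer: full causal attention over the whole context
    return causal

# A query at position 200 in an SWA layer cannot see token 0,
# but the GA layer can; this is where long-range recall lives.
swa = attention_mask(256, layer_idx=0)
ga = attention_mask(256, layer_idx=3)
print(swa[200, 0], ga[200, 0])  # False True
```

The compute saving comes from the SWA layers: their key-value range is capped at 128 tokens regardless of context length, so only every fourth layer pays the full quadratic attention cost.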
2. SGLang Installation
Step-3.5-Flash is currently available in SGLang via Docker image install.
Docker (NVIDIA)
Command
Docker (AMD ROCm)
Command
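As a reference point, the Docker-based install typically looks like the following. The AMD image tags are the ones given in the configuration tips in section 3.2; the NVIDIA tag is an assumption — check the SGLang releases for the current image.

```shell
# NVIDIA: pull the SGLang image (tag is illustrative; use the current release)
docker pull lmsysorg/sglang:latest

# AMD ROCm: tags as listed in section 3.2
docker pull lmsysorg/sglang:v0.5.9-rocm700-mi30x   # MI300X / MI325X
docker pull lmsysorg/sglang:v0.5.9-rocm700-mi35x   # MI350X / MI355X
```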
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
The Step-3.5-Flash series comes in only one size. Recommended starting configurations vary depending on hardware.
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model size, quantization method, and thinking capabilities.
3.2 Configuration Tips
- Memory: Requires GPUs with high VRAM capacity. Supported platforms: H200 (4×, TP=4), MI300X/MI325X/MI350X/MI355X (4×, TP=4 EP=4).
- AMD Docker Image: Use lmsysorg/sglang:v0.5.9-rocm700-mi30x for MI300X/MI325X and lmsysorg/sglang:v0.5.9-rocm700-mi35x for MI350X/MI355X.
- AMD Expert Parallelism Required: On AMD GPUs, always use --ep 4 with --tp 4. Both BF16 and FP8 models require expert parallelism. Without EP, the MoE intermediate dimension is split across GPUs (N=320), which triggers an AITER CK GEMM incompatibility. With EP=4, each GPU handles 72 full experts (N=1280), which works correctly with CUDA graph enabled.
- AITER JIT Compilation: The first inference on AMD may take 30-40 seconds for AITER kernel JIT compilation. Subsequent requests use cached kernels.
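Putting the tips above together, a launch command on 4 AMD GPUs might look like the following sketch. The model path is a placeholder assumption; --tp 4 and --ep 4 are the flags named in the tips above.

```shell
# 4-GPU AMD deployment with the required expert parallelism (EP=4)
# Model ID below is a placeholder; substitute your local path or hub ID.
python -m sglang.launch_server \
  --model-path stepfun-ai/Step-3.5-Flash \
  --tp 4 \
  --ep 4 \
  --host 0.0.0.0 --port 30000
```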
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:
4.2 Advanced Usage
4.2.1 Reasoning Parser
Step-3.5-Flash only supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
Command
Example
Output
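To illustrate what the reasoning parser does server-side, here is a minimal local stand-in that splits a raw completion into a (reasoning, content) pair. The think-tag format is an assumption for illustration; in production the server performs this split for you and returns the reasoning separately from the final content.

```python
def split_reasoning(raw: str, open_tag="<think>", close_tag="</think>"):
    """Separate the thinking section from the final answer.

    Mimics what a server-side reasoning parser produces: a
    (reasoning, content) pair. Tag names are assumptions.
    """
    if open_tag in raw and close_tag in raw:
        start = raw.index(open_tag) + len(open_tag)
        end = raw.index(close_tag)
        reasoning = raw[start:end].strip()
        content = raw[end + len(close_tag):].strip()
        return reasoning, content
    # No thinking section found: everything is content
    return "", raw.strip()

reasoning, content = split_reasoning(
    "<think>2+2 is 4 because of basic addition.</think>The answer is 4."
)
print(reasoning)  # 2+2 is 4 because of basic addition.
print(content)    # The answer is 4.
```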
4.2.2 Tool Calling
Step-3.5 supports tool calling capabilities. Enable the tool call parser.
Python Example: Start the sglang server:
Command
Example
Output
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
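The execute-and-return loop described in the notes above can be sketched locally. The tool name get_weather, its behavior, and the OpenAI-style name/arguments dict shape are illustrative assumptions.

```python
import json

# Hypothetical local tool the model may ask us to call
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def run_tool_call(tool_call: dict) -> str:
    """Dispatch one parsed tool call (OpenAI-style dict with a
    function name and JSON-encoded arguments) and return its result,
    ready to be sent back to the model as a tool message."""
    fn = TOOLS[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return fn(**args)

# A tool call in the shape a tool-call parser typically emits
call = {"name": "get_weather", "arguments": '{"city": "Paris"}'}
result = run_tool_call(call)
print(result)  # Sunny in Paris
# Next step: append the result as a tool-role message and re-invoke
# the model so it can continue the conversation with the answer.
```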
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: NVIDIA H200 GPU (4x)
- Model: Step-3.5-Flash
- Tensor Parallelism: 4
- Expert Parallelism: 4
- sglang version: 0.5.8
5.1.1 Standard Scenario Benchmark
- Model Deployment Command:
Command
5.1.1.1 Low Concurrency
- Benchmark Command:
Command
- Test Results:
Output
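SGLang ships a serving benchmark module that is commonly used for runs like the one above. The following is a sketch only: the dataset choice, prompt counts, lengths, and concurrency value are illustrative assumptions, not the command used to produce the results here.

```shell
# Low-concurrency serving benchmark against a running server
# (all numeric values are illustrative)
python -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --num-prompts 32 \
  --max-concurrency 1 \
  --random-input-len 1024 --random-output-len 512
```

The medium- and high-concurrency runs below typically vary only --num-prompts and --max-concurrency.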
5.1.1.2 Medium Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.1.1.3 High Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Benchmark Command:
Command
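SGLang includes a few-shot GSM8K accuracy test that can be pointed at a running server. A hedged sketch of such a run follows; the question count and endpoint are illustrative assumptions.

```shell
# GSM8K few-shot accuracy check against a running server (values illustrative)
python -m sglang.test.few_shot_gsm8k \
  --num-questions 200 \
  --host http://127.0.0.1 --port 30000
```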
- Results:
- Step-3.5-Flash
