
1. Model Introduction

Hunyuan 3 Preview (Hy3-preview) is Tencent’s preview of its third-generation flagship MoE language model, featuring hybrid thinking, native tool calling, long-context reasoning, and Multi-Token Prediction (MTP) for low-latency serving. Key Features:
  • MoE Architecture: 192 routed experts + 1 shared expert, 8 experts activated per token. ~276B total parameters with ~20B active, delivering dense-model quality at MoE inference cost.
  • Hybrid Thinking: Reasoning modes (high, medium, low, none) controllable via OpenAI-standard reasoning_effort, allowing the same weights to trade off latency and depth of reasoning.
  • Native Tool Calling: Trained on structured <tool_call> / <arg_key> / <arg_value> grammar. Pairs with SGLang’s hunyuan tool-call parser for streaming OpenAI-compatible function-calling output.
  • Long Context: 256K token context window (262,144 positions) for repository-scale code and document reasoning.
  • Multi-Token Prediction (MTP): Ships with a built-in MTP draft module enabling speculative decoding out of the box.
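In practice, SGLang's hunyuan parser converts the tool-call grammar into OpenAI-style tool_calls server-side, so clients never see the raw tags. Purely to illustrate why a dedicated parser exists, here is a sketch that extracts a tool name and key/value pairs from a hypothetical serialization built from the tag names above (the exact on-the-wire format is an assumption, not the model's actual grammar):

```python
import re

def parse_tool_call(text: str) -> dict:
    """Extract a tool name and arguments from a hypothetical
    <tool_call>/<arg_key>/<arg_value> serialization. Illustrative only;
    SGLang's `hunyuan` parser handles the real format server-side."""
    m = re.search(r"<tool_call>(.*?)</tool_call>", text, re.S)
    if not m:
        return {}
    body = m.group(1)
    # Tool name: everything before the first <arg_key> tag.
    name = body.split("<arg_key>", 1)[0].strip()
    keys = re.findall(r"<arg_key>(.*?)</arg_key>", body, re.S)
    vals = re.findall(r"<arg_value>(.*?)</arg_value>", body, re.S)
    return {"name": name, "arguments": dict(zip(keys, vals))}

demo = ("<tool_call>get_weather"
        "<arg_key>city</arg_key><arg_value>Beijing</arg_value>"
        "<arg_key>unit</arg_key><arg_value>fahrenheit</arg_value>"
        "</tool_call>")
print(parse_tool_call(demo))
```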
Available Models: tencent/Hy3-preview (BF16). Recommended Generation Parameters:

| Parameter | Value |
| --- | --- |
| temperature | 0.7 |
| top_p | 0.9 |
| reasoning_effort | high / medium / low (thinking) or none (instant) |
License: TODO — verify on HuggingFace model card.
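The recommended parameters above apply to any OpenAI-compatible request. A minimal helper sketch that bundles them into chat.completions.create kwargs (the helper name and defaults are illustrative, not part of the model or SGLang APIs):

```python
VALID_EFFORTS = {"high", "medium", "low", "none"}

def hy3_request_kwargs(messages, reasoning_effort="none", max_tokens=1024):
    """Build chat.completions.create kwargs using the recommended
    Hy3-preview sampling parameters (temperature 0.7, top_p 0.9)."""
    if reasoning_effort not in VALID_EFFORTS:
        raise ValueError(f"reasoning_effort must be one of {sorted(VALID_EFFORTS)}")
    return {
        "model": "tencent/Hy3-preview",
        "messages": messages,
        "temperature": 0.7,
        "top_p": 0.9,
        "reasoning_effort": reasoning_effort,
        "max_tokens": max_tokens,
    }

kwargs = hy3_request_kwargs([{"role": "user", "content": "Hi"}], "high")
```

Pass the result as `client.chat.completions.create(**kwargs)` against a running server.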

2. SGLang Installation

SGLang supports several installation methods; choose the one that best fits your hardware platform and requirements, following the official SGLang installation guide. Docker Images by Hardware Platform:
| Hardware Platform | Docker Image |
| --- | --- |
| NVIDIA H200 / B200 | lmsysorg/sglang:hy3-preview |
| NVIDIA B300 / GB300 | lmsysorg/sglang:hy3-preview-cu130 |
The hy3-preview tag bundles the HYV3 model code, the hunyuan tool-call / reasoning parsers, and the MTP draft-module runtime.
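A typical containerized launch on H200/B200 might look like the sketch below; the port mapping, shared-memory size, and HuggingFace cache mount are local choices (assumptions for illustration), not requirements of the image:

```shell
# Illustrative: run the hy3-preview image and serve the model directly.
docker run --gpus all --shm-size 32g -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:hy3-preview \
  sglang serve --model-path tencent/Hy3-preview --tp 8 \
    --reasoning-parser hunyuan --tool-call-parser hunyuan \
    --trust-remote-code --mem-fraction-static 0.9
```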

3. Model Deployment

This section provides deployment configurations optimized for different hardware platforms and use cases.

3.1 Basic Configuration

Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization, and feature capabilities.

3.2 Configuration Tips

Key Parameters:
| Parameter | Description | Recommended Value |
| --- | --- | --- |
| --tool-call-parser | Tool-call parser for function-calling support | hunyuan |
| --reasoning-parser | Reasoning parser for hybrid thinking modes | hunyuan |
| --trust-remote-code | Required for Hunyuan model loading | Always enabled |
| --mem-fraction-static | Static memory fraction (KV cache + activations) | 0.9 |
| --tp | Tensor-parallelism size | 2 / 4 / 8 depending on hardware |
| --attention-backend | Attention backend (Blackwell only) | trtllm_mha |
| --speculative-algorithm | Speculative decoding via the bundled MTP draft | EAGLE, with --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 (set env SGLANG_ENABLE_SPEC_V2=1) |
Hardware Requirements (BF16 Hy3-preview, ~552GB of weights):
  • H200 (141GB) / B200 (180GB): TP=8 (the minimum for BF16 to fit on a single node).
  • B300 (275GB) / GB300: TP=4.
  • A100 / H100 (80GB): not supported on a single node; BF16 requires multi-node TP=16+ on 80GB-class GPUs.
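The TP recommendations follow from dividing the ~552GB of BF16 weights across GPUs. The quick arithmetic below ignores activation overhead, so it is a lower bound on per-GPU usage; note how 80GB-class GPUs are left with only about 11GB of headroom at TP=8, far too little KV-cache room for a 256K context, which is why they need multi-node TP:

```python
WEIGHTS_GB = 552  # approximate BF16 weight footprint (from above)

def per_gpu_weight_gb(tp: int) -> float:
    """Weight memory per GPU under tensor parallelism (overhead ignored)."""
    return WEIGHTS_GB / tp

for gpu, mem_gb, tp in [("H200", 141, 8), ("B300", 275, 4), ("H100", 80, 8)]:
    need = per_gpu_weight_gb(tp)
    print(f"{gpu} TP={tp}: {need:.0f} GB weights/GPU, "
          f"{mem_gb - need:.0f} GB left for KV cache + activations")
```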
Blackwell (B200 / B300 / GB300): The auto-selected attention backend can mis-route for HYV3 on Blackwell, so always pass --attention-backend trtllm_mha explicitly on Blackwell hardware (the config generator above enforces this).

Multi-Token Prediction (MTP): The Hy3-preview release bundles an MTP draft module, which SGLang runs via its EAGLE speculative-decoding path; the draft module auto-loads from the same --model-path. Enable it with the SGLANG_ENABLE_SPEC_V2=1 env var and the standard MTP flags:
Command
SGLANG_ENABLE_SPEC_V2=1 sglang serve \
  --model-path tencent/Hy3-preview \
  --tp 8 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --reasoning-parser hunyuan \
  --tool-call-parser hunyuan \
  --trust-remote-code \
  --mem-fraction-static 0.85
Toggle the “Speculative Decoding (MTP)” option in the generator above to add these flags automatically. Tune num-steps / num-draft-tokens based on acceptance rate in your workload.
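Tuning can be guided by a common back-of-envelope model of speculative decoding: if each draft token in a chain is accepted with probability a, a draft of length k yields on average (1 - a^(k+1)) / (1 - a) tokens per target-model forward pass (the verifier always contributes one token itself). This assumes independent acceptances, which real workloads only approximate:

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model forward pass when a chain
    of `draft_len` draft tokens is verified, each accepted independently
    with probability `accept_rate`; the verifier adds one token itself."""
    a = accept_rate
    if a >= 1.0:
        return float(draft_len + 1)
    return (1 - a ** (draft_len + 1)) / (1 - a)

for a in (0.6, 0.7, 0.8, 0.9):
    print(f"accept={a:.1f}, draft_len=4 -> "
          f"{expected_tokens_per_step(a, 4):.2f} tokens/step")
```

With --speculative-num-draft-tokens 4, an acceptance rate around 0.8 averages roughly 3.4 tokens per target forward pass; lengthening the draft pays off only while acceptance stays high, since rejected draft tokens are wasted work.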

4. Model Invocation

4.1 Basic Usage

For basic API usage and request examples, start with the deployment and test commands below. Deployment Command (H200 × 8, BF16 default):
Command
sglang serve \
  --model-path tencent/Hy3-preview \
  --tp 8 \
  --reasoning-parser hunyuan \
  --tool-call-parser hunyuan \
  --trust-remote-code \
  --mem-fraction-static 0.9
Testing Deployment: After startup, you can test the SGLang OpenAI-compatible API with the following command:
Command
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tencent/Hy3-preview",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
Simple Completion Example:
Example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="tencent/Hy3-preview",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    max_tokens=1024
)

print("Reasoning:", response.choices[0].message.reasoning_content)
print("Content:  ", response.choices[0].message.content)
Output Example:
Output
Reasoning: None
Content:   The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in six games (4-2). This was the Dodgers' first World Series championship since 1988. The series was notable for being played in a neutral-site bubble at Globe Life Field in Arlington, Texas, due to the COVID-19 pandemic.
When reasoning_effort is not set, the server defaults to instant mode (no thinking, reasoning_content=None). To opt into thinking, pass reasoning_effort="high" / "medium" / "low" on the request — see the Hybrid Thinking section below.
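Because reasoning_content is None in instant mode, client code that renders both fields should guard for its absence. A small sketch, using SimpleNamespace stubs in place of the real response.choices[0].message objects:

```python
from types import SimpleNamespace

def show_message(msg) -> str:
    """Render a chat message, including the thinking section only when
    the server returned one (instant mode leaves reasoning_content None)."""
    parts = []
    if getattr(msg, "reasoning_content", None):
        parts.append(f"Reasoning: {msg.reasoning_content}")
    parts.append(f"Content: {msg.content}")
    return "\n".join(parts)

# Stubs standing in for response.choices[0].message:
instant = SimpleNamespace(reasoning_content=None, content="36")
thinking = SimpleNamespace(reasoning_content="10% of 240 is 24...", content="36")
print(show_message(instant))
print(show_message(thinking))
```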

4.2 Advanced Usage

4.2.1 Reasoning Parser (Hybrid Thinking)

Hy3-preview is a hybrid-thinking model. Control the thinking budget via the OpenAI-standard reasoning_effort:
  • high / medium / low — increasing amounts of chain-of-thought in reasoning_content
  • none — skip thinking entirely (instant responses, content-only)
Enable the reasoning parser during deployment so that the thinking section (<think>...</think>) is separated into reasoning_content:
Command
sglang serve \
  --model-path tencent/Hy3-preview \
  --tp 8 \
  --reasoning-parser hunyuan \
  --trust-remote-code \
  --mem-fraction-static 0.9
Thinking Mode — High Effort:
Example
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tencent/Hy3-preview",
    messages=[{"role": "user", "content": "Solve step by step: What is 15% of 240?"}],
    reasoning_effort="high",
    max_tokens=2048,
)

msg = response.choices[0].message
print("=============== Thinking =================")
print(msg.reasoning_content)
print("=============== Content =================")
print(msg.content)
Output Example:
Output
=============== Thinking =================
We need to solve: "What is 15% of 240?" Step by step. So we need to compute 15% of 240. The process: 15% means 15 per hundred, i.e., 15/100 = 0.15. Multiply 0.15 by 240. Or we can do: 10% of 240 = 24, 5% is half of 10% = 12, so sum = 36. Or do multiplication: 15/100 * 240 = (15*240)/100 = (3600)/100 = 36. So answer is 36.

We need to produce step-by-step explanation. The instruction: "Solve step by step: What is 15% of 240?" So we should provide a clear solution with steps. The final answer: 36. Also maybe include units? No units.

We'll output the solution in a clear manner.
=============== Content =================
To find 15% of 240, follow these steps:

1. **Understand that percent means "per hundred."**
   So, 15% = 15/100 or 0.15.

2. **Multiply the number (240) by the percentage in decimal form.**
   0.15 × 240.

   Alternatively, you can use fractions:
   (15/100) × 240.

3. **Perform the multiplication.**
   0.15 × 240 = 36.
   Or:
   (15 × 240) / 100 = 3600 / 100 = 36.

4. **Check using an alternative method:**
   - 10% of 240 = 24.
   - 5% of 240 = half of 10% = 12.
   - 15% = 10% + 5% = 24 + 12 = 36.

Thus, **15% of 240 is 36**.
Instant Mode — No Thinking:
Example
response = client.chat.completions.create(
    model="tencent/Hy3-preview",
    messages=[{"role": "user", "content": "Give me a one-line summary of relativity."}],
    reasoning_effort="none",
    max_tokens=256,
)

print("Content:", response.choices[0].message.content)
Output Example:
Output
Content: Relativity is Einstein's theory that space, time, mass, and gravity are interconnected and relative, not fixed, fundamentally changing our understanding of the universe.

4.2.2 Tool Calling

Hy3-preview supports streaming OpenAI-compatible tool calls. Enable both parsers together — the reasoning parser strips thinking tokens before the tool-call parser runs:
Command
sglang serve \
  --model-path tencent/Hy3-preview \
  --tp 8 \
  --reasoning-parser hunyuan \
  --tool-call-parser hunyuan \
  --trust-remote-code \
  --mem-fraction-static 0.9
Non-Streaming Example:
Example
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="tencent/Hy3-preview",
    messages=[{"role": "user", "content": "What's the weather in Beijing? Use fahrenheit."}],
    tools=tools,
)

msg = response.choices[0].message
print("Reasoning:", msg.reasoning_content)
print("Content:  ", msg.content)
for tc in msg.tool_calls or []:
    print(f"Tool Call: {tc.function.name}")
    print(f"  Arguments: {tc.function.arguments}")
Output Example:
Output
Reasoning: None
Content:   I'll get the current weather for Beijing in Fahrenheit for you.
Tool Call: get_weather
  Arguments: {"city": "Beijing", "unit": "fahrenheit"}
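After receiving tool_calls, the client executes the function locally and returns the result as a tool-role message, then calls the API again; this is the standard OpenAI tool loop, not anything Hunyuan-specific. A minimal dispatch sketch (get_weather here is a local stub returning canned data):

```python
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    """Local stub standing in for a real weather backend."""
    return {"city": city, "unit": unit, "temp": 77}

DISPATCH = {"get_weather": get_weather}

def run_tool_call(tc) -> dict:
    """Execute one tool call and wrap the result as a tool-role message."""
    args = json.loads(tc.function.arguments)  # arguments arrive as a JSON string
    result = DISPATCH[tc.function.name](**args)
    return {"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)}
```

Append the assistant message and each `run_tool_call(tc)` result to `messages`, then issue a second `client.chat.completions.create` call to get the final answer.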
Streaming Example (incremental argument deltas): Hy3-preview’s hunyuan tool-call parser emits tool names first, then argument JSON in incremental fragments — matching the OpenAI streaming contract:
Example
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Reuses the `tools` list defined in the non-streaming example above.

stream = client.chat.completions.create(
    model="tencent/Hy3-preview",
    messages=[{"role": "user", "content": "What's the weather in Beijing? Use fahrenheit."}],
    tools=tools,
    stream=True,
)

tool_buffer = {}
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
    for tc in delta.tool_calls or []:
        buf = tool_buffer.setdefault(tc.index, {"name": "", "args": ""})
        if tc.function and tc.function.name:
            buf["name"] += tc.function.name
        if tc.function and tc.function.arguments:
            buf["args"] += tc.function.arguments

for idx, buf in tool_buffer.items():
    print(f"\nTool[{idx}] {buf['name']}({buf['args']})")
Output Example:
Output
I'll check the current weather in Beijing for you using Fahrenheit.
Tool[0] get_weather({"city": "Beijing", "unit": "fahrenheit"})
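Because argument JSON arrives in fragments, the accumulated buffer is only valid JSON once the stream has finished; parse it at the end rather than per chunk. Illustrated with simulated deltas:

```python
import json

# Fragments as they might arrive across streaming chunks:
fragments = ['{"city": "Bei', 'jing", "unit": ', '"fahrenheit"}']

buf = ""
for frag in fragments:
    buf += frag            # accumulate; do NOT json.loads per chunk
args = json.loads(buf)     # parse once, after the final chunk
print(args["city"], args["unit"])
```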

5. Benchmark

5.1 Accuracy Benchmark

Test Environment:
  • Hardware: 8× NVIDIA H200 (141GB)
  • Docker Image: lmsysorg/sglang:hy3-preview
  • Model: tencent/Hy3-preview (BF16)
  • Tensor Parallelism: 8
  • SGLang version: latest main

5.1.1 GSM8K

  • Benchmark Method: 5-shot CoT on 200 questions, evaluated via SGLang native backend
  • Benchmark Command:
Command
python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 64
  • Test Results:
Output
TODO — replace with real GSM8K accuracy after benchmark run on Hy3-preview (BF16).

5.1.2 MMLU

  • Benchmark Method: 5-shot, all 57 subjects
  • Benchmark Command:
Command
python3 benchmark/mmlu/bench_sglang.py --nsub 60 --parallel 64
  • Test Results:
Output
TODO — replace with real MMLU accuracy after benchmark run on Hy3-preview (BF16).

5.1.3 Tool-Call Accuracy (MiniMax-Provider-Verifier)

  • Benchmark Tool: MiniMax-Provider-Verifier
  • Metric: function-call schema validity, argument match, and end-to-end response correctness
  • Test Results:
Output
TODO — replace with real tool-call accuracy after benchmark run on Hy3-preview (BF16).

5.2 Speed Benchmark

5.2.1 Low Concurrency

  • Benchmark Command:
Command
python3 -m sglang.bench_serving \
  --backend sglang \
  --model tencent/Hy3-preview \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 10 \
  --max-concurrency 1
  • Test Results:
Output
TODO — replace with real low-concurrency output on Hy3-preview (BF16).

5.2.2 High Concurrency

  • Benchmark Command:
Command
python3 -m sglang.bench_serving \
  --backend sglang \
  --model tencent/Hy3-preview \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 500 \
  --max-concurrency 100
  • Test Results:
Output
TODO — replace with real high-concurrency output on Hy3-preview (BF16).