1. Model Introduction
Hunyuan 3 Preview (Hy3-preview) is Tencent’s preview of its third-generation flagship MoE language model, featuring hybrid thinking, native tool calling, long-context reasoning, and Multi-Token Prediction (MTP) for low-latency serving.

Key Features:
- MoE Architecture: 192 routed experts + 1 shared expert, 8 experts activated per token. ~276B total parameters with ~20B active, delivering dense-model quality at MoE inference cost.
- Hybrid Thinking: Reasoning modes (high, medium, low, none) controllable via the OpenAI-standard reasoning_effort parameter, allowing the same weights to trade off latency and depth of reasoning.
- Native Tool Calling: Trained on a structured <tool_call>/<arg_key>/<arg_value> grammar. Pairs with SGLang’s hunyuan tool-call parser for streaming OpenAI-compatible function-calling output.
- Long Context: 256K-token context window (262,144 positions) for repository-scale code and document reasoning.
- Multi-Token Prediction (MTP): Ships with a built-in MTP draft module enabling speculative decoding out of the box.
Available Checkpoints:
- tencent/Hy3-preview — BF16 instruct
- tencent/Hy3-preview-Base — BF16 base
Recommended Sampling Parameters:

| Parameter | Value |
|---|---|
| temperature | 0.7 |
| top_p | 0.9 |
| reasoning_effort | high / medium / low (thinking) or none (instant) |
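As a quick illustration, the defaults above slot into an OpenAI-style chat request body as follows. This is a sketch only: the served model name depends on your --model-path (the HF repo id is assumed here), and the prompt is made up.

```python
# Sketch: an OpenAI-compatible chat request body using the recommended
# sampling defaults from the table above.
payload = {
    "model": "tencent/Hy3-preview",  # assumed: matches --model-path
    "messages": [
        {"role": "user", "content": "Explain MoE routing in one paragraph."}
    ],
    "temperature": 0.7,            # recommended default
    "top_p": 0.9,                  # recommended default
    "reasoning_effort": "medium",  # high / medium / low, or "none" for instant mode
}
print(payload["temperature"], payload["top_p"], payload["reasoning_effort"])
```

Omitting reasoning_effort entirely gives instant (no-thinking) responses, as described in the invocation section.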
2. SGLang Installation
SGLang offers multiple installation methods; choose the one that suits your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions.

Docker Images by Hardware Platform:

| Hardware Platform | Docker Image |
|---|---|
| NVIDIA H200 / B200 | lmsysorg/sglang:hy3-preview |
| NVIDIA B300 / GB300 | lmsysorg/sglang:hy3-preview-cu130 |
The hy3-preview tag bundles the HYV3 model code, the hunyuan tool-call / reasoning parsers, and the MTP draft-module runtime.
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.

3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization, and feature capabilities.

3.2 Configuration Tips
Key Parameters:

| Parameter | Description | Recommended Value |
|---|---|---|
| --tool-call-parser | Tool call parser for function-calling support | hunyuan |
| --reasoning-parser | Reasoning parser for hybrid thinking modes | hunyuan |
| --trust-remote-code | Required for Hunyuan model loading | Always enabled |
| --mem-fraction-static | Static memory fraction (KV + activations) | 0.9 |
| --tp | Tensor parallelism size | 2 / 4 / 8 depending on hardware |
| --attention-backend | Attention backend (Blackwell only) | trtllm_mha |
| --speculative-algorithm | Speculative decoding via the bundled MTP draft | EAGLE + --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 (set env SGLANG_ENABLE_SPEC_V2=1) |
TP Sizing (BF16 Hy3-preview, ~552GB weights):
- H200 (141GB) / B200 (180GB): TP=8 (minimum for BF16 to fit single-node).
- B300 (275GB) / GB300: TP=4.
- A100 / H100 (80GB): not supported single-node — BF16 requires multi-node TP=16+ on 80GB-class GPUs.
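The TP recommendations follow from simple weight-memory arithmetic. A back-of-the-envelope sketch (real deployments also need headroom for KV cache and activations, which this ignores beyond the subtraction):

```python
# Back-of-the-envelope memory check behind the TP recommendations above.
TOTAL_PARAMS_BILLION = 276   # ~276B total parameters
BYTES_PER_PARAM = 2          # BF16
weights_gb = TOTAL_PARAMS_BILLION * BYTES_PER_PARAM  # ~552 GB of weights

# (GPU, memory in GB, single-node TP degree)
for gpu, mem_gb, tp in [("H200", 141, 8), ("B300", 275, 4), ("H100", 80, 8)]:
    per_gpu = weights_gb / tp
    headroom = mem_gb - per_gpu  # what is left for KV cache + activations
    print(f"{gpu} TP={tp}: {per_gpu:.0f} GB weights/GPU, {headroom:.0f} GB headroom")

# H100 at TP=8 would leave only ~11 GB per GPU for KV cache and activations,
# which is why 80GB-class GPUs need multi-node TP for BF16.
```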
Blackwell note: pass --attention-backend trtllm_mha explicitly on Blackwell hardware (the config generator above enforces this).
Multi-Token Prediction (MTP): The Hy3-preview release bundles an MTP draft module. SGLang runs it via its EAGLE speculative-decoding path — the draft module auto-loads from the same --model-path. Enable with the SGLANG_ENABLE_SPEC_V2=1 env var and the standard MTP flags:
Command
Tune --speculative-num-steps / --speculative-num-draft-tokens based on the acceptance rate in your workload.
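Pulling the documented flags together, here is a sketch of a single-node H200 launch command, assembled in Python so the pieces are easy to inspect. Flag values come from the tables above; verify flag names against your installed SGLang version.

```python
# Sketch: compose an SGLang launch command from the flags documented above.
# Run with SGLANG_ENABLE_SPEC_V2=1 in the environment to enable the MTP path.
flags = {
    "--model-path": "tencent/Hy3-preview",
    "--tp": "8",                          # H200/B200 single-node minimum for BF16
    "--tool-call-parser": "hunyuan",
    "--reasoning-parser": "hunyuan",
    "--mem-fraction-static": "0.9",
    "--speculative-algorithm": "EAGLE",   # drives the bundled MTP draft module
    "--speculative-num-steps": "3",
    "--speculative-eagle-topk": "1",
    "--speculative-num-draft-tokens": "4",
}
cmd = ["python3", "-m", "sglang.launch_server", "--trust-remote-code"]
for flag, value in flags.items():
    cmd += [flag, value]
print("SGLANG_ENABLE_SPEC_V2=1 " + " ".join(cmd))
```

On Blackwell hardware, add --attention-backend trtllm_mha as noted above.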
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:

Deployment Command (H200 × 8, BF16 default):

Command
Command
Example
Output
When reasoning_effort is not set, the server defaults to instant mode (no thinking, reasoning_content=None). To opt into thinking, pass reasoning_effort="high" / "medium" / "low" on the request; see the Hybrid Thinking section below.
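The default-vs-opt-in behavior can be sketched as two request bodies that differ only in reasoning_effort (no network call here; model name and prompt are illustrative):

```python
# Sketch: identical requests, differing only in reasoning_effort.
base = {
    "model": "tencent/Hy3-preview",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}],
}

# No reasoning_effort -> instant mode; the response carries
# reasoning_content=None and only content is populated.
instant = dict(base)

# reasoning_effort set -> the response carries the chain-of-thought
# in reasoning_content alongside the final content.
thinking = dict(base, reasoning_effort="high")

print("instant opts into thinking:", "reasoning_effort" in instant)
print("thinking effort level:", thinking["reasoning_effort"])
```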
4.2 Advanced Usage
4.2.1 Reasoning Parser (Hybrid Thinking)
Hy3-preview is a hybrid-thinking model. Control the thinking budget via the OpenAI-standard reasoning_effort parameter:
- high / medium / low — increasing amounts of chain-of-thought in reasoning_content
- none — skip thinking entirely (instant responses, content-only)

When the reasoning parser is enabled, the thinking block (<think>...</think>) is separated into reasoning_content:
Command
Example
Output
Example
Output
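Under the hood, the separation can be sketched as a split on the <think> block. This is a simplified stand-in for the actual hunyuan reasoning parser, not its implementation:

```python
import re

def split_reasoning(raw: str) -> dict:
    """Separate a <think>...</think> block from the final answer,
    mimicking (in simplified form) what the reasoning parser returns."""
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", raw, re.DOTALL)
    if m:
        return {"reasoning_content": m.group(1).strip(),
                "content": m.group(2).strip()}
    # No thinking block (instant mode): everything is content.
    return {"reasoning_content": None, "content": raw.strip()}

out = split_reasoning("<think>17*24 = 408</think>The answer is 408.")
print(out["reasoning_content"], "|", out["content"])
```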
4.2.2 Tool Calling
Hy3-preview supports streaming OpenAI-compatible tool calls. Enable both parsers together; the reasoning parser strips thinking tokens before the tool-call parser runs:

Command
Example
Output
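A minimal sketch of a tool-calling request body: the tool schema follows the OpenAI function-calling format, and the get_weather tool here is purely illustrative. The hunyuan parser converts the model's <tool_call>/<arg_key>/<arg_value> output into tool_calls objects in this format.

```python
# Sketch: OpenAI-style tools definition attached to a chat request.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool, not part of the model
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
payload = {
    "model": "tencent/Hy3-preview",
    "messages": [{"role": "user", "content": "What's the weather in Shenzhen?"}],
    "tools": tools,
}
print(payload["tools"][0]["function"]["name"])
```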
The hunyuan tool-call parser emits tool names first, then argument JSON in incremental fragments, matching the OpenAI streaming contract:
Example
Output
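Consuming that stream on the client side amounts to accumulating argument fragments per tool-call index. A sketch, where the chunk shapes follow the OpenAI streaming contract but the delta contents are made up:

```python
import json

# Illustrative streamed tool-call deltas: the name arrives first,
# then incremental argument-JSON fragments.
deltas = [
    {"index": 0, "function": {"name": "get_weather", "arguments": ""}},
    {"index": 0, "function": {"arguments": '{"city": "Shen'}},
    {"index": 0, "function": {"arguments": 'zhen"}'}},
]

calls = {}
for d in deltas:
    slot = calls.setdefault(d["index"], {"name": None, "arguments": ""})
    fn = d["function"]
    if fn.get("name"):
        slot["name"] = fn["name"]       # name is set once, on the first delta
    slot["arguments"] += fn.get("arguments", "")  # fragments concatenate

args = json.loads(calls[0]["arguments"])  # valid JSON only once complete
print(calls[0]["name"], args["city"])
```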
5. Benchmark
5.1 Accuracy Benchmark
Test Environment:
- Hardware: 8× NVIDIA H200 (141GB)
- Docker Image: lmsysorg/sglang:hy3-preview
- Model: tencent/Hy3-preview (BF16)
- Tensor Parallelism: 8
- SGLang version: latest main
5.1.1 GSM8K
- Benchmark Method: 5-shot CoT on 200 questions, evaluated via SGLang native backend
- Benchmark Command:
Command
- Test Results:
Output
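The scoring step of a 5-shot CoT evaluation like this reduces to pulling the final number out of each completion and comparing it with the gold answer. A minimal sketch; the extraction regex and sample completions are illustrative, not SGLang's exact harness:

```python
import re

def extract_answer(completion: str):
    """Pull the last number out of a chain-of-thought completion."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return nums[-1] if nums else None

# Made-up completions and gold answers, for illustration only.
preds = ["...so 4 + 5 = 9. The answer is 9.", "The total is 120 dollars."]
golds = ["9", "120"]
acc = sum(extract_answer(p) == g for p, g in zip(preds, golds)) / len(golds)
print(f"accuracy: {acc:.2f}")
```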
5.1.2 MMLU
- Benchmark Method: 5-shot, all 57 subjects
- Benchmark Command:
Command
- Test Results:
Output
5.1.3 Tool-Call Accuracy (MiniMax-Provider-Verifier)
- Benchmark Tool: MiniMax-Provider-Verifier
- Metric: function-call schema validity, argument match, and end-to-end response correctness
- Test Results:
Output
5.2 Speed Benchmark
5.2.1 Low Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.2.2 High Concurrency
- Benchmark Command:
Command
- Test Results:
Output
