1. Model Introduction
Ring-2.6-1T is InclusionAI’s trillion-parameter flagship reasoning model for real-world complex task execution. It targets agent workflows, engineering development, scientific research analysis, enterprise automation, and other long-horizon settings where the model must plan, use tools, recover from intermediate errors, and keep context across multiple steps. Key Features:- Trillion-Scale Reasoning Model:
BailingMoeV2_5ForCausalLMwith abailing_hybridarchitecture, 80 hidden layers, 256 routed experts, 8 selected experts per token, and FP8 compressed-tensors weights. - Agent Execution: Designed for multi-step task decomposition, tool collaboration, context continuation, and long-horizon execution. The model card reports 87.60 on PinchBench, 63.82 on ClawEval, and 95.32 on Tau2-Bench Telecom for the
highsetting. - Reasoning Effort: The model card describes
highandxhighreasoning-effort modes. In SGLang’s OpenAI-compatible chat API, use top-levelreasoning_effort: "high"for production agent workflows. To request the model-cardxhighprompt path, pass it throughchat_template_kwargs.reasoning_effort. - Hybrid Attention: Uses the Bailing hybrid stack with MLA plus Lightning linear attention kernels in SGLang.
- Context Length: Native 128K in the released config. Configure YaRN separately if you need a 256K deployment.
- FP8 (E4M3 compressed-tensors): inclusionAI/Ring-2.6-1T
2. SGLang Installation
Ring-2.6-1T requires recent SGLang builds with Bailing hybrid model support. Start with the latest SGLang Docker image when validating this cookbook:Command
3. Model Deployment
Use the selector below to generate a single-node command for the tested hardware targets.Configuration Tips
--trust-remote-codeis required for the model’s custom Bailing hybrid implementation.- Use
--tp-size 4on a single 4-GPU GB300 node. - Use
--tp-size 8on a single 8-GPU B200 node. - Use
--tp-size 8on a single 8-GPU H200 node. - Use
--mem-fraction-static 0.95on GB300 x4. The model uses about 238.5GB/GPU after loading, so lower values can fail during KV-pool initialization. - Use
--mem-fraction-static 0.8on B200 x8. - Use
--mem-fraction-static 0.95on H200 x8. --model-loader-extra-config '{"enable_multithread_load":"true","num_threads":64}'is recommended because the model has 175 large safetensors shards.- Keep
--tool-call-parser glmenabled by default for OpenAI-compatible tool calls. Ring’s template emits XML<arg_key>/<arg_value>tool calls, which theqwenparser does not convert intomessage.tool_calls. - Keep
--reasoning-parser deepseek-r1enabled by default so<think>...</think>content is split intomessage.reasoning_content.
4. Model Invocation
4.1 Basic Usage
For example, launch the server on a single 4-GPU GB300 node:Command
Command
4.2 Reasoning Effort
Ring-2.6-1T exposes two reasoning-effort levels in the model card:high and xhigh. In SGLang’s OpenAI-compatible chat API, start with top-level reasoning_effort: "high" for agent and production workflows:
Command
xhigh path, pass the template value explicitly:
Command
message.reasoning_content when the model emits <think>...</think> blocks.
4.3 Tool Calling Example
Command
5. Benchmark
5.1 Speed Benchmark
- Hardware: NVIDIA B200 GPU (8x), NVIDIA H200 GPU (8x), and NVIDIA GB300 GPU (4x)
- Model:
inclusionAI/Ring-2.6-1T - Docker image:
lmsysorg/sglang:latest - SGLang version tested:
0.5.11 - Tensor Parallelism: 8 on B200 x8 and H200 x8, 4 on GB300 x4
Command
5.1.1 Latency-Sensitive Benchmark
- Test Command:
Command
- Test Results (B200 x8):
Output
- Test Results (GB300 x4):
Output
- Test Results (H200 x8):
Output
5.1.2 Throughput-Sensitive Benchmark
- Test Command:
Command
- Test Results (B200 x8):
Output
- Test Results (GB300 x4):
Output
- Test Results (H200 x8):
Output
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Benchmark Command:
Command
- Test Results (B200 x8):
Output
- Test Results (GB300 x4):
Output
- Test Results (H200 x8):
Output
