1. Model Introduction

Ring-2.6-1T is InclusionAI’s trillion-parameter flagship reasoning model for complex real-world task execution. It targets agent workflows, engineering development, scientific research analysis, enterprise automation, and other long-horizon settings where the model must plan, use tools, recover from intermediate errors, and keep context across multiple steps.

Key Features:
  • Trillion-Scale Reasoning Model: BailingMoeV2_5ForCausalLM with a bailing_hybrid architecture, 80 hidden layers, 256 routed experts, 8 selected experts per token, and FP8 compressed-tensors weights.
  • Agent Execution: Designed for multi-step task decomposition, tool collaboration, context continuation, and long-horizon execution. The model card reports 87.60 on PinchBench, 63.82 on ClawEval, and 95.32 on Tau2-Bench Telecom in the high reasoning-effort setting.
  • Reasoning Effort: The model card describes high and xhigh reasoning-effort modes. In SGLang’s OpenAI-compatible chat API, use top-level reasoning_effort: "high" for production agent workflows. To request the model-card xhigh prompt path, pass it through chat_template_kwargs.reasoning_effort.
  • Hybrid Attention: Uses the Bailing hybrid stack with MLA plus Lightning linear attention kernels in SGLang.
  • Context Length: Native 128K in the released config. Configure YaRN separately if you need a 256K deployment; a sketch follows at the end of this section.
Available Models: inclusionAI/Ring-2.6-1T (License: MIT)
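For a 256K deployment, one possible path is a YaRN rope_scaling override at launch time. This is a minimal sketch, assuming the model accepts an HF-style yarn rope_scaling block through --json-model-override-args; verify the scaling factor and key names against the model card before deploying:
Command
# Sketch only: the rope_scaling keys and factor are assumptions, not values from the released config.
sglang serve \
  --model-path inclusionAI/Ring-2.6-1T \
  --trust-remote-code \
  --context-length 262144 \
  --json-model-override-args '{"rope_scaling": {"rope_type": "yarn", "factor": 2.0, "original_max_position_embeddings": 131072}}'
The remaining flags from Section 3 still apply.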

2. SGLang Installation

Ring-2.6-1T requires recent SGLang builds with Bailing hybrid model support. Start with the latest SGLang Docker image when validating this cookbook:
Command
docker pull lmsysorg/sglang:latest
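To work inside the container interactively, a typical docker run looks like the following. This is a sketch: the Hugging Face cache mount, shared-memory size, and port mapping are assumptions to adapt to your environment.
Command
# Map the API port and reuse the host's model cache so weights are not re-downloaded.
docker run --gpus all --ipc=host --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -it lmsysorg/sglang:latest bash
Then run the launch command from Section 4.1 inside the container.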
For other installation methods, please refer to the official SGLang installation guide.

3. Model Deployment

The tips below apply to single-node deployments on the tested hardware targets; Section 4.1 shows a complete launch command.

Configuration Tips

  • --trust-remote-code is required for the model’s custom Bailing hybrid implementation.
  • Use --tp-size 4 on a single 4-GPU GB300 node.
  • Use --tp-size 8 on a single 8-GPU B200 node.
  • Use --mem-fraction-static 0.95 on GB300 x4. The model uses about 238.5 GB per GPU after loading, so lower values can fail during KV-pool initialization.
  • Use --mem-fraction-static 0.8 on B200 x8.
  • --model-loader-extra-config '{"enable_multithread_load":"true","num_threads":64}' is recommended because the model has 175 large safetensors shards.
  • Keep --tool-call-parser glm enabled by default for OpenAI-compatible tool calls. Ring’s template emits XML <arg_key>/<arg_value> tool calls, which the qwen parser does not convert into message.tool_calls.
  • Keep --reasoning-parser deepseek-r1 enabled by default so <think>...</think> content is split into message.reasoning_content.

4. Model Invocation

4.1 Basic Usage

For example, launch the server on a single 4-GPU GB300 node:
Command
export PORT=30000

sglang serve \
  --model-path inclusionAI/Ring-2.6-1T \
  --tp-size 4 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port ${PORT} \
  --mem-fraction-static 0.95 \
  --model-loader-extra-config '{"enable_multithread_load":"true","num_threads":64}' \
  --tool-call-parser glm \
  --reasoning-parser deepseek-r1
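Loading a trillion-parameter checkpoint takes several minutes. Before sending requests, you can poll the /health endpoint (also used in Section 5.1) until the server reports ready; a minimal sketch:
Command
# Block until the server is healthy, checking every 10 seconds.
until curl -sf http://localhost:${PORT}/health > /dev/null; do sleep 10; done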
Send a basic chat request:
Command
curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 128
  }'
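Streaming uses the standard OpenAI-compatible stream flag; the -N option keeps curl from buffering the SSE chunks. A minimal sketch:
Command
curl -sN http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 128,
    "stream": true
  }'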

4.2 Reasoning Effort

Ring-2.6-1T exposes two reasoning-effort levels in the model card: high and xhigh. In SGLang’s OpenAI-compatible chat API, start with top-level reasoning_effort: "high" for agent and production workflows:
Command
curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Solve: if 3x + 7 = 52, what is x?"}],
    "reasoning_effort": "high",
    "max_tokens": 512
  }'
For the model-card xhigh path, pass the template value explicitly:
Command
curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Solve: if 3x + 7 = 52, what is x?"}],
    "chat_template_kwargs": {"reasoning_effort": "xhigh"},
    "max_tokens": 512
  }'
With the default deployment command, thinking text is separated into message.reasoning_content when the model emits <think>...</think> blocks.
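To inspect the split, extract both fields with jq (assuming jq is installed):
Command
curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Solve: if 3x + 7 = 52, what is x?"}],
    "reasoning_effort": "high",
    "max_tokens": 512
  }' | jq '{reasoning: .choices[0].message.reasoning_content, answer: .choices[0].message.content}'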

4.3 Tool Calling Example

Command
curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "The city name"}
          },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto",
    "max_tokens": 512
  }'
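When the response contains message.tool_calls, execute the tool yourself and send its result back as a role "tool" message so the model can compose the final answer. The following is a sketch of the second turn: the id call_abc123 is a placeholder for whatever id the first response actually returned, and you should re-send the tools array if further calls may be needed.
Command
curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "What is the weather in Beijing?"},
      {"role": "assistant", "content": "",
       "tool_calls": [{"id": "call_abc123", "type": "function",
         "function": {"name": "get_weather", "arguments": "{\"location\": \"Beijing\"}"}}]},
      {"role": "tool", "tool_call_id": "call_abc123", "content": "Sunny, 25C"}
    ],
    "max_tokens": 256
  }'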
For more API examples, see the SGLang Basic Usage Guide.

5. Benchmark

5.1 Speed Benchmark

  • Hardware: NVIDIA B200 GPU (8x) and NVIDIA GB300 GPU (4x)
  • Model: inclusionAI/Ring-2.6-1T
  • Docker image: lmsysorg/sglang:latest
  • SGLang version tested: 0.5.11
  • Tensor Parallelism: 8 on B200 x8, 4 on GB300 x4
Deploy the server as described in Sections 3 and 4.1, then confirm that it is healthy before running benchmarks:
Command
curl -s http://localhost:${PORT}/health
curl -s http://localhost:${PORT}/v1/models

5.1.1 Latency-Sensitive Benchmark

  • Test Command:
Command
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port ${PORT} \
  --model inclusionAI/Ring-2.6-1T \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf
  • Test Results (B200 x8):
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  207.18
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.05
Input token throughput (tok/s):          29.45
Output token throughput (tok/s):         20.37
Total token throughput (tok/s):          49.82
Mean E2E Latency (ms):                   20715.16
Mean TTFT (ms):                          187.86
Mean TPOT (ms):                          44.65
Mean ITL (ms):                           48.76
==================================================
  • Test Results (GB300 x4):
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  62.21
Total input tokens:                      6101
Total generated tokens:                  4220
Request throughput (req/s):              0.16
Input token throughput (tok/s):          98.07
Output token throughput (tok/s):         67.83
Total token throughput (tok/s):          165.91
Mean E2E Latency (ms):                   6218.57
Mean TTFT (ms):                          233.04
Mean TPOT (ms):                          14.21
Mean ITL (ms):                           14.22
==================================================

5.1.2 Throughput-Sensitive Benchmark

  • Test Command:
Command
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port ${PORT} \
  --model inclusionAI/Ring-2.6-1T \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 100 \
  --max-concurrency 100 \
  --request-rate inf
  • Test Results (B200 x8):
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     100
Benchmark duration (s):                  46.30
Total input tokens:                      50561
Total generated tokens:                  52444
Request throughput (req/s):              2.16
Input token throughput (tok/s):          1092.10
Output token throughput (tok/s):         1132.77
Total token throughput (tok/s):          2224.86
Mean E2E Latency (ms):                   27581.74
Mean TTFT (ms):                          1710.53
Mean TPOT (ms):                          51.27
Mean ITL (ms):                           49.43
==================================================
  • Test Results (GB300 x4):
Output
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     100
Benchmark duration (s):                  55.80
Total input tokens:                      50561
Total generated tokens:                  52444
Request throughput (req/s):              1.79
Input token throughput (tok/s):          906.10
Output token throughput (tok/s):         939.84
Total token throughput (tok/s):          1845.94
Mean E2E Latency (ms):                   33736.85
Mean TTFT (ms):                          2156.40
Mean TPOT (ms):                          63.09
Mean ITL (ms):                           60.33
==================================================

5.2 Accuracy Benchmark

5.2.1 GSM8K Benchmark

  • Benchmark Command:
Command
python3 -m sglang.test.run_eval \
  --eval-name gsm8k \
  --host 127.0.0.1 \
  --port ${PORT} \
  --model auto \
  --num-examples 200 \
  --num-threads 64 \
  --max-tokens 2048 \
  --reasoning-effort high
  • Test Results (B200 x8):
Output
Total latency: 100.378 s
Score: 0.990
Output throughput: 627.401 token/s
  • Test Results (GB300 x4):
Output
Total latency: 98.386 s
Score: 0.990
Output throughput: 621.469 token/s