1. Model Introduction
The Ling-2.6 family from inclusionAI is the next iteration of the Ling instant-model series. Continuing the architectural direction set by Ling-2.5, Ling-2.6 doubles down on inference efficiency, token efficiency, and agent performance, staying competitive with frontier instant models while being faster, leaner, and better suited for production agent workloads.
Key Features:
- Hybrid Linear Attention: A 1:7 MLA + Lightning Linear hybrid built on top of a highly sparse MoE backbone. Compared with same-class SOTA models, Ling-2.6-flash shows up to ~4× higher prefill and decode throughput in long-context scenarios; Ling-2.6-1T is shipped in FP8, so it fits a single GB300 node with `--tp 4`.
- Token Efficiency: Trained with explicit token-efficiency objectives. On the full Artificial Analysis suite, Ling-2.6-flash uses only ~15M output tokens while remaining competitive, a meaningfully stronger intelligence-per-token profile than long-reasoning peers.
- Agentic Capabilities: Refined for tool use, multi-step planning, and long-horizon execution. Reaches SOTA-class results on BFCL-V4, TAU2-bench, SWE-bench Verified, Claw-Eval, and PinchBench, and is validated against Claude Code, Kilo Code, Qwen Code, Hermes Agent, and OpenClaw.
- Long Context: Native 128K, extendable to 256K (Ling-2.6-flash) and 256K → 1M (Ling-2.6-1T via YaRN).
Model variants:
- BF16: inclusionAI/Ling-2.6-flash (104B total / 7.4B active)
- FP8 (E4M3): inclusionAI/Ling-2.6-1T (~1T total)
2. SGLang Installation
SGLang offers multiple installation methods; choose the one that best fits your hardware platform and requirements. Please refer to the official SGLang installation guide for detailed instructions.
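For a standard CUDA setup, a plain pip install of a recent SGLang release is usually enough. The exact extras and version pin depend on your environment, so treat this as a sketch rather than the canonical command:

```bash
# Common installation path (sketch only); see the official guide for ROCm, CPU,
# or container-based installs and for the version recommended for Ling-2.6.
pip install --upgrade pip
pip install "sglang[all]"
```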
3. Model Deployment
3.1 Ling-2.6-flash
Ling-2.6-flash is a 104B-total / 7.4B-active MoE that runs comfortably on a single 4-GPU node. The configuration tips below cover the important flags; a reference launch sketch follows them.
Configuration Tips
- `--trust-remote-code` is required (custom `BailingMoeV2_5ForCausalLM` modeling code).
- `--tp-size 4` is the reference layout. On 4× H20-3e the model reaches ~340 tokens/s decode at TP=4, batch 32.
- Native context is 128K. Enable YaRN (`--json-model-override-args '{"rope_scaling": {"rope_type": "yarn", "factor": 2.0, ...}}'`) to extend to 256K; the launch sketch after this list shows where the override goes.
- `--tool-call-parser qwen25` matches the model's `<tool_call>...</tool_call>` schema.
- The recommended baseline does not include `--reasoning-parser qwen3`. Ling-2.6 is a controllable-reasoning model whose chat template defaults to detailed thinking off; the SGLang `qwen3` reasoning parser, in contrast, assumes default-thinking semantics and would mis-route normal output into `reasoning_content`. Only enable it if you specifically want `<think>...</think>` blocks split out; see §4.3 Thinking Mode.
- MTP (multi-token prediction) is supported. Add `--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --mamba-scheduler-strategy extra_buffer` to enable it; see the model card for the full example.
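A minimal launch sketch assembled from the tips above. The host, port, and placement of the YaRN override are assumptions; the model card has the authoritative command.

```bash
# Reference-style launch for Ling-2.6-flash on a single 4-GPU node (sketch only).
python3 -m sglang.launch_server \
  --model-path inclusionAI/Ling-2.6-flash \
  --trust-remote-code \
  --tp-size 4 \
  --tool-call-parser qwen25 \
  --host 0.0.0.0 \
  --port 30000
# To extend the context to 256K, append the YaRN override from the tips above, e.g.
#   --json-model-override-args '{"rope_scaling": {"rope_type": "yarn", "factor": 2.0, ...}}' --context-length 262144
# For MTP, also append the --speculative-* flags listed in the last tip.
```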
3.2 Ling-2.6-1T
Ling-2.6-1T ships in FP8 (E4M3), so unlike Ling-2.5-1T it fits a single GB300 node with `--tp 4`. On smaller GPUs (H200/B200), a 2-node deployment with `--pp-size 2` is required.
Configuration Tips
- `--trust-remote-code` is required for the custom modeling code.
- `--model-loader-extra-config '{"enable_multithread_load":"true","num_threads":64}'` significantly speeds up the multi-shard FP8 weight load (26 safetensors shards plus an MTP layer).
- Use `--tool-call-parser qwen` for tool calling.
- The recommended baseline does not include `--reasoning-parser qwen3`. Ling-2.6's chat template defaults to detailed thinking off, while SGLang's `qwen3` reasoning parser assumes default-thinking semantics; combining the two requires a per-request workaround for tool calls (see §4.3 Thinking Mode). Only enable `--reasoning-parser qwen3` if you specifically want `<think>...</think>` blocks split into `reasoning_content`.
- For 2-node deployments, set `MASTER_IP`, `PORT`, and `DIST_PORT` consistently across both nodes; a two-node sketch follows this list.
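A hedged two-node sketch for the H200/B200 path. The `NODE_RANK` variable and the `--tp-size 8` value are assumptions layered on top of the `MASTER_IP` / `PORT` / `DIST_PORT` convention above:

```bash
# Run once on each node: NODE_RANK=0 on the master, NODE_RANK=1 on the second node.
# MASTER_IP, PORT, and DIST_PORT must be exported identically on both nodes.
python3 -m sglang.launch_server \
  --model-path inclusionAI/Ling-2.6-1T \
  --trust-remote-code \
  --tp-size 8 --pp-size 2 \
  --nnodes 2 --node-rank "$NODE_RANK" \
  --dist-init-addr "$MASTER_IP:$DIST_PORT" \
  --tool-call-parser qwen \
  --model-loader-extra-config '{"enable_multithread_load":"true","num_threads":64}' \
  --host 0.0.0.0 --port "$PORT"
```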
4. Model Invocation
For example, launch a Ling-2.6-1T server on a single GB300 node:
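The following is a minimal sketch built from the §3.2 tips; the host and port are assumptions, and the model card remains the authoritative reference. The invocation examples below assume this server is reachable at http://localhost:30000.

```bash
# Single GB300 node, FP8 weights, 4-way tensor parallelism (sketch only).
python3 -m sglang.launch_server \
  --model-path inclusionAI/Ling-2.6-1T \
  --trust-remote-code \
  --tp-size 4 \
  --tool-call-parser qwen \
  --model-loader-extra-config '{"enable_multithread_load":"true","num_threads":64}' \
  --host 0.0.0.0 --port 30000
```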
4.1 Basic Usage
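A minimal OpenAI-compatible chat request against the server started above; the prompt, model name, and port are placeholders:

```bash
# Basic chat completion via the OpenAI-compatible endpoint.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inclusionAI/Ling-2.6-1T",
    "messages": [
      {"role": "user", "content": "Give a one-sentence summary of what SGLang does."}
    ],
    "max_tokens": 128
  }'
```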
4.2 Tool Calling Example
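A hedged tool-calling sketch; the `get_weather` tool is illustrative only. With `--tool-call-parser qwen`, the parsed call is returned in `choices[0].message.tool_calls`:

```bash
# Tool-calling request; the server parses the model's <tool_call> block into tool_calls.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inclusionAI/Ling-2.6-1T",
    "messages": [
      {"role": "user", "content": "What is the weather in Hangzhou today?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
          }
        }
      }
    ]
  }'
```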
4.3 Thinking Mode
Both Ling-2.6-flash and Ling-2.6-1T are controllable-reasoning models. Their chat template uses textual directives in the system message, `detailed thinking on` or `detailed thinking off`, to toggle thinking. The template defaults to detailed thinking off when neither phrase is present, and it does not read the Qwen3-style `enable_thinking` template variable.
Enabling thinking
Include `detailed thinking on` in the first system message:
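A minimal sketch of such a request, reusing the placeholder server and port from §4:

```bash
# Thinking is toggled by the textual directive in the system message.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inclusionAI/Ling-2.6-1T",
    "messages": [
      {"role": "system", "content": "detailed thinking on"},
      {"role": "user", "content": "How many prime numbers are there below 50?"}
    ]
  }'
```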
The model will then emit `<think>...</think>` blocks before its final answer. To get those split into `message.reasoning_content` automatically, also launch the server with `--reasoning-parser qwen3`.
Caveat: --reasoning-parser qwen3 + tool calling
The SGLang `qwen3` reasoning parser was written for Qwen3, where models are default-thinking and clients opt out via `chat_template_kwargs.enable_thinking=false`. Ling-2.6 is the opposite: default-non-thinking, with toggling done in the system message. As a result, when the server is launched with both `--tool-call-parser qwen` and `--reasoning-parser qwen3`, every tool-call request must include `chat_template_kwargs.enable_thinking=false`, otherwise the parser routes the `<tool_call>...</tool_call>` block into `reasoning_content` instead of `message.tool_calls`:
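A sketch of the workaround request (tool definition illustrative, as in §4.2):

```bash
# enable_thinking=false keeps the qwen3 reasoning parser from swallowing the <tool_call> block.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inclusionAI/Ling-2.6-1T",
    "messages": [{"role": "user", "content": "What is the weather in Hangzhou today?"}],
    "tools": [{"type": "function", "function": {"name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {"type": "object", "properties": {"city": {"type": "string"}},
                     "required": ["city"]}}}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```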
The `enable_thinking` flag here is consumed by the SGLang reasoning parser, not by the chat template; Ling-2.6's template ignores it. For the simplest configuration, just omit `--reasoning-parser qwen3` and toggle thinking via the system message.
For more API examples, see the SGLang Basic Usage Guide.
5. Benchmark
GSM8K (Ling-2.6-1T, GB300 × 4)
Reference run on a single GB300 node with `--tp 4`:
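One way to reproduce such a run is SGLang's bundled few-shot GSM8K script pointed at the server from §4; the question count and parallelism here are assumptions, and the resulting accuracy is not reproduced in this sketch:

```bash
# Few-shot GSM8K evaluation against a locally running server (defaults to port 30000).
python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --parallel 64
```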
