1. Model Introduction
Qwen3-Coder-Next is a cost-efficient, code-focused language model from the Qwen team (Alibaba). With 80B total parameters but only 3B activated parameters, it achieves performance comparable to models with 10–20x more active parameters through its innovative hybrid architecture.
Key Features:
- Hybrid Architecture: Uses a 48-layer hybrid layout combining Gated DeltaNet and Gated Attention with Mixture-of-Experts (512 total experts, 10 activated, 1 shared), enabling exceptional efficiency.
- Tool Calling Support: Advanced agentic capabilities with native support for function calling and tool use via the qwen3_coder parser.
- Extended Context Length: Supports up to 256K tokens for processing large codebases and long documents.
- Cost-Efficient Inference: Only 3B parameters are activated per token, making it ideal for local development and cost-effective deployment at scale.
- IDE Integration: Compatible with Claude Code, Qwen Code, Cline, and other IDE platforms.
2. SGLang Installation
SGLang offers multiple installation methods; choose the one best suited to your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions. Note: Qwen3-Coder-Next requires SGLang v0.5.8 or later.
3. Model Deployment
This section provides a progressive guide, from quick deployment to performance optimization, suitable for users at all levels.
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and deployment options.
3.2 Configuration Tips
- Context Length: The model natively supports up to 256K tokens. If you encounter OOM issues, try --context-length 32768.
- Tool Use: To enable tool calling capabilities, use the --tool-call-parser qwen3_coder flag.
- Sampling Parameters: SGLang automatically applies the recommended sampling parameters from the model's generation_config.json; no manual configuration is needed.
- Mamba Radix Cache: Qwen3-Coder-Next's hybrid Gated DeltaNet architecture supports two mamba scheduling strategies via --mamba-scheduler-strategy:
  - V1 (no_buffer): Default. No overlap scheduler, lower memory usage.
  - V2 (extra_buffer): Enables overlap scheduling and branching-point caching with --mamba-scheduler-strategy extra_buffer --page-size 64. Requires the FLA kernel backend. Trades higher mamba state memory for better throughput. Strictly superior in non-KV-cache-bound scenarios; in KV-cache-bound cases, weigh the overlap-scheduling benefit against reduced maximum concurrency. --page-size must satisfy FLA_CHUNK_SIZE % page_size == 0 or page_size % FLA_CHUNK_SIZE == 0 (FLA_CHUNK_SIZE is currently 64).
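The --page-size divisibility rule above can be sanity-checked with a small helper. This is an illustrative sketch, not part of SGLang itself; only the constant FLA_CHUNK_SIZE = 64 and the divisibility condition come from the note above.

```python
FLA_CHUNK_SIZE = 64  # current FLA chunk size, per the constraint above

def is_valid_page_size(page_size: int) -> bool:
    """Return True if page_size satisfies the FLA compatibility rule:
    FLA_CHUNK_SIZE % page_size == 0 or page_size % FLA_CHUNK_SIZE == 0."""
    return FLA_CHUNK_SIZE % page_size == 0 or page_size % FLA_CHUNK_SIZE == 0

# Valid values: divisors of 64 (1, 2, 4, 8, 16, 32, 64)
# and multiples of 64 (128, 192, ...); e.g. 48 or 96 are invalid.
```

So the documented default of --page-size 64 is valid, as are smaller powers of two and larger multiples of 64.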
4. Model Invocation
Deployment Command:
Command
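Use the command produced by the configuration selector in Section 3.1; as a rough sketch only, a launch command for a 2-GPU deployment might look like the following (the port and flag values are illustrative assumptions, not the generated command):

```shell
# Illustrative sketch only -- prefer the generated command above.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-Coder-Next \
  --tp 2 \
  --tool-call-parser qwen3_coder \
  --port 30000
```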
4.1 Basic Usage
For basic API usage and request examples, please refer to:
4.2 Advanced Usage
4.2.1 Code Generation Example
Example
Output
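The example above shows the full request; as a self-contained sketch, the payload for a code-generation call against the server's OpenAI-compatible endpoint might look like this (the host, port, prompt, and max_tokens value are assumptions):

```python
import json

# Assumed local endpoint; adjust host/port to match your deployment.
API_URL = "http://localhost:30000/v1/chat/completions"

payload = {
    "model": "Qwen/Qwen3-Coder-Next",
    "messages": [
        {"role": "user",
         "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
    "max_tokens": 512,
}
body = json.dumps(payload).encode("utf-8")

# Sending the request requires a running server:
# import urllib.request
# req = urllib.request.Request(API_URL, data=body,
#                              headers={"Content-Type": "application/json"})
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```

No sampling parameters are set here because, as noted in Section 3.2, SGLang applies the recommended values from generation_config.json automatically.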
4.2.2 Streaming Example
Example
Output
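As a sketch of how streamed output is consumed: with "stream": true, the server sends incremental delta chunks that the client concatenates. The loop below runs on simulated chunks so it is self-contained; a real client would iterate over the server-sent events of the response instead.

```python
# Simulated streaming chunks in the OpenAI-compatible delta format.
# A real request would set "stream": true in the payload and iterate
# over server-sent events rather than this hard-coded list.
chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "def add(a, b):\n"}}]},
    {"choices": [{"delta": {"content": "    return a + b\n"}}]},
    {"choices": [{"delta": {}}]},  # final chunk may carry no content
]

text = ""
for chunk in chunks:
    delta = chunk["choices"][0]["delta"]
    # Accumulate only content deltas; role-only or empty deltas add no text.
    text += delta.get("content", "")

print(text)
```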
4.2.3 Tool Calling Example
Qwen3-Coder-Next supports tool calling capabilities. Make sure --tool-call-parser qwen3_coder is included in the deployment command above.
Python Example:
Example
Output
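The example above is the authoritative one; as an illustrative sketch, a tool-calling request pairs the chat payload with a tools list in the OpenAI function-calling format. The get_weather tool and its schema below are hypothetical.

```python
# Hypothetical tool definition in the OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "Qwen/Qwen3-Coder-Next",
    "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
    "tools": tools,
}

# With --tool-call-parser qwen3_coder enabled, the parsed calls appear in
# the response under choices[0].message.tool_calls, e.g. a function name
# plus a JSON-encoded arguments string.
```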
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: NVIDIA B200 GPU (2x)
- Model: Qwen/Qwen3-Coder-Next
- Tensor Parallelism: 2
- sglang version: 0.5.8+
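The exact commands are listed in each subsection below; as a rough illustration of the pattern, SGLang's serving benchmark is typically driven with sglang.bench_serving, varying the concurrency limit across the low/medium/high runs. All flag values shown here are assumptions, not the measured configuration.

```shell
# Illustrative sketch only -- the actual benchmark commands appear below.
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 64 \
  --max-concurrency 8   # raise for the medium/high-concurrency runs
```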
5.1.1 Standard Scenario Benchmark
- Model Deployment Command:
Command
5.1.1.1 Low Concurrency
- Benchmark Command:
Command
- Result:
Output
5.1.1.2 Medium Concurrency
- Benchmark Command:
Command
- Result:
Output
5.1.1.3 High Concurrency
- Benchmark Command:
Command
- Result:
Output
5.1.2 Reasoning Scenario Benchmark
- Model Deployment Command:
Command
5.1.2.1 Low Concurrency
- Benchmark Command:
Command
- Result:
Output
5.1.2.2 Medium Concurrency
- Benchmark Command:
Command
- Result:
Output
5.1.2.3 High Concurrency
- Benchmark Command:
Command
- Result:
Output
5.1.3 Summarization Scenario Benchmark
5.1.3.1 Low Concurrency
- Benchmark Command:
Command
- Result:
Output
5.1.3.2 Medium Concurrency
- Benchmark Command:
Command
- Result:
Output
5.1.3.3 High Concurrency
- Benchmark Command:
Command
- Result:
Output
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Benchmark Command:
Command
- Test Results:
Output
5.2.2 MMLU Benchmark
- Benchmark Command:
Command
- Test Results:
Output
