1. Model Introduction
Qwen3-Coder is the latest code-focused large language model series from the Qwen team. Built on the foundation of Qwen3, Qwen3-Coder delivers exceptional performance in code generation, understanding, and reasoning tasks.

Key Features:
- State-of-the-art Coding Performance: Achieves top-tier results on HumanEval, MBPP, LiveCodeBench, and other major coding benchmarks.
- Tool Calling Support: Native support for function calling and tool use, enabling seamless integration with external APIs and services.
- Extended Context Length: Supports up to 256K tokens for processing large codebases and long documents.
- Multilingual Code Support: Proficient in Python, JavaScript, TypeScript, Java, C++, Go, Rust, and many other programming languages.
- MoE Architecture: Efficient Mixture-of-Experts design for optimal performance-to-cost ratio.
- ROCm Support: Compatible with AMD MI300X, MI325X and MI355X GPUs via SGLang (verified).
- NVIDIA GPU Support: Compatible with NVIDIA GB200 and B200 GPUs via SGLang (verified).
2. SGLang Installation
SGLang offers multiple installation methods; choose the one that best suits your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions.

3. Model Deployment
This section provides deployment configurations verified on AMD MI300X, MI325X, and MI355X and on NVIDIA B200 and GB200 hardware platforms.

3.1 Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model size, and quantization method.

3.2 Configuration Tips
AMD (MI300X/MI325X/MI355X):
- Memory Management: We have verified successful deployment on MI300X/MI325X/MI355X with `--context-length 8192`. Larger context lengths may be supported but require additional memory.
- Expert Parallelism: For 480B-A35B with FP8 quantization, `--ep 2` is required to satisfy the dimension alignment requirement.
- Page Size: `--page-size 32` is recommended for MoE models to optimize memory usage.
- Environment Variable: If you encounter aiter-related issues, try setting `SGLANG_USE_AITER=0`.
NVIDIA (B200/GB200):
- MoE Runner Backend: FP8 uses `--moe-runner-backend triton`; NVFP4 uses `--moe-runner-backend flashinfer_cutlass`.
- NVFP4 Quantization: Requires `--quantization modelopt_fp4` and uses a different model path (nvidia/Qwen3-Coder-...).
- DP Attention: The NVFP4 configuration supports `--enable-dp-attention` for improved throughput.
- Tool Use: To enable tool calling capabilities, add `--tool-call-parser qwen3_coder` to the launch command.
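The flags discussed in these tips can be assembled into a launch command programmatically. The sketch below is illustrative only: the model path, default parallelism values, and helper name are assumptions, not the verified deployment configuration from Section 3.1.

```python
def build_launch_command(
    model_path: str,
    tp: int = 8,
    ep: int = 2,
    context_length: int = 8192,
    page_size: int = 32,
) -> list[str]:
    """Assemble an SGLang server launch command from the tips above (sketch)."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp),
        "--ep", str(ep),                      # required for 480B-A35B FP8
        "--context-length", str(context_length),
        "--page-size", str(page_size),        # recommended for MoE models
        "--tool-call-parser", "qwen3_coder",  # enables tool calling
    ]

cmd = build_launch_command("Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8")
print(" ".join(cmd))

# To actually launch when deploying:
# import subprocess
# subprocess.Popen(cmd)
```

Adjust the values to match the command generated for your hardware in Section 3.1 rather than relying on these defaults.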
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:

4.2 Advanced Usage
4.2.1 Code Generation Example
Example
Output
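As a rough sketch of what a code generation request looks like against an SGLang server's OpenAI-compatible endpoint — the port, model name, prompt, and sampling parameters here are illustrative assumptions, not values from the example above:

```python
import json

# Assumed local SGLang endpoint; adjust host/port to your deployment.
API_URL = "http://localhost:30000/v1/chat/completions"

def build_request(prompt: str,
                  model: str = "Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8") -> dict:
    """Build an OpenAI-compatible chat-completions payload for code generation."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # low temperature favours deterministic code
        "max_tokens": 512,
    }

payload = build_request("Write a Python function that reverses a string.")
print(json.dumps(payload, indent=2))

# To actually send the request once the server is up:
# import urllib.request
# req = urllib.request.Request(API_URL, data=json.dumps(payload).encode(),
#                              headers={"Content-Type": "application/json"})
# body = json.loads(urllib.request.urlopen(req).read())
# print(body["choices"][0]["message"]["content"])
```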
4.2.2 Tool Calling Example
Qwen3-Coder supports tool calling. Enable the tool call parser during deployment. The following example uses the 30B-A3B model:

Command
Example
Output
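To illustrate the request shape, here is a minimal sketch of a tool-calling payload in the OpenAI function-calling format. The `get_weather` tool is a hypothetical example, not part of the model release, and the response-parsing snippet in the comments assumes the standard `tool_calls` field of that API.

```python
# Hypothetical tool definition in the OpenAI function-calling format.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

payload = {
    "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": "What's the weather in Beijing?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}

# With --tool-call-parser qwen3_coder enabled at launch, the parsed call
# appears in the response's choices[0].message.tool_calls field, e.g.:
# for call in response["choices"][0]["message"].get("tool_calls", []):
#     print(call["function"]["name"], call["function"]["arguments"])
```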
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: AMD MI300X GPU (8x)
- Model: Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
- Tensor Parallelism: 8
- Expert Parallelism: 2
- sglang version: 0.5.7
5.1.1 Standard Scenario Benchmark
- Model Deployment Command:
Command
5.1.1.1 Low Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.1.1.2 Medium Concurrency
- Benchmark Command:
Command
- Test Results:
Output
5.1.1.3 High Concurrency
- Benchmark Command:
Command
- Test Results:
Output
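The headline serving-benchmark metrics reduce to simple ratios over the completed requests. The sketch below shows that arithmetic with clearly illustrative numbers; the field names are assumptions, not the exact output schema of the benchmark tool.

```python
from dataclasses import dataclass

@dataclass
class RequestStats:
    """Per-request record from a serving benchmark (illustrative fields)."""
    input_tokens: int
    output_tokens: int
    latency_s: float  # end-to-end request latency

def summarize(stats: list[RequestStats], wall_time_s: float) -> dict:
    """Compute the usual serving-benchmark aggregates."""
    total_output = sum(s.output_tokens for s in stats)
    return {
        "request_throughput_rps": len(stats) / wall_time_s,
        "output_token_throughput_tps": total_output / wall_time_s,
        "mean_latency_s": sum(s.latency_s for s in stats) / len(stats),
    }

# Illustrative numbers only, not measured results:
demo = [RequestStats(1024, 256, 2.0), RequestStats(1024, 256, 2.2)]
print(summarize(demo, wall_time_s=4.0))
```

Note how output-token throughput scales with concurrency while per-request latency grows, which is the trade-off the low/medium/high concurrency runs above are probing.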
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Benchmark Command:
Command
AMD (MI300X/MI325X/MI355X)
Results:
- Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8

NVIDIA (B200/GB200)
For deployment commands, see Section 3.1.
- Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 (tp=8, ep=2)
- nvidia/Qwen3-Coder-480B-A35B-Instruct-NVFP (NVFP4, tp=8, ep=1)
