1. Model Introduction
Qwen3.5-397B-A17B is the latest flagship model in the Qwen series developed by Alibaba, representing a significant leap forward with a unified vision-language foundation, an efficient hybrid architecture, and scalable reinforcement learning. Qwen3.5 combines Gated Delta Networks with a sparse Mixture-of-Experts architecture (397B total parameters, 17B activated), delivering high-throughput inference with minimal latency. It supports multimodal inputs (text, image, video) and natively handles context lengths of up to 262,144 tokens, extensible to over 1M tokens.

Key Features:
- Unified Vision-Language Foundation: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models
- Efficient Hybrid Architecture: Gated Delta Networks + sparse MoE (397B total / 17B active) for high-throughput inference
- Hybrid Reasoning: Thinking mode enabled by default with step-by-step reasoning, can be disabled for direct responses
- Tool Calling: Built-in tool calling support with the `qwen3_coder` parser
- Multi-Token Prediction (MTP): Speculative decoding support for lower latency
- 201 Language Support: Expanded multilingual coverage across 201 languages and dialects
| Model | BF16 (Full precision) | FP8 (8-bit Quantized) | FP4 (4-bit Quantized) |
|---|---|---|---|
| Qwen3.5-397B-A17B | Qwen/Qwen3.5-397B-A17B | Qwen/Qwen3.5-397B-A17B-FP8 | nvidia/Qwen3.5-397B-A17B-NVFP4 |
| Qwen3.5-122B-A10B | Qwen/Qwen3.5-122B-A10B | Qwen/Qwen3.5-122B-A10B-FP8 | - |
| Qwen3.5-35B-A3B | Qwen/Qwen3.5-35B-A3B | Qwen/Qwen3.5-35B-A3B-FP8 | - |
| Qwen3.5-27B | Qwen/Qwen3.5-27B | Qwen/Qwen3.5-27B-FP8 | - |
| Qwen3.5-9B | Qwen/Qwen3.5-9B | - | - |
| Qwen3.5-4B | Qwen/Qwen3.5-4B | - | - |
| Qwen3.5-2B | Qwen/Qwen3.5-2B | - | - |
| Qwen3.5-0.8B | Qwen/Qwen3.5-0.8B | - | - |
2. SGLang Installation
SGLang from the main branch is required for Qwen3.5. You can install from source or use a Docker image:

Command
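As a sketch, installing from source typically looks like the following (package extras and image tags change frequently; check the SGLang repository for the current instructions):

```shell
# Install SGLang from source (main branch):
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"

# Or pull a recent Docker image instead:
docker pull lmsysorg/sglang:latest
```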
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.

3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and capabilities.

3.2 Configuration Tips
- Speculative decoding (MTP) can significantly reduce latency for interactive use cases.
- Mamba Radix Cache: Qwen3.5's hybrid Gated Delta Networks architecture supports two mamba scheduling strategies via `--mamba-scheduler-strategy`:
  - V1 (`no_buffer`): Default. No overlap scheduler, lower memory usage. Required for AMD MI GPUs.
  - V2 (`extra_buffer`): Enables overlap scheduling and branching point caching with `--mamba-scheduler-strategy extra_buffer --page-size 64`. Requires the FLA kernel backend (NVIDIA GPUs only). Trades higher mamba state memory for better throughput. Strictly superior in non-KV-cache-bound scenarios; in KV-cache-bound cases, weigh the overlap scheduling benefit against reduced max concurrency. `--page-size` must satisfy `FLA_CHUNK_SIZE % page_size == 0` or `page_size % FLA_CHUNK_SIZE == 0` (`FLA_CHUNK_SIZE` is currently 64).
- The `--mem-fraction-static` flag is recommended for optimal memory utilization; adjust it based on your hardware and workload.
- Context length defaults to 262,144 tokens. If you encounter OOM errors, consider reducing it, but maintain at least 128K to preserve thinking capabilities.
- To speed up weight loading for this large model, add `--model-loader-extra-config='{"enable_multithread_load": "true", "num_threads": 64}'` to the launch command.
- CUDA IPC Transport: Set the environment variable `SGLANG_USE_CUDA_IPC_TRANSPORT=1` to use CUDA IPC for transferring multimodal features, significantly improving TTFT (Time To First Token). Note: this consumes additional memory proportional to image size, so you may need to lower `--mem-fraction-static` or `--max-running-requests`.
- Multimodal Attention Backend: Use `--mm-attention-backend fa3` on H100/H200 for better vision performance, or `--mm-attention-backend fa4` on B200/B300.
- B200 (FP8): Add `--enable-flashinfer-allreduce-fusion` for optimized throughput on Blackwell.
- For processing large images or videos, you may need to lower `--mem-fraction-static` to leave room for image feature tensors.
- Hardware requirements:
  - BF16: ~397B parameters require ~800GB of GPU memory for weights.
    - H100 (80GB) requires tp=16 (2 nodes), since each rank would need ~100GB at tp=8.
    - H200 (141GB) runs with tp=8.
    - B200 (183GB) runs with tp=8.
    - B300 (275GB) runs with tp=4.
    - MI300X (192GB) runs with tp=8.
    - MI325X (256GB) runs with tp=4.
    - MI355X (288GB) runs with tp=4.
  - FP8: The FP8 quantized model requires ~400GB for weights, cutting memory in half.
    - H100 (80GB) runs with tp=8.
    - H200 (141GB) runs with tp=4.
    - B200 (183GB) runs with tp=4.
    - B300 (275GB) runs with tp=2.
    - MI300X (192GB) runs with tp=4.
    - MI325X (256GB) runs with tp=2.
    - MI355X (288GB) runs with tp=2.
  - FP4: The FP4 quantized model requires ~250GB for weights, cutting memory by almost 4x. Only compatible with B200/B300 (Blackwell architecture).
    - B200 (183GB) runs with tp=4.
    - B300 (275GB) runs with tp=2.
| Hardware | Memory | BF16 TP | FP8 TP | FP4 TP |
|---|---|---|---|---|
| H100 | 80GB | 16 | 8 | N/A |
| H200 | 141GB | 8 | 4 | N/A |
| B200 | 183GB | 8 | 4 | 4 |
| B300 | 275GB | 4 | 2 | 2 |
| MI300X | 192GB | 8 | 4 | N/A |
| MI325X | 256GB | 4 | 2 | N/A |
| MI355X | 288GB | 4 | 2 | N/A |
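The table above follows from simple per-rank arithmetic: the weight footprint divided by the TP degree must fit on a single GPU with headroom left for KV cache, activations, and CUDA graphs. A rough sanity check (illustrative only; the 75% headroom factor is an assumption, and real deployments need additional memory beyond weights):

```python
# Rough per-rank weight-memory check behind the TP table.
# Approximate weight sizes: ~800 GB (BF16), ~400 GB (FP8), ~250 GB (FP4).
def min_tp(weight_gb: float, gpu_gb: float, headroom: float = 0.75) -> int:
    """Smallest power-of-two TP degree whose per-rank share of the
    weights fits within `headroom` of one GPU's memory."""
    tp = 1
    while weight_gb / tp > gpu_gb * headroom:
        tp *= 2
    return tp

# H100 (80 GB): BF16 needs tp=16 (two nodes), FP8 fits at tp=8.
assert min_tp(800, 80) == 16
assert min_tp(400, 80) == 8
# H200 (141 GB): BF16 at tp=8, FP8 at tp=4.
assert min_tp(800, 141) == 8
assert min_tp(400, 141) == 4
```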
4. Model Invocation
NVIDIA: Deploy Qwen3.5-397B-A17B with the following command (H200, all features enabled):

Command
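A rough sketch of such a launch command, assembled from the flags discussed in §3.2 (the `--reasoning-parser` value and the exact flag set are assumptions and may differ on the current main branch):

```shell
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --tp 8 \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --mamba-scheduler-strategy extra_buffer --page-size 64 \
  --mm-attention-backend fa3 \
  --mem-fraction-static 0.85
```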
Command
Note: TP8 works on all MI GPUs. For MI325X/MI355X, you can use `--tp 4` as the minimum requirement.
4.1 Basic Usage
For basic API usage and request examples, please refer to:

4.2 Vision Input
Qwen3.5 supports image and video inputs as a unified vision-language model. Here is an example with an image:

Example
Output
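A minimal sketch of an OpenAI-compatible chat request body with one image part (the model name follows the table in §1; the image URL and endpoint are placeholders):

```python
import json

# /v1/chat/completions request body mixing an image part and a text part.
payload = {
    "model": "Qwen/Qwen3.5-397B-A17B",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/demo.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 512,
}

body = json.dumps(payload)
# The content field is a list of typed parts; here: one image_url, one text.
assert {p["type"] for p in payload["messages"][0]["content"]} == {"image_url", "text"}
```

The same request would be POSTed to the server's `/v1/chat/completions` endpoint; video input uses an analogous typed content part.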
4.3 Advanced Usage
4.3.1 Reasoning Parser
Qwen3.5 supports Thinking mode by default. Enable the reasoning parser during deployment to separate the thinking and content sections. The thinking process is returned via `reasoning_content` in the streaming response.
To disable thinking and use Instruct mode, pass `chat_template_kwargs` at request time:

- Thinking mode (default): The model performs step-by-step reasoning before answering. No extra parameters needed.
- Instruct mode (`{"enable_thinking": false}`): The model responds directly without a thinking process.

Thinking mode (default), returning `reasoning_content`:
Example
Output
Instruct mode, passing `{"enable_thinking": false}` via `chat_template_kwargs`:
Example
Output
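The per-request switch can be sketched as follows (the response shape shown is illustrative of a parsed Thinking-mode reply, not captured server output):

```python
# Request-level switch between Thinking and Instruct mode.
thinking_req = {
    "model": "Qwen/Qwen3.5-397B-A17B",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
}
instruct_req = {
    **thinking_req,
    # chat_template_kwargs is forwarded to the chat template by the server.
    "chat_template_kwargs": {"enable_thinking": False},
}

# With the reasoning parser enabled, a Thinking-mode reply separates the
# chain of thought from the final answer (illustrative shape):
response = {
    "choices": [{
        "message": {
            "reasoning_content": "17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391.",
            "content": "17 * 23 = 391.",
        }
    }]
}
msg = response["choices"][0]["message"]
assert "reasoning_content" in msg and msg["content"].endswith("391.")
```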
4.3.2 Tool Calling
Qwen3.5 supports tool calling capabilities. Enable the tool call parser during deployment. Thinking mode is on by default; to disable it for tool calling requests, pass `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`.
Python Example (with Thinking Process):
Example
Output
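A sketch of the request and response shapes involved (the `get_weather` function and the response layout are illustrative, not part of any Qwen3.5 API):

```python
import json

# A single function tool in OpenAI tools format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "model": "Qwen/Qwen3.5-397B-A17B",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
    # Disable thinking for this request only:
    "chat_template_kwargs": {"enable_thinking": False},
}

# With the tool call parser enabled, the server returns structured tool
# calls whose arguments arrive as a JSON string (illustrative shape):
message = {
    "tool_calls": [{
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
    }]
}
call = message["tool_calls"][0]["function"]
args = json.loads(call["arguments"])
assert call["name"] == "get_weather" and args == {"city": "Paris"}
```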
5. Benchmark
5.1 Accuracy Benchmark
5.1.1 GSM8K Benchmark
- Benchmark Command
Command
- Test Result
Output
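One way to run GSM8K against a locally served model is SGLang's built-in few-shot harness (a sketch; argument names and defaults may differ across versions):

```shell
python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --parallel 128
```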
5.1.2 MMMU Benchmark
- Benchmark Command
Command
- Test Result
Output
5.2 Speed Benchmark
Test Environment:
- Hardware: H200 (8x)
- Model: Qwen3.5-397B-A17B
- Tensor Parallelism: 8
- SGLang Version: main branch
Command
5.2.1 Latency Benchmark
Command
Output
5.2.2 Throughput Benchmark
Command
Output
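For reference, a throughput run against a live server typically uses SGLang's serving benchmark with random prompts (a sketch; the input/output lengths and prompt count here are arbitrary choices, not the settings used for the numbers above):

```shell
python3 -m sglang.bench_serving --backend sglang \
  --dataset-name random --random-input-len 1024 --random-output-len 1024 \
  --num-prompts 512
```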
5.3 Vision Speed Benchmark
We use SGLang's built-in benchmarking tool to conduct performance evaluation with random images. Each request has 128 input tokens, two 720p images, and 1024 output tokens.

5.3.1 Latency Benchmark
Command
Output
5.3.2 Throughput Benchmark
Command
Output
