Qwen 3.5 Usage#
Qwen 3.5 is Alibaba’s latest generation LLM featuring a hybrid attention architecture, advanced MoE with shared experts, and native multimodal capabilities.
Key architecture features:

- Hybrid Attention: Gated Delta Networks (linear, O(n) complexity) combined with full attention every 4th layer for high associative recall
- MoE with Shared Experts: top-8 active out of 64 routed experts, plus a dedicated shared expert for universal features
- Multimodal: DeepStack Vision Transformer with Conv3d for native image and video understanding
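The hybrid layout described above can be sketched as a simple layer schedule. This is illustrative only: the function name and the exact placement of the full-attention layers within each group of four are assumptions, not values read from the model config.

```python
def attention_kind(layer_idx: int, full_attn_interval: int = 4) -> str:
    """Return which attention type a layer uses in the hybrid schedule (sketch)."""
    # Every 4th layer uses full (quadratic) attention for associative recall;
    # all other layers use linear-time Gated Delta Network attention.
    if (layer_idx + 1) % full_attn_interval == 0:
        return "full_attention"
    return "gated_delta_net"

# For the first 8 layers, layers 3 and 7 (0-based) would be full attention.
schedule = [attention_kind(i) for i in range(8)]
```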
Launch Qwen 3.5 with SGLang#
MoE Model#
To serve Qwen/Qwen3.5-397B-A17B on 8 GPUs:
```shell
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --tp 8 \
  --trust-remote-code
```
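Once the server is up (it listens on port 30000 by default), it exposes an OpenAI-compatible API. A minimal sketch of a chat request body you could POST to `/v1/chat/completions`; the prompt and sampling values are illustrative, not recommended settings:

```python
import json

# OpenAI-compatible chat request body for the server launched above.
# Port 30000 is SGLang's default; adjust the URL if you passed --port.
payload = {
    "model": "Qwen/Qwen3.5-397B-A17B",
    "messages": [
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
    "max_tokens": 256,
    "temperature": 0.6,
}
body = json.dumps(payload)
# POST `body` to http://localhost:30000/v1/chat/completions with
# header Content-Type: application/json (e.g. via curl or requests.post).
```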
AMD GPU (MI300X / MI325X / MI35X)#
On AMD Instinct GPUs, use the triton attention backend. Both the full attention layers and the Gated Delta Net (linear attention) layers use Triton-based kernels on ROCm:
```shell
SGLANG_USE_AITER=1 python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --tp 8 \
  --attention-backend triton \
  --trust-remote-code
```
Tip: Set `SGLANG_USE_AITER=1` to enable AMD's optimized `aiter` kernels for MoE and GEMM operations.
Configuration Tips#
- `--attention-backend`: Use `triton` on AMD GPUs for Qwen 3.5. The hybrid attention architecture (Gated Delta Networks + full attention) works best with the Triton backend on ROCm. The linear attention (GDN) layers always use Triton kernels internally via the `GDNAttnBackend`.
- `--watchdog-timeout`: Increase to `1200` or higher for this large model, as weight loading takes significant time.
- `--model-loader-extra-config '{"enable_multithread_load": true}'`: Enables parallel weight loading for faster startup.
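Putting these tips together, a tuned launch might look like the following; the flag values are the suggestions from this section, not mandated defaults:

```shell
# NVIDIA launch with a longer watchdog timeout and multithreaded weight loading.
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --tp 8 \
  --trust-remote-code \
  --watchdog-timeout 1200 \
  --model-loader-extra-config '{"enable_multithread_load": true}'
```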
Reasoning and Tool Calling#
Qwen 3.5 supports reasoning and tool calling via the Qwen3 parsers:
```shell
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --tp 8 \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder
```
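With the parsers enabled, tool calling works through the OpenAI-compatible endpoint using the standard function-calling schema. A minimal sketch of a request body; the `get_current_weather` tool is a made-up example for illustration:

```python
import json

# Chat request with one tool definition in the OpenAI function-calling format.
payload = {
    "model": "Qwen/Qwen3.5-397B-A17B",
    "messages": [
        {"role": "user", "content": "What is the weather like in Boston today?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name, e.g. Boston"}
                },
                "required": ["location"],
            },
        },
    }],
}
body = json.dumps(payload)
# With --reasoning-parser set, the response separates the model's reasoning
# into `reasoning_content`; parsed tool calls appear under
# choices[0].message.tool_calls.
```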
Accuracy Evaluation#
You can evaluate the model accuracy using lm-eval:
```shell
pip install "lm-eval[api]"
```
```shell
lm_eval --model local-completions \
  --model_args '{"base_url": "http://localhost:30000/v1/completions", "model": "Qwen/Qwen3.5-397B-A17B", "num_concurrent": 256, "max_retries": 10, "max_gen_toks": 2048}' \
  --tasks gsm8k \
  --batch_size auto \
  --num_fewshot 5 \
  --trust_remote_code
```

Note that the `base_url` uses port 30000, SGLang's default, matching the launch commands above; adjust it if you started the server with a different `--port`.