1. Model Introduction
The DeepSeek-V3.2 series includes the following model variants, each optimized for different use cases:
- DeepSeek-V3.2-Exp: an upgraded version of DeepSeek-V3.1-Terminus that introduces the DeepSeek Sparse Attention (DSA) mechanism through continued training. DSA is a fine-grained sparse attention mechanism powered by a lightning indexer, which gives DeepSeek-V3.2-Exp significant efficiency improvements in long-context scenarios. Recommended for general conversations, long-context processing, and efficient inference.
- DeepSeek-V3.2: the standard version, suitable for general tasks and conversational scenarios. For local deployment, we recommend setting the sampling parameters to temperature = 1.0 and top_p = 0.95. Recommended for standard conversations and general tasks.
- DeepSeek-V3.2-Speciale: a special variant designed exclusively for deep reasoning tasks, optimized for scenarios that require complex logical reasoning and deep thinking. Note that this model does not support tool calls (see below). For local deployment, we recommend setting the sampling parameters to temperature = 1.0 and top_p = 0.95. Recommended for deep reasoning tasks, complex logical problems, and mathematical reasoning.
- DeepSeek-V3.2-NVFP4: an NVIDIA-optimized, NVFP4-quantized variant of DeepSeek-V3.2 for Blackwell devices. It uses ModelOpt FP4 quantization with a choice of MoE runner backends (flashinfer_trtllm (recommended), flashinfer_cutlass, or flashinfer_cutedsl), enabling efficient deployment at lower tensor parallelism (TP=4). It supports the same features as DeepSeek-V3.2, including tool calling, reasoning, and speculative decoding (MTP).
- DeepSeek-V3.2-MXFP4: an OCP MXFP4-quantized variant of DeepSeek-V3.2 for AMD MI300X/MI355X devices. It uses OCP MXFP4 quantization with the Triton MXFP4 backend (the same backend used for gpt-oss-120B), enabling efficient deployment at lower tensor parallelism (TP=8) in a single node. It supports the same features as DeepSeek-V3.2, including tool calling, reasoning, FP8 KV cache, CP, TP, and speculative decoding (MTP).
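The recommended sampling parameters above can be carried in an OpenAI-compatible request body. A minimal sketch — the model name and the `build_request` helper are placeholders for illustration, not part of the official API:

```python
import json

# Recommended sampling parameters for local deployment,
# taken from the section above.
SAMPLING = {"temperature": 1.0, "top_p": 0.95}

def build_request(messages, model="deepseek-ai/DeepSeek-V3.2"):
    """Build an OpenAI-compatible /v1/chat/completions request body
    with the recommended sampling parameters applied."""
    return {"model": model, "messages": messages, **SAMPLING}

body = build_request([{"role": "user", "content": "Hello!"}])
print(json.dumps(body, indent=2))
```

The same dict can be POSTed to any OpenAI-compatible endpoint exposed by an SGLang server.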
2. SGLang Installation
SGLang offers multiple installation methods; choose the one that best suits your hardware platform and requirements. Please refer to the official SGLang installation guide for installation instructions.
3. Model Deployment
This section provides a progressive guide, from quick deployment to performance optimization, suitable for users at different levels.
3.1 Basic Configuration
Interactive Command Generator: use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities. SGLang supports serving DeepSeek-V3.2 on NVIDIA H200 and B200 GPUs and on AMD MI300X/MI355X GPUs.
3.2 Configuration Tips
For more detailed configuration tips, please refer to DeepSeek-V3.2 Usage.
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:
4.2 Advanced Usage
4.2.1 Reasoning Parser
DeepSeek-V3.2 supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
Command
Example
Output
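With the reasoning parser enabled, the thinking section arrives in a reasoning_content field alongside the regular content of the assistant message. A minimal sketch of consuming both fields — the response dict below is a hand-written stand-in, not real model output:

```python
def split_reasoning(message: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from an assistant message produced
    with the reasoning parser enabled. Either field may be absent."""
    return (message.get("reasoning_content") or "",
            message.get("content") or "")

# Hand-written stand-in for choices[0]["message"] of a parsed response.
msg = {
    "role": "assistant",
    "reasoning_content": "The user asks for 2+2; that is 4.",
    "content": "2 + 2 = 4.",
}
reasoning, answer = split_reasoning(msg)
print("thinking:", reasoning)
print("answer:", answer)
```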
4.2.2 Tool Calling
DeepSeek-V3.2 and DeepSeek-V3.2-Exp both support tool calling, but they use different parser parameters. Enable the tool call parser as shown below. Note: DeepSeek-V3.2-Speciale does NOT support tool calling; it is designed exclusively for deep reasoning tasks. Deployment command for DeepSeek-V3.2-Exp:
Command
For DeepSeek-V3.2: set --tool-call-parser deepseekv32 and remove --chat-template.
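Tool definitions follow the OpenAI function-calling schema and are passed via the tools field of the request. A minimal sketch — the get_weather tool and the model name are hypothetical placeholders:

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request_body = {
    "model": "deepseek-ai/DeepSeek-V3.2-Exp",  # placeholder model name
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
}
print(json.dumps(request_body)[:80])
```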
Python Example (with Thinking Process):
Example
Output
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
Example
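The execute-and-send-back round trip described above can be sketched as follows; the tool_call dict is a hand-written stand-in for what the tool call parser returns, and get_weather is a hypothetical local implementation:

```python
import json

def get_weather(city: str) -> str:
    # Hypothetical local implementation of the tool.
    return f"Sunny in {city}, 22°C"

TOOLS = {"get_weather": get_weather}

def run_tool_call(tool_call: dict) -> dict:
    """Execute one parsed tool call and build the follow-up `tool`-role
    message to append to the conversation before the next request."""
    fn = tool_call["function"]
    result = TOOLS[fn["name"]](**json.loads(fn["arguments"]))
    return {"role": "tool",
            "tool_call_id": tool_call["id"],
            "content": result}

# Hand-written stand-in for a parsed tool call from the model.
call = {"id": "call_0", "type": "function",
        "function": {"name": "get_weather",
                     "arguments": '{"city": "Paris"}'}}
print(run_tool_call(call))
```

Appending the returned message to the conversation and re-sending the request lets the model continue with the tool result.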
4.2.3 Enabling PP, CP and TP with FP8 KV cache
We recommend DP2 + MTP for local deployment of agentic workflows with DeepSeek-V3.2 on the Hopper platform:
Command
CP is currently enabled with PP=2 on the Hopper platform, which allows reducing TP from 16 to 8 relative to standalone deployment:
Command
Command
5. Benchmark
5.1 Speed Benchmark on Blackwell
Test Environment:
- Hardware: NVIDIA B200 GPU (8x)
- Model: DeepSeek-V3.2-Exp
- Tensor Parallelism: 8
- sglang version: 0.5.6
5.1.1 Latency-Sensitive Benchmark
- Model Deployment Command:
Command
- Benchmark Command:
Command
- Test Results:
Output
5.1.2 Throughput-Sensitive Benchmark
- Model Deployment Command:
Command
- Benchmark Command:
Command
- Test Results:
Output
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Benchmark Command:
Command
- Test Results:
- DeepSeek-V3.2-Exp
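GSM8K-style accuracy scoring typically extracts the final number from each completion and compares it to the gold answer. A generic sketch of that convention (not SGLang's actual benchmark script):

```python
import re

def extract_answer(text: str):
    """Pull the last number from a model completion, as GSM8K-style
    scoring scripts commonly do. Returns None if no number is found."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def accuracy(preds: list, golds: list) -> float:
    """Fraction of completions whose extracted answer matches the gold."""
    hits = sum(extract_answer(p) == g for p, g in zip(preds, golds))
    return hits / len(golds)

print(accuracy(["The answer is 42.", "So we get 7 apples."],
               ["42", "8"]))  # → 0.5
```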
5.2.2 MMLU Benchmark
- Benchmark Command:
Command
- Test Results:
- DeepSeek-V3.2-Exp
5.3 Speed Benchmark on Hopper
Test Environment:
- Hardware: NVIDIA H800 GPU (16x)
- Model: DeepSeek-V3.2
- Tensor Parallelism: 16
- sglang version: 0.5.9
5.3.1 Latency-Sensitive Benchmark
- Model Deployment Command:
Command
- Benchmark Command:
Command
- Test Results:
Output
5.3.2 Throughput-Sensitive Benchmark
We use the same deployment command and vary throughput by increasing the maximum concurrency:
Command
1024, and when concurrency is greater than 128, the TTFT increases sharply:
Output
With --random-range-ratio 1, we can get even higher reported numbers:
Output
