1. Model Introduction
NVIDIA Nemotron 3 Super is a leading open model in the Nemotron 3 family, built for running many collaborating agents together. It is optimized for agentic systems that chain planning, reasoning, and tool use: workloads that generate far more tokens than single-turn chat and require strong reasoning at every step.
Nemotron 3 Super is a 120B parameter hybrid MoE model that activates only 12B parameters per forward pass, delivering strong accuracy for coding, tool calling, and instruction following at a fraction of the cost. It also supports a 1M token context window so agents can keep conversation history and plan state in view across long workflows.
Architecture and key features:
- Hybrid Transformer-Mamba Architecture (MoE): Combines Mixture of Experts with a hybrid Transformer-Mamba architecture, enabling efficient routing and sequence modeling in a single stack.
- Highest throughput efficiency in its size category: Delivers up to 5x higher throughput compared to the previous Nemotron Super model (Llama Nemotron Super 1.5).
- Multi-Token Prediction (MTP): By predicting several future tokens simultaneously in a single forward pass, MTP drastically accelerates the generation of long-form text.
- Thinking Budget support: lets you cap reasoning token generation while preserving accuracy.
2. SGLang Installation
SGLang from the main branch is required for Nemotron 3 Super. You can install it from source or use a nightly Docker image.
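A sketch of both install paths; the Docker image tag is an assumption, so check the SGLang releases for the current nightly tag:

```shell
# Install SGLang from source (main branch)
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"

# Or pull a recent Docker image (tag is an assumption; check Docker Hub)
docker pull lmsysorg/sglang:latest
```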
3. Model Deployment
This section provides a progressive guide from quick deployment to performance tuning.
3.1 Basic Configuration
Interactive Command Generator: select hardware, tensor parallelism, and common knobs to generate a launch command.
3.2 Configuration Tips
- Attention backend: on H200, use the FlashAttention 3 backend (default); on B200, use the FlashInfer backend (default).
- TP support: to set the tensor-parallel size, use --tp <2|4|8>.
- FP8 KV cache: to enable the FP8 KV cache, append --kv-cache-dtype fp8_e4m3.
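Putting the tips together, a minimal launch command might look like the following sketch (attention backend value shown for H200; the exact flag combination is an assumption based on standard SGLang server options):

```shell
# Launch on 4x H200 with FP8 KV cache;
# on B200, use --attention-backend flashinfer instead of fa3
python -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --tp 4 \
  --kv-cache-dtype fp8_e4m3 \
  --attention-backend fa3
```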
4. Model Invocation
4.1 Basic Usage (OpenAI-Compatible API)
SGLang provides an OpenAI-compatible endpoint. Example with the OpenAI Python client:
4.2 Reasoning
The model supports two modes: Reasoning ON (default) and Reasoning OFF. Reasoning can be toggled off by setting enable_thinking to False, as shown below.
4.3 Tool Calling
Call functions using the OpenAI Tools schema and inspect the returned tool_calls.
4.4 Controlling Reasoning Budget
The reasoning_budget parameter allows you to limit the length of the model's reasoning trace. When the reasoning output reaches the specified token budget, the model will attempt to gracefully end the reasoning at the next newline character.
If no newline is encountered within 500 tokens after reaching the budget threshold, the reasoning trace will be forcibly terminated at reasoning_budget + 500 tokens.
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: H200 (4x)
- Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
- Tensor Parallelism: 4
- SGLang Version: main branch
- Model Deployment Command:
- Benchmark Command:
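A sketch of the deployment and benchmark commands, reconstructed from the stated environment (4-way TP on H200) using standard SGLang tooling; the exact flags are assumptions:

```shell
# Deploy the model with 4-way tensor parallelism
python -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --tp 4

# In another terminal, benchmark serving throughput against the server
python -m sglang.bench_serving --backend sglang --num-prompts 100
```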
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
Environment
- Hardware: H200 (4x)
- Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
- Tensor Parallelism: 4
- SGLang Version: main branch
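A sketch using SGLang's few-shot GSM8K test utility; the module path and flags follow SGLang's benchmarking docs, but treat them as assumptions:

```shell
# Start the server (4-way TP, as in the environment above)
python -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --tp 4

# In another terminal, run the few-shot GSM8K accuracy test
python -m sglang.test.few_shot_gsm8k --num-questions 200
```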
5.2.2 MMLU Benchmark
Run Benchmark
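A sketch based on the MMLU benchmark scripts shipped in the SGLang repository; the paths, dataset URL, and flags are assumptions, so check benchmark/mmlu in the repo before running:

```shell
# From the SGLang repo, with the server already running
cd benchmark/mmlu

# Fetch the standard MMLU data archive (URL is an assumption)
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar
tar xf data.tar

# Evaluate on a subset of MMLU subjects (flag is an assumption)
python bench_sglang.py --nsub 10
```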
