1. Model Introduction
NVIDIA Nemotron3-Ultra is an open frontier reasoning model in the Nemotron 3 family, built for long-running autonomous agents. It is optimized for complex orchestration across coding, deep research, enterprise workflows, and EDA use cases where agents must sustain reasoning across many steps and large context windows.
Nemotron 3 Ultra is a 550B parameter hybrid MoE model that activates only 55B parameters per forward pass, delivering frontier reasoning accuracy with high-throughput inference. It supports a 1M token context window so agents can keep conversation history, tool outputs, and plan state in view across persistent workflows.
Architecture and key features:
- Hybrid Transformer-Mamba Architecture (MoE): Combines Mixture of Experts with a hybrid Transformer-Mamba architecture, enabling efficient routing and sequence modeling in a single stack.
- Long-horizon agentic reasoning: Tuned for agents that plan, call tools, inspect results, recover from failures, and continue working across long task horizons — coding, deep research, enterprise automation, and EDA.
- 1M token context window: Sustains coherent agent state across extended workflows without re-ingestion.
- BF16 and NVFP4 quantization: Deployable from multi-node H100 down to a single Blackwell node with NVFP4.
- Multi-environment RL post-training: Post-trained with reinforcement learning across multiple environments for robust reasoning and reliable agentic behavior.
- Open weights, open data, open recipes: Customizable for domain-specific agents and deployable across your own infrastructure.
- BF16: 16×H100, 16×H200, 8×B200/B300
- NVFP4: 4/8×B200/B300, 4×GB200/GB300
2. SGLang Installation
Nemotron3-Ultra support has not yet propagated tolmsysorg/sglang:latest or any stable release. Pull one of the two dedicated images below — matching your CUDA version — to get a runtime with Nemotron3-Ultra support.
Command
3. Model Deployment
This section provides a progressive guide from quick deployment to performance tuning.3.1 Basic Configuration
Interactive Command Generator: select model precision, hardware, tensor parallelism, and common knobs to generate a launch command. The generator only emits a runnable command for combinations that NVIDIA / SGLang have validated. Selecting an unverified tuple (e.g. NVFP4 on H100/H200, BF16 with TP=4 on H100, …) is blocked — the command pane shows an explicit error and the verified support matrix instead of a launch line, so unvalidated commands can’t be copied by accident.3.2 Configuration Tips
- Attention backend: H100/H200: Use flash attention 3 backend by default. B200/GB200/B300/GB300: Use flashinfer backend by default.
-
TP support:
To set tp size, use
--tp <4|8|16>. Recommended pairings:- BF16:
--tp 16on H100/H200,--tp 8on B200/B300 - NVFP4:
--tp 4or--tp 8on B200/B300,--tp 4on GB200/GB300
- BF16:
-
Multi-node BF16 on H100:
The 16×H100 BF16 setup spans two nodes. Use
--dist-init-addr <head-node-ip>:5000 --nnodes 2 --node-rank <0|1>on each node and keep--tp 16. -
DP attention:
By default the attention layers are tensor-parallel (sharded across all TP ranks). Enabling DP attention (the toggle above, or
--dp <N> --enable-dp-attention) instead runs attention asNdata-parallel groups: each DP rank serves its own slice of the requests with its own KV cache.--dpmust divide--tp. -
FP8 KV cache:
To enable fp8 kv cache, please append
--kv-cache-dtype fp8_e4m3. -
Reasoning parser:
Append
--reasoning-parser nemotron_3to enable structured reasoning traces (reasoning_contentfield in the response). -
Tool calling:
Append
--tool-call-parser qwen3_coderto enable tool calling support.
4. Model Invocation
Command
4.1 Basic Usage (OpenAI-Compatible API)
SGLang provides an OpenAI-compatible endpoint. Example with the OpenAI Python client:Example
Output
Example
Output
4.2 Reasoning
The model supports two modes — Reasoning ON (default) vs OFF. This can be toggled by settingenable_thinking to False, as shown below.
Example
Output
4.3 Tool Calling
Call functions using the OpenAI Tools schema and inspect returnedtool_calls. The server must be launched with --tool-call-parser qwen3_coder.
Example
Output
4.4 Controlling Reasoning Budget
Thereasoning_budget parameter allows you to limit the length of the model’s reasoning trace. When the reasoning output reaches the specified token budget, the model will attempt to gracefully end the reasoning at the next newline character.
If no newline is encountered within 500 tokens after reaching the budget threshold, the reasoning trace will be forcibly terminated at reasoning_budget + 500 tokens.
Example
reasoning_budget=256:
Example
Output
5. Benchmark
5.1 Speed Benchmark
Test Environment:- Hardware: GB200 (4x)
- Model: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4
- Tensor Parallelism: 4
- SGLang Version: main branch
- Model Deployment Command:
Command
- Benchmark Command:
Command
- Test Results:
Output
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
Environment- Hardware: GB200 (4x)
- Model: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4
- Tensor Parallelism: 4
- SGLang Version: main branch
Command
Command
Output
5.2.2 MMLU Benchmark
Run BenchmarkCommand
Output
