1. Model Introduction
NVIDIA Nemotron 3 Nano Omni is a 30B-parameter hybrid MoE multimodal model that activates only 3B parameters per forward pass, combining vision and audio encoders into a unified architecture. Part of the Nemotron 3 family, it is designed to power multimodal sub-agents that perceive and reason across vision, audio, and language in a single inference loop — eliminating the fragmented stacks of separate models for each modality.
Architecture and key features:
- Hybrid Transformer-Mamba Architecture (MoE): Combines Mixture of Experts with a hybrid Transformer-Mamba architecture for efficient routing and sequence modeling.
- 30B total / 3B active parameters: Delivers strong multimodal accuracy at a fraction of the cost of dense models.
- 1M token context window: Sustains coherent agent state across extended multimodal workflows — screen history, document content, and audio context remain in view without re-ingestion.
- Unified vision and audio encoders: One model replaces fragmented multimodal stacks; vision and audio perception happen in the same forward pass.
- 3D Convolution (Conv3D): Efficient temporal-spatial processing for video inputs.
- Efficient Video Sampling (EVS): Enables longer video processing at the same compute budget via temporal-aware perception and adaptive frame sampling.
- FP8 and NVFP4 quantization: FP8 supports deployment from workstation (RTX 6000, DGX Spark) to cloud (H100, H200, B200, A100, L40S); NVFP4 requires Blackwell hardware.
- 9x higher throughput than other open omni models at the same interactivity level.
- ~20% higher multimodal intelligence compared to the best open alternative.
- Post-trained with multi-environment reinforcement learning via NVIDIA NeMo RL and NeMo Gym across text, image, audio, and video environments, improving instruction following and convergence to correct multimodal answers.
Model variants:
- nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning
- nvidia/Nemotron-3-Nano-Omni-30B-A3B-BF16
- nvidia/Nemotron-3-Nano-Omni-30B-A3B-FP8
- nvidia/Nemotron-3-Nano-Omni-30B-A3B-NVFP4

Use cases:
- Computer Use Agent: Perception loop for agents navigating GUIs — reads screens, understands UI state over time, validates outcomes. Collapses vision and reasoning into a single loop.
- Document Intelligence: Interprets documents, charts, tables, screenshots, and mixed media inputs for enterprise analysis and compliance workflows.
- Audio & Video Understanding Agents: Maintains continuous audio-video context for customer service, research, and monitoring workflows, tying what was said, shown, and documented into a single reasoning stream.
2. SGLang Installation
Install SGLang via pip or from source:
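A minimal sketch of both paths, following SGLang's standard installation instructions; verify the extras and any version pin against the SGLang docs for your release:

```bash
# Option 1: install the latest release from PyPI with all extras
pip install --upgrade "sglang[all]"

# Option 2: install from source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```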
3. Model Deployment
This section provides a progressive guide from quick deployment to performance tuning.
3.1 Basic Configuration
3.2 Configuration Tips
- Attention backend: on H100/H200, use the flash attention 3 backend (the default); on B200, use the flashinfer backend (the default).
- TP support: to set tensor parallelism, use `--tp <1|2|4|8>`. A 4×H100 setup is recommended for the BF16/Reasoning variant.
- FP8 KV cache: to enable the FP8 KV cache, append `--kv-cache-dtype fp8_e4m3`. FP8 KV cache trades a small amount of accuracy for memory; omit the flag if you observe accuracy regressions on your workload.
- Reasoning parser: append `--reasoning-parser deepseek-r1` to enable structured reasoning traces (`reasoning_content` field in the response).
- Tool calling: append `--tool-call-parser qwen3_coder` to enable tool calling support.
4. Model Invocation
The command below launches the server for a 4×H100 setup with reasoning and tool calling enabled. See Section 4.8 for FP8 and NVFP4 variants.
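A launch-command sketch assembled from the flags described in Section 3.2; the host and port are illustrative defaults:

```bash
python3 -m sglang.launch_server \
  --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning \
  --tp 4 \
  --reasoning-parser deepseek-r1 \
  --tool-call-parser qwen3_coder \
  --host 0.0.0.0 \
  --port 30000
```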
4.1 Basic Usage (Text)
SGLang provides an OpenAI-compatible endpoint. Example with the OpenAI Python client:
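A minimal text-only sketch; the base URL assumes the server from Section 4 is listening on port 30000, and the API key is a placeholder since SGLang does not require one by default:

```python
from openai import OpenAI

# Point the client at the local SGLang server (OpenAI-compatible endpoint)
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[{"role": "user", "content": "Explain mixture-of-experts routing in two sentences."}],
)
print(response.choices[0].message.content)
```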
4.2 Image Understanding
Pass image inputs using the OpenAI vision format. Supports both URLs and base64-encoded images:
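A sketch using the OpenAI vision content format; the image URL is a placeholder, and the comment shows the base64 alternative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is shown in this image."},
            # URL input; for base64, pass "data:image/png;base64,<encoded-bytes>" as the url
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```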
4.3 Video Understanding
Nemotron 3 Nano Omni uses Conv3D layers and Efficient Video Sampling (EVS) for temporal-spatial video reasoning, processing longer videos at the same compute budget:
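A sketch that assumes a video_url content part mirroring the image format, as in other OpenAI-compatible servers; verify the exact field name against your SGLang version (the URL is a placeholder):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the key events in this clip."},
            # Assumed content part: video_url, mirroring image_url (placeholder URL)
            {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
        ],
    }],
)
print(response.choices[0].message.content)
```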
4.4 Audio Understanding
Pass audio inputs as base64-encoded WAV or MP3 data:
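A sketch using the OpenAI-style input_audio content part; whether SGLang expects this exact part type for this model is an assumption, and the file path is a placeholder:

```python
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Base64-encode a local WAV file (placeholder path)
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this recording, then summarize it."},
            # Assumed content part: input_audio with base64 data; format "wav" or "mp3"
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(response.choices[0].message.content)
```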
4.5 Mixed Multimodal Input
Combine modalities in a single request. For example, an image alongside an audio question about it:
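A sketch combining an image and a spoken question in one request, reusing the assumed content-part names from Sections 4.2 and 4.4 (URL and path are placeholders):

```python
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

with open("question_about_chart.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[{
        "role": "user",
        "content": [
            # Spoken question about the image that follows
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```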
4.6 Reasoning
The model supports two modes: Reasoning ON (default) and Reasoning OFF. Toggle per request by setting `enable_thinking` to `False`:
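A sketch that passes `enable_thinking` through chat-template kwargs via `extra_body`, a common SGLang convention; confirm the exact kwarg route for this model:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Reasoning OFF for this request only (assumed chat_template_kwargs route)
response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)

# With reasoning ON (the default) and --reasoning-parser deepseek-r1, the trace
# is surfaced separately as response.choices[0].message.reasoning_content
```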
4.7 Tool Calling
Call functions using the OpenAI Tools schema. The server must be launched with `--tool-call-parser qwen3_coder`:
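A sketch with a hypothetical get_weather function defined in the OpenAI Tools schema:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Hypothetical tool definition in the OpenAI Tools schema
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[{"role": "user", "content": "What's the weather in Santa Clara?"}],
    tools=tools,
)
# The parser emits structured tool calls rather than free text
print(response.choices[0].message.tool_calls)
```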
4.8 FP8 and NVFP4 Deployment
FP8 variant (recommended for throughput-critical serving on H100/H200/B200):
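A sketch reusing the Section 4 flags with the FP8 checkpoint; the TP degree is an assumption to adjust for your hardware:

```bash
python3 -m sglang.launch_server \
  --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-FP8 \
  --tp 4 \
  --reasoning-parser deepseek-r1 \
  --tool-call-parser qwen3_coder
```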
NVFP4 variant (requires Blackwell hardware, e.g., B200):
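The same launch shape, swapping in the NVFP4 checkpoint (TP degree again illustrative):

```bash
# NVFP4 requires Blackwell hardware (e.g., B200)
python3 -m sglang.launch_server \
  --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-NVFP4 \
  --tp 4 \
  --reasoning-parser deepseek-r1 \
  --tool-call-parser qwen3_coder
```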
5. Benchmark
5.1 Efficiency Benchmark
Nemotron 3 Nano Omni achieves 9x higher throughput than other open omni models at the same interactivity level, delivering lower cost and better scalability without sacrificing responsiveness. It also achieves ~20% higher multimodal intelligence compared to the best open alternative across image, video, and audio reasoning tasks.
5.2 Speed Benchmark
Test Environment:
- Hardware: H100 (4×)
- Model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning
- Tensor Parallelism: 4
- SGLang Version: main branch
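A sketch of the two steps: launch the server, then drive it with SGLang's bench_serving tool; prompt count and sequence lengths are illustrative:

```bash
# Step 1: launch the server on 4x H100
python3 -m sglang.launch_server \
  --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning \
  --tp 4

# Step 2: run the serving benchmark against it (illustrative workload)
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 200 \
  --random-input-len 1024 \
  --random-output-len 512
```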
5.3 Accuracy Benchmark
5.3.1 GSM8K Benchmark
Environment:
- Hardware: H100 (4×)
- Model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning
- Tensor Parallelism: 4
- SGLang Version: main branch
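A sketch using the few-shot GSM8K evaluation script that ships with SGLang; the question count is illustrative:

```bash
# Step 1: launch the server (same flags as the speed benchmark)
python3 -m sglang.launch_server \
  --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning \
  --tp 4

# Step 2: run the few-shot GSM8K evaluation against the running server
python3 -m sglang.test.few_shot_gsm8k --num-questions 200
```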
5.3.2 MMLU Benchmark
Run Benchmark:
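A sketch based on the MMLU benchmark scripts in the SGLang repository; verify the script name and flags against the checkout you are using:

```bash
# From an SGLang source checkout, with the server from Section 4 already running
cd sglang/benchmark/mmlu
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar && tar xf data.tar
python3 bench_sglang.py --nsub 57   # evaluate all 57 MMLU subjects
```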
