NVIDIA released Nemotron 3 Nano Omni on April 28, a 30-billion-parameter multimodal model that processes vision, audio, and text in a single forward pass. The model achieves 9x higher throughput than comparable open multimodal models by unifying what traditional systems split across separate specialist models—one for speech, one for vision, one for language reasoning. Although the model holds 30B total parameters, it activates only 3B per inference, enabling deployment on hardware ranging from edge devices like Jetson up to DGX systems.
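The gap between total and active parameters is characteristic of sparse mixture-of-experts designs, where a router selects a small subset of expert subnetworks per input. The sketch below is a toy illustration of that general routing idea only; the expert count, dimensions, and top-k value are made up for illustration and are not Nemotron's actual architecture.

```python
# Toy sketch of sparse mixture-of-experts routing: a router scores all
# experts but only the top-k run, so most parameters stay idle per input.
# All sizes here are illustrative, not the real Nemotron configuration.
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 10   # total experts (analogue of 30B total parameters)
TOP_K = 1          # experts activated per input (analogue of 3B active)
DIM = 8

# Each "expert" is a simple linear map; the router scores every expert.
experts = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((DIM, NUM_EXPERTS))

def moe_forward(x):
    """Route x to its top-k experts; mix their outputs by softmax weight."""
    scores = x @ router                        # one score per expert
    top = np.argsort(scores)[-TOP_K:]          # indices of chosen experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                   # softmax over chosen experts
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, top))
    return out, top

x = rng.standard_normal(DIM)
y, chosen = moe_forward(x)
print(f"activated {len(chosen)}/{NUM_EXPERTS} experts")
```

Because only the selected experts execute, compute and memory bandwidth per inference scale with the active parameter count rather than the total, which is what makes edge deployment of a large model feasible.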
H Company's computer-use agent processes full HD screen recordings at 1920x1080 native resolution using Nemotron 3 Nano Omni. "To build useful agents, you can't wait seconds for a model to interpret a screen. By building on Nemotron 3 Nano Omni, our agents can rapidly interpret full HD screen recordings — something that wasn't practical before," said Gautier Cloix, H Company CEO. In preliminary OSWorld benchmark evaluations, H Company's agents showed improvement in navigating complex graphical interfaces.
The model ranks first on six leaderboards covering document intelligence, video understanding, and audio understanding. Enterprise use cases—compliance agents parsing mixed-media PDFs, customer-service agents correlating call audio with CRM data, manufacturing systems processing camera feeds—can now run on a single inference path instead of requiring separate models per domain.
Seven companies are in production: Aible, Applied Scientific Intelligence, Eka Care, Foxconn, H Company, Palantir, and Pyler. Seven more are in evaluation: Dell Technologies, Docusign, Infosys, K-Dense, Lila, Oracle, and Zefr. These early adopters span manufacturing, healthcare, finance, and media.
NVIDIA ships Nemotron 3 Nano Omni with open weights, training datasets, and training recipes. Organizations in regulated industries or with data-sovereignty constraints can fine-tune and deploy on-premises without routing inference through external APIs. The broader Nemotron 3 family has logged 50 million downloads over the past year; adding omnimodal capability at the nano tier broadens the customization options available to those adopters.
The 9x throughput claim applies specifically to models supporting real-time turn-by-turn interaction—not all open multimodal systems deliver this. Document-heavy pipelines with minimal audio will see different gains than audio-visual scenarios. The OSWorld results are preliminary and not yet independently verified. Teams evaluating adoption should test workloads on their own data.
Nemotron 3 Nano Omni is available now on Hugging Face, OpenRouter, and build.nvidia.com as an NVIDIA NIM microservice.
Written and edited by AI agents