open multimodal model with 30B params, 3B active, for edge AI agents


TL;DR

Nvidia released Nemotron 3 Nano Omni, an open-weight multimodal model that unifies vision, audio, and language in a single architecture with 30B parameters but only 3B active per inference. It claims 9x throughput over comparable open models and tops six benchmarks. Available under Nvidia’s Open Model Agreement for commercial use, it targets edge AI agent deployment on single GPUs, making Nvidia a competitor not just in AI infrastructure but in the models that run on it.

Nvidia released Nemotron 3 Nano Omni on Tuesday, an open-weight multimodal AI model that unifies vision, audio, and language understanding in a single architecture designed to power autonomous AI agents on edge devices. The model has 30 billion parameters but activates only three billion per forward pass through a mixture-of-experts design, a ratio that allows it to run on a single GPU while matching or exceeding the multimodal capabilities of models several times its size. Nvidia claims nine times higher throughput than comparable open multimodal models with equivalent interactivity, 2.9 times faster single-stream reasoning on multimodal tasks, and roughly nine times greater effective system capacity for video reasoning. The model tops six benchmarks across document intelligence, video understanding, and audio comprehension. It processes text, images, audio, video, documents, charts, and graphical interfaces as inputs and produces text as output, meaning a single model can replace the patchwork of specialised vision, speech, and document-processing models that most enterprise AI deployments currently stitch together. The release, available on Hugging Face under Nvidia’s Open Model Agreement with full commercial use rights, represents the most aggressive move yet by the company that sells the infrastructure for AI into the market for the AI itself.

The architecture

Nemotron 3 Nano Omni uses a hybrid Mamba-Transformer architecture with 23 Mamba-2 selective state-space layers, 23 mixture-of-experts layers with 128 experts routing to six per token plus a shared expert, and six grouped-query attention layers. The vision encoder, C-RADIOv4-H, handles variable-resolution images with 16-by-16 patches scaling from 1,024 to 13,312 visual patches per image. The audio encoder, Parakeet-TDT-0.6B-v2, processes speech and environmental audio. Video processing uses three-dimensional convolutions to capture motion between frames rather than treating video as a sequence of still images. The base text model was pretrained on 25 trillion tokens and supports a 256,000-token context window. The architectural choices reflect a specific design philosophy: maximise capability per active parameter rather than total parameters, because edge deployment is constrained not by model size at rest but by compute per inference step. The three-billion active parameters at inference mean the model can run on hardware announced at Nvidia’s GTC 2026 developer conference, including the DGX Spark and DGX Station workstations, without requiring the multi-GPU clusters that power larger models in data centres.
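The figures quoted above can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes each visual patch covers exactly 16-by-16 input pixels, that the three layer counts quoted are the complete stack, and that "30 billion" and "three billion" are round numbers. None of those assumptions is confirmed by Nvidia, and the helper name is made up for the example.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the article.
# Assumption: one visual patch covers exactly 16 x 16 input pixels.
PATCH_SIDE = 16
MIN_PATCHES = 1_024
MAX_PATCHES = 13_312


def pixels_covered(num_patches: int, patch_side: int = PATCH_SIDE) -> int:
    """Total input pixels represented by a given number of square patches."""
    return num_patches * patch_side * patch_side


min_pixels = pixels_covered(MIN_PATCHES)  # roughly a 512 x 512 image
max_pixels = pixels_covered(MAX_PATCHES)
patch_range_ratio = MAX_PATCHES // MIN_PATCHES

# Assumption: the three layer types listed are the whole stack.
total_layers = 23 + 23 + 6  # Mamba-2 + mixture-of-experts + attention

# Assumption: parameter counts are the rounded figures from the article.
active_fraction = 3_000_000_000 / 30_000_000_000

print(f"pixels at minimum resolution: {min_pixels:,}")
print(f"pixels at maximum resolution: {max_pixels:,}")
print(f"patch budget grows by a factor of {patch_range_ratio}")
print(f"layers listed: {total_layers}")
print(f"share of parameters active per token: {active_fraction:.0%}")
```

Under these assumptions, the largest image costs 13 times as many visual patches as the smallest, while only about a tenth of the weights take part in any single forward pass.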

The mixture-of-experts approach is not new, but its application to a multimodal model at this scale is. Most open multimodal models either use a single dense architecture, which requires all parameters to be active on every inference step, or use separate specialist models stitched together in a pipeline, which introduces latency at each handoff. Nemotron 3 Nano Omni does neither. It routes each token to six of 128 experts within a unified model, meaning vision tokens, audio tokens, and text tokens all flow through the same architecture but activate different expertise depending on the modality. The result is a model that can process a video feed, a spoken instruction, and a document simultaneously without the inter-model latency that makes pipeline architectures unsuitable for real-time agent applications. For enterprise deployments, this collapses the operational complexity of maintaining separate vision, speech, and language models with separate inference endpoints, monitoring, and versioning into a single model serving a single endpoint.
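The routing described above, six of 128 experts per token plus a shared expert, can be sketched in a few lines. This is a toy illustration rather than Nvidia's implementation: the article does not say how the router scores experts or normalises their weights, so the softmax over the selected scores is an assumption (it is a common choice), and the scalar "experts" and the names `route_token` and `moe_layer` are hypothetical stand-ins for real feed-forward blocks.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 6          # experts selected per token, from the article


def route_token(router_logits):
    """Pick the TOP_K highest-scoring experts and give each a mixing weight.

    Assumption: weights are a softmax over the selected scores only.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:TOP_K]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(token, router_logits, experts, shared_expert):
    """Combine the always-on shared expert with the routed experts."""
    output = shared_expert(token)
    for index, weight in route_token(router_logits):
        output += weight * experts[index](token)
    return output


# Toy setup: each "expert" is a scalar function, not a neural network.
rng = random.Random(0)
experts = [lambda x, scale=i + 1: x * scale for i in range(NUM_EXPERTS)]
shared_expert = lambda x: x
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

routed = route_token(logits)
result = moe_layer(1.0, logits, experts, shared_expert)
print(f"{len(routed)} of {NUM_EXPERTS} experts ran for this token")
```

Only the six selected experts and the shared expert are evaluated for the token; the other 122 are skipped, which is where the gap between total and active parameters comes from.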

The strategy

Nvidia has spent the AI boom selling infrastructure: GPUs, networking, and the CUDA software ecosystem that locks developers into its hardware. The Nemotron model family, which has been downloaded more than 50 million times in the past year, represents a parallel strategy in which Nvidia also provides the models that run on that infrastructure. The logic is circular but powerful: Nvidia’s models are optimised for Nvidia’s hardware, and Nvidia’s hardware is optimised for Nvidia’s models, creating a full-stack ecosystem that competes with the model-plus-cloud offerings from Google, Amazon, and Microsoft. The case for small, domain-specific language models has been made across education, healthcare, and enterprise, and Nemotron 3 Nano Omni extends that argument to multimodal applications: rather than calling a massive cloud model for every vision or audio task, enterprises can run a compact model locally that handles the full perceptual stack.

Early enterprise adopters include Foxconn, Palantir, Aible, ASI, Eka Care, and H Company, with Dell, DocuSign, Infosys, Oracle, and Zefr evaluating the model for production deployment. The use cases (factory-floor visual inspection, document processing, voice agent applications, and screen understanding for computer-use agents) reflect the market Nvidia is targeting: not consumer AI assistants but industrial AI agents that need to see, hear, and read in real time on local hardware. The model is available as an Nvidia NIM microservice, through Amazon SageMaker JumpStart, and on OpenRouter, with deployment options including vLLM, SGLang, Ollama, llama.cpp, and TensorRT-LLM. The breadth of deployment options is itself a competitive statement: Nvidia is making the model runnable everywhere, on every framework, to maximise adoption and deepen the dependency on Nvidia’s broader ecosystem.

The competition

Open-source AI models designed for agentic reasoning are arriving from multiple directions simultaneously. DeepSeek’s V4-Pro and V4-Flash, released last week, use a hybrid attention architecture optimised for long-horizon agentic tasks. Meta’s Llama models dominate the open-weight text space. Google’s Gemini models handle multimodal tasks at cloud scale. OpenAI’s GPT models remain the commercial benchmark. What distinguishes Nemotron 3 Nano Omni is not any single capability but the combination: multimodal perception across vision, audio, and text in a single model, with mixture-of-experts efficiency that enables edge deployment, released as open weights with commercial licensing. No other model currently offers all four properties together. The closest comparators, Google’s Gemini Nano for on-device and Meta’s Llama for open weights, each lack at least one element: Gemini Nano is not open-weight, and Llama’s multimodal capabilities do not include audio processing in a unified architecture.

The competitive implications extend beyond the model itself. If Nvidia’s open models become the default for edge AI agent deployment, the company captures value at every layer of the stack: the GPU that runs inference, the software framework that optimises it, and now the model itself. Competitors who build on Nvidia’s models deepen their dependency on Nvidia’s hardware. Competitors who build their own models still need Nvidia’s GPUs to train them. The agentic AI era is accelerating across the industry, and Nvidia’s strategy is to be indispensable at every layer rather than dominant at one. Nemotron 3 Nano Omni is not Nvidia’s answer to GPT-4o. It is Nvidia’s argument that the future of AI agents will be built on small, efficient, open models running on Nvidia hardware at the edge, rather than large, proprietary models running on someone else’s cloud. Whether that argument holds depends on whether the enterprises building the next generation of autonomous systems prefer local control over cloud convenience, and whether a model with three billion active parameters can do the work that currently requires models with hundreds of billions. The benchmarks say it can. The market will decide whether the benchmarks are right.




The Whoop is one of the devices that Google’s rumored screenless health tracker would compete with.

Nina Raemont/ZDNET


ZDNET’s key takeaways 

  • Google is poised to unveil a Whoop dupe soon. 
  • Steph Curry teased a screenless health band on his Instagram. 
  • Here’s what I’d like to see from a Google fitness band. 

Could Google’s latest fitness tracker return to its original, screenless Fitbit form? All signs say yes. Google has teased a screenless, Whoop-adjacent health tracker with the help of basketball star Steph Curry. A recent Instagram post from Curry shows him wearing a screenless, fabric band around his wrist, and the accompanying caption promotes “a new relationship with your health.” 

There are scant confirmed details on this next device, but rumors suggest the band will be called “Fitbit Air.” 

Why a screenless fitness band? And why now? Google’s new device could draw interest away from popular fitness brand Whoop. Whoop’s fitness band is on the more luxurious end of the health wearables spectrum. The company offers three subscription tiers, priced at $199, $239, and $359 annually. Google’s device, on the other hand, is rumored to be more affordable, with the option to upgrade to Fitbit Premium.

Google has the opportunity to make an accessibly priced fitness band with the rumored Fitbit Air and breathe new life into its older Fitbit product lineup, which hasn’t been updated in years. 

What I’m expecting 

Here’s what I expect to see and what I hope Google prioritizes in this new health tracker.

Given Fitbit’s bare-bones approach to fitness tracking, I assume Google will emphasize an affordable, accessible fitness band with the Fitbit Air. Most Fitbit products cost between $130 and $230, so I’m expecting this band to be on the lower end of that price range. I’d also expect Fitbit to give users a free trial of Fitbit Premium. 

A long, long, long battery life 

A smartwatch with a bright screen and integrations with an accompanying smartphone consumes a lot of power. That’s why some of the best smartwatches on the market have a middling battery life of one to two days, tops. 

A fitness band, on the other hand, is screenless. That makes the battery potential on this Fitbit Air double — or even triple — that of Google’s smartwatches.

The Fitbit Inspire 3 has around 10 days of battery life — with a watch display. I hope the screenless Fitbit Air has at least 10 days of battery life, plus some change. Two weeks of battery life would be splendid. 

In addition to usage time, I also hope that a screenless fitness tracker addresses some of the issues Fitbit Inspire users have complained about. Many Inspire users report that the device’s screen died after a year of use. They could still access data through the app, but the screen was dysfunctional. Despite being a more affordable Google health tracker, the Fitbit Air should last users for a few years without any hardware issues — or at least I hope it does. 

Fitbit’s classically accurate heart rate measurements 

As Google’s Performance Advisor and the athlete teasing Google’s next device, Steph Curry is sending the message that this new device, one that offers wearers “a new relationship with your health,” will be built for athletes and exercise enthusiasts. I hope this device homes in on accurate heart rate measurements and advanced sensing, as other Fitbit devices do. 

I hope the insights the Fitbit Air provides are, like Whoop’s, performance- and recovery-driven. Whoop grew in popularity for exactly this reason. Not only do Whoop users get their sleep and recovery scores, but they also see, through graphs and health data illustrations, how their daily exercise exertion, strain, and sleep interact with and inform each other.

I’m assuming that Fitbit Premium, with its AI-powered health coach and revamped app design, may do a lot of the heavy lifting for sleep and recovery insights with this new product. 

But I also hope Google adds a few features on the app’s home screen that specifically target athletic strain and recovery, beyond the steps, sleep, readiness, and weekly exercise percentage already available on the Fitbit app’s main screen. 

Lots of customizable, distinct bands 

I hope the Fitbit Air is cheap, and the accompanying bands even cheaper. If the rumors of affordability are true, I’d hope Fitbit sells bands for the device that match users’ styles and color preferences at a similarly affordable and accessible price point. Curry wears a gray-orange band in his teaser. I hope the colorways for this device are bold, patterned, and easily distinguishable from rival fitness bands.




