
The first model in DeepMind’s new Omni family will generate and edit video from any combination of image, audio, video, and text inputs. Speech-editing is being withheld; SynthID watermarking is on by default.
Google introduced Gemini Omni on Tuesday at its I/O 2026 developer conference. The new multimodal model family from Google DeepMind is designed to generate and edit video from any combination of image, audio, video, and text inputs.
The first model in the family, Gemini Omni Flash, started rolling out the same day to the Gemini app and Google Flow for Google AI Plus, Pro, and Ultra subscribers, and to YouTube Shorts and the YouTube Create app at no cost. API access for developers and enterprise customers will follow in the coming weeks.
The product framing, from Koray Kavukcuoglu, CTO of Google DeepMind and Chief AI Architect at Google, is that Omni ‘combines images, audio, video, and text as input and generates high-quality videos grounded in Gemini’s real-world knowledge.’ Inputs can be mixed in a single prompt.
Edits are made conversationally, with each instruction building on the previous one, so that characters, physics, and scene context persist across turns. Output modalities beyond video, including image and audio generation, are ‘coming in time,’ Kavukcuoglu wrote on the company’s blog.
Omni’s positioning, according to the published materials, rests on three claims. First, the model has an improved intuitive understanding of physical forces, including gravity, kinetic energy, and fluid dynamics, allowing it to generate scenes with more accurate physics.
Second, it draws on Gemini’s existing world knowledge to connect language, imagery, and meaning beyond pattern-matching, with the company demonstrating prompts that range from claymation protein-folding explainers to chain-reaction physics tracks. Third, the conversational-editing layer preserves consistency across multi-turn revisions, where prior video models have tended to drift on character identity or scene continuity.
The release also extends the Omni family to digital-avatar generation. Avatars let users record their own voice and likeness to create videos that look and sound like them. Onboarding requires users to record themselves and speak a series of numbers aloud.
Beyond avatars, Google is explicitly withholding general-purpose audio and speech editing inside Omni for now. ‘We are still working to test this and better understand how we can bring this capability to users responsibly,’ Kavukcuoglu wrote, in a paragraph that third-party coverage has read as a deliberate step back from the deepfake-adjacent territory of consent-free voice editing.
All videos generated with Omni will carry Google’s SynthID imperceptible digital watermark by default. Users can verify whether a clip was generated by Omni through the Gemini app, Gemini in Chrome and Google Search, the company said.
The SynthID layer is the same watermarking infrastructure OpenAI adopted earlier this year under the C2PA open standard, and is now positioned as the cross-industry default for AI-generated visual provenance.
Among the disclosed initial limits, Flash-tier clips are capped at 10 seconds at launch, a deployment decision rather than a model constraint. The cap is shorter than OpenAI’s Sora maximum of 60 seconds; Sora’s architecture, which tokenises video into spatiotemporal patches, is the closest published frontier-model comparison.
Google has not disclosed the per-clip cost structure, the compute footprint per generation, or the benchmark suite it used to evaluate Omni against Veo 3 or third-party models such as ByteDance’s Seedance.
Omni is the headline model in a wider I/O 2026 announcement that also included Gemini 3.5 and what Sundar Pichai called the ‘agentic Gemini era’ in his keynote post. The strategic question for the model, judging by the announcement and early analyst reads, is whether the multi-input conversational editing flow is genuinely a new product category or a tighter integration of capabilities the broader frontier-video field has already demonstrated.
The next visible proof point will be the API rollout to developers and enterprise customers in the coming weeks, where the cost structure and the upper bound on clip length under paid tiers will become public.
What Google has not yet disclosed: the underlying Omni model architecture relative to Veo 3, the per-generation compute footprint, pricing for clips beyond the Flash tier, benchmark scores against DeepMind’s own prior video models and competing frontier offerings, and the timeline for general-purpose audio and speech editing inside the Omni family.
The avatar onboarding process and SynthID enforcement are, according to the announcement, the company’s formal answer to the consent-and-provenance questions the launch invites.
