Google tests Gemini Omni, a 12M-token multimodal video model
A new multimodal model combining text, image, audio, and video generation appeared in Gemini's UI this week, with early testers reporting stronger prompt adherence and improved audio quality.

Google is quietly testing a new multimodal model called Gemini Omni that merges text, image, audio, and video generation into a single system. This week, select Gemini app users received an in-app invitation: "Meet our new video model. Remix your videos, edit directly in chat, try a template, and more." Early testers report noticeably stronger prompt adherence and significantly improved audio generation compared with prior versions, though it remains unclear whether this is a final release or a preview build. The model surfaces in the video-creation tab of Gemini's UI and may replace or run alongside Google Veo 3.1.
Omni is designed to function as an agentic system that automatically selects the appropriate format and model for a given task. The context window reportedly exceeds 12 million tokens, a substantial jump from the one- to two-million-token windows of current Gemini models, and Google has introduced a new "usage limits" tab in the app, signaling that the model will consume tokens more intensively than prior releases. Early reports also indicate heavy content filtering: the model blocks direct references to public figures, though indirect descriptions (e.g., "a mature African-American man in his 50s and his friend at a seaside restaurant") still generate video successfully.
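Google hasn't explained how that automatic selection works. As an illustration only, the sketch below shows one common way an agentic dispatcher can pick an output modality per request; every name in it (the Modality enum, classify_request, route_request, the generator labels) is a hypothetical stand-in rather than anything confirmed about Gemini Omni, and a real system would classify with a model call instead of keyword matching.

```python
from enum import Enum

class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"

# Hypothetical illustration: these generator labels and this routing
# logic are not from Google; they sketch the general pattern of an
# agentic dispatcher that picks an output format per request.
GENERATORS = {
    Modality.TEXT: "text-generator",
    Modality.IMAGE: "image-generator",
    Modality.AUDIO: "audio-generator",
    Modality.VIDEO: "video-generator",
}

def classify_request(prompt: str) -> Modality:
    """Toy classifier; a production system would use a model call here."""
    keywords = {
        Modality.VIDEO: ("video", "clip", "remix", "animate"),
        Modality.AUDIO: ("audio", "voice", "song", "narration"),
        Modality.IMAGE: ("image", "picture", "photo", "illustration"),
    }
    lowered = prompt.lower()
    for modality, words in keywords.items():
        if any(word in lowered for word in words):
            return modality
    return Modality.TEXT  # default when no media keyword matches

def route_request(prompt: str) -> str:
    """Dispatch the prompt to the generator for its detected modality."""
    return GENERATORS[classify_request(prompt)]

print(route_request("Remix my video with upbeat narration"))  # video-generator
```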
Google has not issued an official announcement, and the rollout remains limited to a small subset of users. Whether Omni is a standalone product, a Veo successor, or a unified Gemini backend remains unconfirmed. The key questions ahead: when broader access will roll out, how token consumption compares with prior models, and how Google will position Omni against its existing video and image generation tools.