We’re dropping Gemini Omni: our first step towards a model that can create anything from anything - starting with video.
It combines Gemini’s intelligence with our generative media systems - representing a leap forward in world understanding, multimodality, and editing 🧵
Here is a metaphor for AGI definitions.
Imagine you’re on a long drive from Los Angeles to the Bay Area (for me: undergrad to grad school). From far away, this is unambiguous: the Bay Area is very small relative to the Los Angeles/Bay Area distance. People can and do dispute what exactly “The Bay Area” is (there are many definitions), but no one in LA would say, “I have no idea what direction you are going”.
But now you approach the actual Bay Area. It’s a vague place! The definitional ambiguity starts to ramp up. If you’re in Los Gatos and you say “I’m driving to the Bay Area”, people will have questions.
If we track the conversation as we drive on from LA, the definitional ambiguity and disagreement will ramp up over time. Someone skeptical that “the Bay Area” is a coherent idea might look at the ramp and think, “aha, I was right, people are starting to realize that the concept was incoherent all along”. And indeed, the people with questions are right to ask them: the relative distances have changed, so “But where in the Bay Area?” matters more.
But the definitional ambiguity is ramping up because we’re getting close! Something is about to happen!
I uploaded a screenshot of Google Maps to Gemini Omni with a route drawn on it.
Then I prompted it to create a first-person view of someone driving a taxi cab along the route in the reference image.
Pretty close to the real thing.
Project Genie is a @GoogleLabs experiment that lets you simulate dynamic worlds you can navigate in real time with Genie, our general-purpose world model.
Today, we’re connecting Project Genie to nearly 20 years of Street View data from Google Maps — so you can now build interactive spaces based on real-world locations.
Street View imagery in Project Genie is available now for places in the U.S., and will expand to more locales over time.
#GoogleIO
The true significance of OMNI models is enabling seamless "any-to-any" inputs and outputs. Regardless of your starting assets or your final product, OMNI allows for an unrestricted flow of ideas across mediums. This represents a massive step forward in the evolution of GenAI.
Gemini Omni is a major leap in world understanding & multimodal editing! It can take photos, video & audio and build entirely new scenes. Over time it’ll be able to handle any input & any output - starting w/ video
You can even give it your own videos & iterate on your ideas:
#Omni can explain complex concepts with much better text rendering, and is great for educational videos!
"Make a video of me explaining backpropagation and gradient descent"