After a fantastic year at Reve AI, I’ve rejoined Google DeepMind to continue working on VEO. I was deeply involved in its early days, but the rapid progress from VEO 1 to VEO 3 in just one year has truly amazed me. It’s a testament to what can happen when you combine compute, brilliant minds, and a healthy dose of the "bitter lesson."
@YGandelsman @zeeshanp_ Yes, true: diffusion will require more flops in that setting, while AR requires a lot more memory bandwidth. I think diffusion being flops-heavy could even be an advantage, since it is still much faster.
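A rough back-of-envelope sketch of the flops vs. memory bandwidth tradeoff, not a measurement. It assumes a dense transformer at batch size 1, about 2 flops per parameter per token, and counts only weight reads (attention flops and KV-cache traffic are ignored). The names `Cost`, `ar_cost`, and `diffusion_cost` are hypothetical, made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Cost:
    flops: float         # total floating-point operations
    weight_bytes: float  # total bytes of weights streamed from memory

    @property
    def intensity(self) -> float:
        """Flops per byte of weights read (higher means more compute-bound)."""
        return self.flops / self.weight_bytes


def ar_cost(params: int, tokens: int, bytes_per_param: int = 2) -> Cost:
    # Assumption: AR decoding runs one forward pass per generated token,
    # and every pass streams the full set of weights.
    return Cost(
        flops=2 * params * tokens,
        weight_bytes=params * bytes_per_param * tokens,
    )


def diffusion_cost(params: int, tokens: int, steps: int,
                   bytes_per_param: int = 2) -> Cost:
    # Assumption: each denoising step processes the whole canvas of
    # `tokens` positions in parallel and streams the weights once.
    return Cost(
        flops=2 * params * tokens * steps,
        weight_bytes=params * bytes_per_param * steps,
    )


if __name__ == "__main__":
    ar = ar_cost(params=10**9, tokens=1024)
    diff = diffusion_cost(params=10**9, tokens=1024, steps=32)
    print("flops ratio (diffusion / AR):", diff.flops / ar.flops)
    print("weight-read ratio (AR / diffusion):", ar.weight_bytes / diff.weight_bytes)
    print("intensity AR vs diffusion:", ar.intensity, diff.intensity)
```

Under these assumptions, with 10^9 parameters, 1024 tokens, and 32 denoising steps, diffusion spends 32x the flops of AR but reads the weights 32x less often, which is the sense in which being flops-heavy can still come out faster on bandwidth-limited hardware.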
@zeeshanp_ And to be fair to the authors, they explicitly state the following, which is easy to miss if you only read the headlines: "Our theorem on diffusion models applies only when the output dimension remains fixed. If it grows with problem size, the result may not hold"
This paper seems to be flawed. The authors argued that increasing denoising steps to increase sequential flops is not enough to solve a certain class of problems, but missed that we can also increase the canvas size and keep denoising steps fixed. If we give both AR and diffusion the same "token" budget, they can solve the same problems.
We’re dropping Gemini Omni: our first step towards a model that can create anything from anything - starting with video.
It combines Gemini’s intelligence with our generative media systems - representing a leap forward in world understanding, multimodality, and editing 🧵
@harshbhatt7585 A better version of this is already done via distillation. The key is how probabilities are approximated, and a teacher LLM can do it very effectively.
Naive patchify is what ViTs did as well. At the time people were surprised that it worked as well as it did with absolutely no overlapping windows or similar conv-based encoders. But at the end of the day we don't need non-local tokens; the interactions between them can easily be modeled with attention in the transformer backbone.
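For reference, a minimal NumPy sketch of what "naive patchify" means here: cut the image into non-overlapping squares and flatten each one into a token. The function names `patchify` and `unpatchify` and the (H, W, C) layout are assumptions for illustration, not any particular model's code.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patch tokens.

    Returns an array of shape (N, patch * patch * C), where
    N = (H // patch) * (W // patch), in row-major patch order.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # -> (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def unpatchify(tokens: np.ndarray, patch: int, h: int, w: int, c: int) -> np.ndarray:
    """Inverse of patchify: rebuild the (H, W, C) image from its tokens."""
    gh, gw = h // patch, w // patch
    x = tokens.reshape(gh, gw, patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # -> (gh, patch, gw, patch, c)
    return x.reshape(h, w, c)


if __name__ == "__main__":
    img = np.arange(8 * 8 * 3).reshape(8, 8, 3)
    tokens = patchify(img, patch=4)
    print(tokens.shape)
```

Each token only sees its own square; there is no overlap and no convolution, so any mixing between patches is left to attention in the backbone.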
My first blog post in over a year is a deep dive on flow maps🗺️, or how to learn the integral of a diffusion model to enable faster sampling and several other cool tricks.
It's the longest one yet👀 Let me know what you think!
https://t.co/O8bBGZ9qjC
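A toy sketch of the idea, not taken from the blog post: a flow map sends a point at time s directly to where the ODE would carry it at time t, so one call replaces many small solver steps. Here the velocity field dx/dt = -x is an assumed stand-in for a learned diffusion ODE, chosen because its integral has a closed form. The names `velocity`, `euler_sample`, and `flow_map` are hypothetical.

```python
import math


def velocity(x: float, t: float) -> float:
    # Assumed toy velocity field dx/dt = -x (stand-in for a learned model).
    return -x


def euler_sample(x: float, s: float, t: float, steps: int) -> float:
    """Integrate the ODE from time s to time t with `steps` Euler steps."""
    dt = (t - s) / steps
    for i in range(steps):
        x = x + dt * velocity(x, s + i * dt)
    return x


def flow_map(x: float, s: float, t: float) -> float:
    """Exact integral of the toy ODE: jumps from time s to time t in one call."""
    return x * math.exp(-(t - s))


if __name__ == "__main__":
    x0 = 1.0
    print("1 Euler step:     ", euler_sample(x0, 0.0, 1.0, steps=1))
    print("1000 Euler steps: ", euler_sample(x0, 0.0, 1.0, steps=1000))
    print("flow map, 1 call: ", flow_map(x0, 0.0, 1.0))
```

In this toy case a single Euler step is badly off, while the flow map matches the many-step solution in one evaluation. A learned flow map has to approximate that jump for a velocity field with no closed-form integral.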