Our Vision-Language-Action robot demo at #RSS2025 was eye-opening. The ultimate eval for any generalist model: new environment, new objects from the audience, and new instructions.
For the first time it really hit me: what if we've been underestimating what these models can do?
Introducing Veo Robotics!
In this work, we show that an action-conditioned video model can be used as a general robot simulator for evaluation, safety, etc.
https://t.co/CFVvSCZ0GR
🔥Gemini Robotics On-Device is here! A VLA with generalization, instruction following, and fast adaptation similar to our March release, and it now fits on a 4090!
More exciting: we're 🚀 launching an SDK and a model dev service (flywheel) alongside it 🎯 = democratizing model development!
#DeepMind #robotics is collaborating with select Trusted Testers to refine the process. For everyone else, check the 🧵 below for showcase videos.
We're just getting started—all suggestions and guidance are welcome!
https://t.co/Uzdc3isfyY
What was once a dream is now real! ✨
Excited to announce Gemini Robotics On-Device: our VLA model that runs locally and shows impressive performance on 3 robot types. On-device intelligence, no internet needed!
We’re bringing powerful AI directly onto robots with Gemini Robotics On-Device. 🤖
It’s our first vision-language-action model to help make robots faster, highly efficient, and adaptable to new tasks and environments - without needing a constant internet connection. 🧵
✨🤖 Today our team is so excited to bring Gemini 2.0 into the physical world with Gemini Robotics, our most advanced AI models to power the next generation of helpful robots. 🤖✨
Check it out! https://t.co/cRLKmKmcFV
And read our blog: https://t.co/k8NE4tg2Cs
We are looking forward to seeing how robot developers will use these models to continue to advance robot performance with Gemini at the core.
✨New blog post✨: my attempt as a vision researcher at finally understanding RLHF -- a deep dive into PPO & DeepSeek's GRPO!
No hot take, I promise.
https://t.co/cjIgpd7c14
The ultimate test of any physics simulator is its ability to deliver real-world results.
With MuJoCo Playground, we’ve combined the very best: MuJoCo’s rich and thriving ecosystem, massively parallel GPU-accelerated simulation, and real-world results across a diverse range of robot platforms: quadrupeds, humanoids, dexterous hands, and arms.
Best of all? You can get started today with a single command: pip install playground
https://t.co/t6pZCNeOSK
Gemini 2.0 Flash's video understanding is here 🚀
Think: search in videos via timecodes, extract text from moving camera footage, analyze screen recordings in real-time interactions with native audio out 🔊
Come and try it https://t.co/Z9zVQbNBUD 😀
https://t.co/Axa4IVplCo
🧵1/8 So annoying when my 🤖 vacuum cleaner buzzes loudly during my Zoom meeting! Can we teach robots to be aware of their noise levels at home? Introducing ANAVI—a framework that uses indoor visuals to predict sound propagation! 🎶🏠
YouTube is a LARGE dataset of demonstration videos to train Generalist robot agents, but lacks action data.
How can we learn DEXTEROUS skills from them?
In #CoRL2024, we explore the problem of learning a Generalist Piano Playing agent from YouTube videos.
https://t.co/nRRy3hdqkL
We just released PaliGemma-3B, a very capable Vision-Language Model. Do not waste any time, finetune it for your task:
Code: https://t.co/V9wQU7jtmv
Colab: https://t.co/aDGJd7Iz8z
Kaggle: https://t.co/A5ZrnjDZni
HF: https://t.co/Du52eHcXNh
Vertex AI: https://t.co/qxK9Irgera
We just released a big 🎁GIVT update!
📈 Larger models and improved image generation results across the board
💡 Improved GMM formulation and adapter module
💻 Code, model checkpoints, and a colab are now available at https://t.co/zaf5orekfZ
More details below... 1/
What if we could show a robot how to do a task?
We present Vid2Robot, which is a robot policy trained to decode human intent from visual cues and translate it into actions in its environment. 🤖
Website: https://t.co/ufFHK1Dgbg
Arxiv: https://t.co/qEUjaXovJa
🧵(1/n)
Can we train a model to describe different parts of images in varying levels of detail?
Introducing FlexCap, a VLM designed to output localized captions in N words where we can control N with special length tokens.
https://t.co/tDsyHF1AVI
@adityagolatkar2 The model counts implicitly because of special length tokens that we add. If we use the token length_N then the model outputs N words before outputting EOS.
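To make the length-token idea concrete, here is a minimal sketch. It assumes the token is spelled literally as `length_N`, as written above; the helper names `length_token` and `matches_length` and the sample caption are made up for illustration and are not part of any FlexCap release.

```python
def length_token(n: int) -> str:
    """Build the length-control token for a caption of n words.

    Assumption: the token is spelled `length_N`, following the wording above.
    The real vocabulary entry may look different.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return f"length_{n}"


def matches_length(caption: str, n: int) -> bool:
    """Return True if the caption has exactly n whitespace-separated words."""
    return len(caption.split()) == n


# Hypothetical example: request a 3-word caption and check a made-up output.
prefix = length_token(3)
sample_caption = "red ceramic mug"  # illustrative text, not real model output
print(prefix, matches_length(sample_caption, 3))
```

A check like `matches_length` is one simple way to verify that a model conditioned on such a token stops after the requested number of words.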
Project webpage: https://t.co/tDsyHF28Lg
Arxiv: https://t.co/D8BubwWjnl
This is joint work with @viddivj, @JonathanTompson, Andrew Zisserman and @yusufaytar.
FlexCap has been useful for robotics. We used it in AutoRT (https://t.co/04yiJAldCg) to find objects in the robot's environment. It also helped create the dataset used to train SpatialVLM (https://t.co/Zyo3dJ6gV7).