Human perception is active: we move around to see, and we see with intention. In our latest work "Seeing without Pixels", we find "how you see" (how the camera moves) roughly reveals "what you do" or "what you observe" -- and this connection can be easily learned from data.
Humans learn from unique data -- everyone's OWN life -- but our visual representations eventually align. In our recent work "Unique Lives, Shared World" @GoogleDeepMind, we train models with "single-life" videos from distinct sources, and study their alignment and generalisation.
I’m looking for PhD students in Audio & Video for a Summer 2026 internship at Google DeepMind!
⚠️ Requirement: Prior publication in this area.
To apply, tell me the most critical research gap in AV understanding to see if we are a match! https://t.co/uKQnftKwpJ
A SOTA model on 4D reconstruction from @GoogleDeepMind! Amazing work from @ChuhanZhang5 and the team! It was so satisfying to see these reconstruction results, and I've had a great experience using it.
A SINGLE encoder + decoder for all the 4D tasks!
We release 🎯 D4RT (Dynamic 4D Reconstruction and Tracking).
📍 A simple, unified interface for 3D tracking, depth, and pose
🌟 SOTA results on 4D reconstruction & tracking
🚀 Up to 100x faster pose estimation than prior works
🚀 Glad to share the exciting project — SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass! We explored the generation of 3D scenes with multiple assets from a single image. 🎉 ACCEPTED by 3DV 2026!!!
All resources have been open-sourced and are publicly available!
📄 Paper: https://t.co/51GB5oTMZR
💻 Code: https://t.co/g7Z1VuTEla
🔗 Model: https://t.co/zam2NDL30z
🌐 WebPage: https://t.co/MGlUuLyHd9
#3DVision #AI #GenerativeAI #ComputerVision #3DV2026 #SceneGen
Future AI models will learn predominantly post-deployment, to do the tasks of interest to each user. This will happen throughout an individual “life”. In a new paper https://t.co/BrH9FxBqG0 we lay the groundwork for this type of capability in the wild from a visual standpoint.
Excited to share our latest work! Grateful for the guidance from all my collaborators, and special thanks to Tengda for being such an amazing mentor during my internship @GoogleDeepMind 😊
Human perception is active: we move around to see, and we see with intention. In our latest work "Seeing without Pixels", we find "how you see" (how the camera moves) roughly reveals "what you do" or "what you observe" -- and this connection can be easily learned from data.
Can you tell which action corresponds to which camera trajectory in the video above? Check out our paper for answers! Work done by our great intern Sherry Xue @sherryx90099597 at @GoogleDeepMind, and with Kristen Grauman, @dimadamen and Andrew Zisserman.
https://t.co/ukbMRfAkZk
A belated post for our ACMMM paper: we recognize and track animated characters for movie understanding tasks. Great work from Zhongrui Gui, also with @JunyuXieArthur, @WeidiXie and Andrew Zisserman from @Oxford_VGG.
Project page with code and dataset: https://t.co/G70041InQ8
Animated movies are effortlessly understood by young minds, yet they remain challenging for video-language models. Why? The key problem is the huge diversity of animated characters -- their appearance ranges from human-like faces to cars, fish, blobs, etc.