Introducing VGGT-Ω: scaling feed-forward reconstruction across static and dynamic scenes, and studying whether the learned geometric representations transfer beyond reconstruction.
🚀 Introducing Articraft, a coding agent for articulated 3D asset creation.
Articraft writes code, executes it, receives validation feedback, and refines the result into simulation-ready 3D assets with parts, joints, and motion.
We’re also releasing Articraft-10K: 10,000+ articulated objects across 250 categories, unlocking large-scale interactive scenes for robotics simulation and physical AI.
🔗 Project page: https://t.co/FWutv61yx7
💻 Code: https://t.co/CpCYdBzMlv
We made an interactive client-server viewer for LagerNVS with @JonathonLuiten!
You can now interactively explore scenes from just a photo capture - no optimization, no 3D Gaussians, just load your images, run the model on a cloud GPU and stream the renders to your local browser.
Check out the video below for some spaces I recently captured in Oxford, London and beyond!
We’re reimagining a 50-year-old interface - the mouse pointer - with AI. 🖱️
These experimental demos show how people can intuitively direct Gemini on their screens using motion, speech, and natural shorthand to get things done 🧵
We scaled up Lyra to generate explorable 3D worlds! 🚀
Introducing Lyra 2.0 — turning a single image into a 3D world you can walk through, look back, and even drop a robot into 🤖
Code and Model available today!
🌐 Website: https://t.co/plBxCoWkNn
(1/N)
Introducing ActionParty: the first video world model that controls up to 7 players simultaneously on the same screen across 46 game environments.
We tackle the action binding problem in video diffusion, ensuring each player's action is applied to the right subject. 🧵
Dropping an exciting new demo of MosaicMem! 👀🔥
A friend brought up a great question:
why not combine long-horizon navigation video generation, promptable world events, and scene concatenation?
Fair point — so we gave it a shot. 🎬✨
For more technical details, check this thread 🧵👇
https://t.co/qyQYwmHsE6
#WorldModel #GenerativeAI #VideoGeneration #InteractiveAI #Genie3 #EmbodiedAI #GameAI
🎉EgoEdit @Snapchat has been accepted to CVPR 2026! 🏆👻
We are bringing high-quality, real-time editing to egocentric videos. Our massive 100k video dataset and benchmark are ALREADY PUBLIC! 🔓🚀
🏠 Project Page: https://t.co/cEUZRxdLDf
🤗 Dataset: https://t.co/qCFRTY8cYG
Excited to share our new work: “Learning to See Before Seeing”! 🧠➡️👀 We investigate an interesting phenomenon: how do LLMs, trained only on text, learn about the visual world?
Project page: https://t.co/9mQt3qnckL
🎉 VMem is officially accepted to ICCV 2025!
Excited to chat with everyone in Hawaii about making video generation consistent and interactive with our Surfel-Indexed View Memory 🏝️🎥
Also, huge thanks to my insanely helpful coauthors!
Excited to share VMem: a novel memory mechanism for consistent video scene generation 🎞️✨
VMem evolves its understanding of scene geometry to retrieve the most relevant past frames, enabling long-term consistency.
🌐 https://t.co/AHBj6j1ecE
🤗 https://t.co/FbUbJHWW4F
1/ 🧵
After two amazing years with @Oxford_VGG, I will be joining @NTUsg as a Nanyang Assistant Professor in Fall 2025!
I’ll be leading the Physical Vision Group (https://t.co/byLxP7FE4a) — and we're hiring for next year!🚀
If you're passionate about vision or AI, get in touch!
🎁 We present Geo4D, a method that repurposes a video diffusion model for monocular 4D reconstruction.
Project page: https://t.co/BPvlH9tDEP
Code repo: https://t.co/i1pSbsKAQu
𝐌𝐚𝐢𝐧 𝐂𝐨𝐧𝐭𝐫𝐢𝐛𝐮𝐭𝐢𝐨𝐧𝐬:
✨ A novel framework, Geo4D, that reconstructs dynamic scenes by building on top of an off-the-shelf video generator.
✨ A multi-modal geometric representation that helps the video diffusion model to learn consistent geometry during training.
✨ A lightweight multi-modal alignment that fuses partially redundant geometric modalities at test time for coherent and robust 4D reconstruction.
✨ Achieved SOTA performance on video depth estimation and comparable performance on camera pose estimation.
Thanks to all co-authors for their invaluable support and contributions. @ChuanxiaZ, Iro Laina, @dlarlus, Andrea Vedaldi
Introducing VGGT (CVPR'25), a feedforward Transformer that directly infers all key 3D attributes from one, a few, or hundreds of images, in seconds! No expensive optimization needed, yet delivers SOTA results for:
✅ Camera Pose Estimation
✅ Multi-view Depth Estimation
✅ Dense Point Cloud Reconstruction
✅ Point Tracking
Project Page: https://t.co/Qoc1ipqozq
Code & Weights: https://t.co/1GkCpRATkE