What if VLMs could imagine before answering?
IPT supervises visual intermediate states for spatial reasoning:
1. Path tracing → side view
2. Perspective taking → new viewpoint
3. Multiview counting → top-down map
Paper: https://t.co/57KvrXgPFv
What if VLMs could imagine visually before answering spatial questions?
New paper: Imaginative Perception Tokens (IPT) teach multimodal LMs to reason about hidden 3D structure — without generating images at inference time.
Paper: https://t.co/57KvrXgPFv
Picture your living room. If you sat on the sofa, would the TV be on your right or left? You didn't reason in words, you placed yourself in the scene. Imagining in visual space, not text. Exactly what VLMs can't do. Our new paper tackles this with Imaginative Perception Tokens (IPT) 🧵
Given the strong community adoption and real-world deployment of MolmoAct2 on YAM, we're introducing zero-shot evaluation of MolmoAct2 Bimanual YAM in simulation. Now you can test out our models without a real-world robot and build on them!
Code: https://t.co/ooIbp8BRf9
Simulation built on ManiSkill!
Thrilled to share that VLS received the 🥳 Outstanding Paper Award at the CVPR 2026 Foundation Models Meet Embodied Agents Workshop and the 🏅 Best Paper Runner-Up at the CVPR 2026 3D-LLM/VLA Workshop!
Huge thanks to @liu_shuo42927 for presenting at CVPR, and to @YiqingXu6 @DJiafei @RanjayKrishna for their guidance and support throughout! 🙌
It is truly the right time to work on vision-language steering for embodied agents 🚀🚀 #CVPR2026 🏆
And that's a wrap on a fantastic ICRA 2026! 🎉 Incredible run for MolmoBot — clean sweep on workshops, winning Best Paper at all three we entered: Synthetic Data for Robot Learning, Beyond Teleoperation, and VLA Pipelines. 🤖
MolmoAct2 Deployed at CVPR
Very cool to watch @RanjayKrishna's talk together with his model stacking the cups into a tower
We should have more live demos like this in the future
The One RING 🪐 was presented as an Oral at #ICRA2026 in Vienna! 🎉
I couldn’t make it to Vienna this time, but huge thanks to @rosemhendrix for presenting our work on my behalf ❤️🤖
VideoNet will appear as @CVPR Highlight✨ + 3 workshops TODAY!
Multimodal AI is improving fast, but can it tell apart moves only a domain expert could name?✒️🪀
You probably got the hang of it from the clips below 😉. Can models do the same with few-shot examples?
Come find out 👇
🔗 Website: https://t.co/Y9AOF3Rid0
📍 Poster: Fri, Jun 5, 4:00–6:00 PM
🗓️ Workshops (all today, Jun 4):
- KnowledgeMR — 🏆Best Paper Award candidate, talk by @tanushyy
- CVSports
- VidLLMs
Our paper MolmoB0T won top honor at the SDRL workshop today! Congrats to my co-authors @ab_deshpande, Maya, Snehal, @RanjayKrishna, @shahdhruv_ (plus others, you know how it goes), and we'll see you as well at the VLA Pipelines and Beyond Teleoperation workshops on Friday 🚀
#CVPR2026 @cvpr If you're interested in the intersection of multimodal and spatial intelligence, join our ✨MUSI workshop✨ on June 3 (Wed)!
We’re bringing together an amazing lineup of speakers to discuss the latest and most exciting topics in multimodal spatial intelligence🧠
The 5th Transformers for Vision and Multimodal AI workshop is happening at #CVPR2026 tomorrow (Wednesday, June 3rd)! We've got a great speaker lineup covering diverse topics across Transformers and Multimodal AI.
When: Wed, June 3rd
Where: Room 607
Website: https://t.co/SD892nEr8z
Schedule:
1:50 - 2:00 Opening Remarks
2:00 - 2:30 Ranjay Krishna
2:30 - 3:00 Jiatao Gu
3:00 - 3:30 Sherry Yang
3:30 - 4:00 Coffee Break
4:00 - 4:30 Juan Carlos Niebles
4:30 - 5:00 Zhuang Liu
5:00 - 5:30 Peter Tong
See you all tomorrow!
@thoma_gu @RanjayKrishna @sherryyangML @jcniebles @liuzhuang1234 @TongPetersb
Really blows me away how many people are seeing the power and impact of open-science models like MolmoAct2 being deployed out of the box, without any fine-tuning. That’s exactly the future I envision for robotics foundation models.
Today’s release:
I open-sourced our MolmoAct2 evaluation stacks for DROID and the YAM arm, including both the policy server and inference stack! Now it’s easy to test frontier bimanual robot foundation models with it. Check it out!
https://t.co/xVNsjJ1aBa
https://t.co/KVPkOzAkak