We are excited to announce the 2026 cohort of RSS Pioneers! This year’s cohort brings together an outstanding group of early-career researchers whose work spans the breadth of robotics. A heartfelt thank you to all the organizers who made this year’s program possible.
1/ World models are getting popular in robotics 🤖✨
But there’s a big problem: most are slow and break physical consistency over long horizons.
2/ Today we’re releasing Interactive World Simulator:
An action-conditioned world model that supports stable long-horizon interaction.
3/ Key result:
✅ 10+ minutes of interactive prediction
✅ 15 FPS
✅ on a single RTX 4090🔥
4/ Why this matters: it unlocks two critical robotics applications:
🚀 Scalable data generation for policy training
🧪 Faithful policy evaluation
5/ You can play with our world model NOW at https://t.co/SBqVDzYn86. NO git clone, NO pip install, NO python. Just click and play!
NOTE ⚠️
ALL videos here are generated purely by our model in pixel space! They are **NOT** from a real camera
More details coming 👇 (1/9)
#Robotics #AI #MachineLearning #WorldModels #RobotLearning #ImitationLearning
Pretraining with dynamics models of motor behavior (aka world models) from video will be much more central to robotics than VLMs. There are multiple choices of representations (e.g. 3D? JEPA?) but we will figure this out by and by. Exciting times!
Fully agreed with the sentiment that much of computer vision research (concretely, those not for “human consumption”) should be grounded in robotics. But as a robotics researcher, I think the more nuanced question is: how can we *rethink* these intermediate representations for embodied intelligence rather than discarding them?
Why?
The challenge, as also pointed out in Vincent’s article, is precisely the lack of perception-action data at scale. This is why intermediate representations IMO are *preferable rather than obsolete* because they open up training from scalable data sources. This can include even the vision/language encoders people love and use in robot learning — it’s hard to imagine training low-level visual representation or high-level language understanding purely from limited robot data. The same goes for intermediate representations at the structure level — world modeling, learning from Internet videos, learning from humans, and simulation — many of which still rely on 3D representations too.
At the RI seminar at CMU yesterday, I presented a three-level analysis of robot skills and discussed the pros and cons of teleoperation, simulation, and learning from videos, before presenting our research. Enjoy! https://t.co/wUPe3QlTRk
As video world models become increasingly powerful, do we still need explicit 3D?
A commonly misunderstood point is this: video world models are not “just 2D.” Their ability to maintain multi-view consistency, temporal stability, and realistic interaction necessarily implies that their latent knowledge encodes 3D world structure. Without some notion of 3D, consistency itself would not be possible.
The real distinction, therefore, is not whether a model has 3D but whether that 3D exists implicitly or explicitly.
Implicit 3D lives inside latent spaces and network weights. It supports generation, but it is difficult to localize, edit, constrain, or reason about.
It allows the world to exist, but not to be used.
Explicit 3D, in contrast, exists as structure and state: it is addressable, editable, composable, and transferable.
Its purpose is not better visual fidelity but operability: allowing the world to be manipulated, controlled, and executed.
From this perspective, video and 3D are not competing paradigms but a layered system:
2D/video is the interface to human perception; 3D is the interface to the physical world.
They can reinforce each other, but neither forms a closed loop on its own.
In practice, data, not model architecture, sets the upper bound of world models.
Explicit 3D may not be the final user-facing representation, but it is likely the most effective pathway toward scalable, high-quality, and controllable data.
Through explicit 3D/4D representations, worlds can be constructed systematically:
interactions can be programmatically sampled, states and actions can be composed, rendered into images and videos, and fed back to train video world models.
Seen this way, 3D is not the destination; it is the starting point for scaling.
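A minimal sketch of the data loop described above, under heavy assumptions: the "world" is a hypothetical block on a 1D track, `step` stands in for a simulator, and `render` draws a toy one-row image. None of these names come from a real system. The point is only that explicit state lets interactions be sampled programmatically and rendered into (frame, action, next frame) tuples that a video world model could train on.

```python
import random

WIDTH = 8  # number of pixels in the toy one-row "image" (assumption)

def step(state, action):
    """Hypothetical simulator: move the block by `action`, clamped to the track."""
    return max(0, min(WIDTH - 1, state + action))

def render(state):
    """Render the explicit state into a toy frame: 1 where the block is, else 0."""
    return [1 if i == state else 0 for i in range(WIDTH)]

def generate_dataset(num_episodes, horizon, seed=0):
    """Sample random interactions and return (frame, action, next_frame) tuples."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_episodes):
        state = rng.randrange(WIDTH)         # sample an initial explicit state
        for _ in range(horizon):
            action = rng.choice([-1, 0, 1])  # sample an interaction
            next_state = step(state, action)
            data.append((render(state), action, render(next_state)))
            state = next_state
    return data

dataset = generate_dataset(num_episodes=4, horizon=5)
```

Because the state is explicit, the sampling distribution over initial states and actions can be changed directly in code, which is the kind of control an implicit latent does not readily offer.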
What truly drives progress forward is never the model itself.
Whether we capture the world or imagine new ones,
whether data comes from observation or intent,
whether we model what is or what should be, the direction of the world is ultimately determined by human choice and purpose.
Models may extend the world,
but humans decide where it goes.
#Genie3 #worldmodel
There has been a clear trend in recent months of moving from VLA-type approaches to Video Generative Models + Inverse Dynamics Models (VAM).
While the main reason for this growth is probably the latest improvements in video generative models, I believe this shift is relevant for robotics.
While VLAs distill the foundation model's knowledge through latent representations that intertwine semantic and spatial information, VAMs distill this knowledge more explicitly, representing it spatially.
I believe this spatial grounding of VAMs might lead to much larger generalization capabilities than VLAs, and I am optimistic about even more 3D spatially grounded foundation models, in the direction of the recent @wenlong_huang https://t.co/PAsihtci7Y
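To make the "video model + inverse dynamics model" split concrete, here is a toy sketch. Everything in it is an assumption for illustration: a "frame" is just a 2D grid position, `imagine_frames` is a hypothetical stand-in for a video generative model, and `inverse_dynamics` is a hand-written stand-in for a learned IDM. It shows only the division of labor: first imagine future observations, then recover the actions that connect them.

```python
def imagine_frames(start, goal, num_frames):
    """Hypothetical 'video model': imagine frames moving one cell per step toward the goal."""
    frames = [start]
    x, y = start
    for _ in range(num_frames):
        x += (goal[0] > x) - (goal[0] < x)  # move -1, 0, or +1 along x
        y += (goal[1] > y) - (goal[1] < y)  # move -1, 0, or +1 along y
        frames.append((x, y))
    return frames

def inverse_dynamics(frame, next_frame):
    """Hypothetical IDM: infer the action that explains two consecutive frames."""
    return (next_frame[0] - frame[0], next_frame[1] - frame[1])

def plan_actions(start, goal, num_frames):
    """Imagine a future in observation space, then decode it into actions."""
    frames = imagine_frames(start, goal, num_frames)
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]

actions = plan_actions(start=(0, 0), goal=(2, 1), num_frames=3)
print(actions)  # [(1, 1), (1, 0), (0, 0)]
```

In this toy the plan lives entirely in observation space, and the action decoder never sees the goal; that is the sense in which the knowledge is represented spatially rather than in an entangled latent.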
@junjungoal Very cool! Happy to see the value-function-based approach work that well! A very refreshing alternative to end-to-end generative model approaches :)
While I really liked the article, it feels to me that this physical commonsense can be better captured by predicting the next observations (i.e. world models) and planning on them, rather than by training a policy to predict the next action (i.e. behavioral cloning).
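A toy sketch of what "predict next observations and plan on them" can look like, as opposed to mapping observations straight to actions. The assumptions are loud ones: `world_model` is a hand-written 1D integrator standing in for a learned model, the action set is three discrete moves, and the planner is a brute-force search over short action sequences with replanning at every step. This is one simple planning scheme, not the method of any particular paper.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # toy discrete action set (assumption)

def world_model(state, action):
    """Stand-in for a learned dynamics model: predicts the next state."""
    return state + action

def rollout_cost(state, actions, goal):
    """Roll the model forward and sum the distance to the goal at each step."""
    cost = 0
    for action in actions:
        state = world_model(state, action)
        cost += abs(goal - state)
    return cost

def plan_first_action(state, goal, horizon=3):
    """Search all action sequences of length `horizon`; return the best one's first action."""
    best = min(product(ACTIONS, repeat=horizon),
               key=lambda seq: rollout_cost(state, seq, goal))
    return best[0]

def run(state, goal, steps=5):
    """Replan at every step and execute only the first planned action."""
    for _ in range(steps):
        state = world_model(state, plan_first_action(state, goal))
    return state

final_state = run(state=0, goal=3)
print(final_state)  # 3
```

Changing `goal` changes the behavior without any retraining, which is the practical appeal of planning on a predictive model over cloning actions for a fixed task.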
@jparkerholder The spatial-temporal consistency looks superb!
If solid, the implications for robotics are huge! Imagine generating training environments on-the-fly from natural language. Excited to see how this evolves toward embodied agent training.
@drfeifei 100% on board with 3D/4D world models! Generative 3D environments could unlock much broader domain randomization and edge case coverage.
That said, I am curious how the physics fidelity compares to hand-crafted sims for contact-rich manipulation tasks.
We just released AINA, a framework for learning robot policies from Aria 2 demos, and are now open-sourcing the code: https://t.co/HSHrtUrt11. It includes:
✅ Aria 2 data processing into 3D observations as shown
✅ Training of point-based policies
✅Calibration
Give it a try!
This was very challenging and very cool to see evolve!
I personally was not sure if it would work, but @irmakkguzey pushed so hard to show it does.
Learning dexterous robot policies with only human video data, using the egocentric view from Aria 2 glasses, chill and easy 😁
Dexterous manipulation by directly observing humans - a dream in AI for decades - is hard due to visual and embodiment gaps.
With simple yet powerful hardware - Aria 2 glasses 👓 - and our new work AINA 🪞, we are now one significant step closer to achieving this dream.
After a year of teamwork, we're thrilled to introduce Depth Anything 3 (DA3)! 🚀
Aiming for human-like spatial perception, DA3 extends monocular depth estimation to any-view scenarios, including single images, multi-view images, and video.
In pursuit of minimal modeling, DA3 reveals two key insights:
💎 A plain transformer (e.g., vanilla DINO) is enough. No specialized architecture.
✨ A single depth-ray representation is enough. No complex 3D tasks.
Three series of models have been released: the main DA3 series, a monocular metric estimation series, and a monocular depth estimation series.
The core team members, aside from me: @HaotongLin, Sili Chen, Jun Hao Liew, @donydchen.
👇(1/n)
#DepthAnything3
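One way to read the "depth-ray representation" mentioned in the thread above, sketched under stated assumptions: each pixel carries a depth value plus a ray (an origin and a direction), and the 3D point is the origin plus depth times direction. The function name, the list-of-tuples layout, and the convention that directions have a z-component of 1 are all illustrative choices here, not DA3's actual code or API.

```python
def backproject(depths, origins, directions):
    """Compute point = origin + depth * direction for each pixel."""
    points = []
    for depth, origin, direction in zip(depths, origins, directions):
        points.append(tuple(o + depth * d for o, d in zip(origin, direction)))
    return points

# Two pixels seen from a camera at the world origin (toy values).
depths = [2.0, 4.0]
origins = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
directions = [(0.0, 0.0, 1.0), (0.5, 0.0, 1.0)]  # z-component of 1, so depth is z-depth

points = backproject(depths, origins, directions)
print(points)  # [(0.0, 0.0, 2.0), (2.0, 0.0, 4.0)]
```

Under this reading, camera pose is carried by the rays themselves, so a single per-pixel prediction is enough to place points from any view in a shared frame.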