Real-world RL is still too brittle and data-hungry for long-horizon, contact-rich tasks.
We introduce Simulation Distillation (SimDist), which turns large-scale simulated experience into reusable world-model priors for rapid real-world adaptation.
By combining online planning with dynamics adaptation, SimDist achieves high success rates on tasks requiring precision, force, and reactivity.
Play with our interactive visualization to see for yourself: https://t.co/qFGNySxdAl
(1/n)
Weโre releasing OmniReset, a framework for training robot policies using large-scale RL and diverse resets for contact-rich, dexterous manipulation.
OmniReset pushes the frontier of robustness and dexterity, without any reward engineering or demonstrations.
Try the policies yourself in our interactive simulator! https://t.co/3hW3nYx2vD
(1/N ๐งต)
A reward model that works, zero-shot, across robots, tasks, and scenes?
Introducing Robometer: Scaling general-purpose robotic reward models with 1M+ trajectories.
Enables zero-shot: online/offline/model-based RL, data retrieval + IL, automatic failure detection, and more!
๐งต (1/12)
๐๐ซ๐๐๐ฆ๐๐๐ซ๐จ ๐ข๐ฌ #๐ ๐จ๐ง ๐๐จ๐ญ๐ก ๐๐จ๐ฅ๐ฆ๐จ๐๐ฉ๐๐๐๐ฌ ๐๐ง๐ ๐๐จ๐๐จ๐๐ซ๐๐ง๐ ๐
๐ช๐ต๐ฎ๐ ๐บ๐ฎ๐ธ๐ฒ๐ ๐๐ต๐ถ๐ ๐ป๐ผ๐๐ฎ๐ฏ๐น๐ฒ: DreamZero-DROID is trained ๐๐๐๐ ๐ ๐๐๐๐ก๐โ using only the DROID dataset. No pretraining on large-scale robot data, unlike competing VLAs. This demonstrates the strength of video-model backbones for generalist robot policies (VAMs/WAMs).
More broadly, training ๐๐๐๐ฆ on real data and evaluating on (1) transparent, distributed benchmarks like ๐๐จ๐๐จ๐๐ซ๐๐ง๐ or (2) scalable sim-benchmarks like ๐๐จ๐ฅ๐ฆ๐จ๐๐ฉ๐๐๐๐ฌ is an exciting step toward fairer and more reproducible evaluation of generalist policies, one that the community can hillclimb together to measure progress.
Special thanks to the Ai2 MolmoSpaces team (@notmahi@omarrayyann@YejinKim4 Max Argus) and the RoboArena team (@pranav_atreya) for helping with the set-up and getting these evaluations! Special shout out to @youliangtan@NadunRanawakaA@chuning_zhu, who led these efforts from the GEAR side :)
+ We also release our DreamZero-AgiBot checkpoint & post-training code to enable very efficient few-shot adaptation. Post-train on just ~30 minutes of play data for your specific robot, and see the robot do basic language following and pick-and-place ๐ค(See YAM experiments in our paper for more detail).
++ We also provide the entire codebase & preprocessed dataset to replicate the DreamZero-DROID checkpoint.
๐ https://t.co/pEoQ9QrTHW
๐ป https://t.co/AdHqcLuwIy
RoboArena: https://t.co/NiUQxLLL6F
MolmoSpaces: https://t.co/ZXiguGYPx9
๐ DreamZero training code is LIVE โ train your own WAM (aka VAM)!
๐ง Replicate DROID from-scratch training
๐ Run evals on sim (DROID-Sim, MolmoSpaces, Polaris) & real-world (RoboArena)
No 2 GB200s for real-time inference? No problem โ let NVIDIA carry that burden ๐ช. Sign up for our API and jump into prompting new tasks! (e.g. "fan the burger" ๐, totally unseen verb/task from DROID)
Coming soon: new embodiment/robot fine-tuning initialized from our DreamZero-AGIBot checkpoint. Stay tuned! ๐ค
๐ https://t.co/50wtDaDO8E
We just gave robots "imagination," and the results are wild. ๐คฏ
This robot wasn't trained to untie shoes or shake hands.
It's never seen these tasks before.
It simply "dreams" the future outcome, then acts to make it real. ๐งต๐
New milestone: we trained a robot foundation model on a world model backbone, and enabled zero-shot, open-world prompting capability for new verbs, nouns, and environments. If the world model can "dream" the right future in pixels, then the robot can execute well in motors. We call it "DreamZero", our first World Action Model (WAM).
Our team had tons of fun at the lab typing anything we like into an open text prompt, and watch the robot perform tasks it was never trained on. An emergent capability we didn't quite expect. Obviously not GPT-3 reliable yet, but we are marching into the GPT-2 era.
Discoveries:
- Model and data recipe co-evolve. Compared to VLAs, WAMs learn best from diverse data, breaking away from the conventional wisdom that lots of repeated demos per task are the bread and butter. Diversity >> repetitions.
- X-embodiment is extremely hard. Pixels are the answer. Different robot morphologies traditionally have a hard time sharing knowledge well. But if we put video first, pixels become the universal bridge connecting different hardware - even videos of human first-person view.
DreamZero shows significant robot2robot and human2robot transfer. With only 55 trajectories on a *new*, unseen hardware (~30 min of teleop), it adapts so quickly and retains zero-shot prompting ability.
Yesterday I posted about the "Second Pre-training Paradigm": world models are the next-gen foundation of Physical AI, not language backbones.
Today, we are proving it works. And 2026 has just begun.
Paper: World Action Models are Zero-Shot Policies.
Read it now: (thread)
Introducing DreamZero ๐ค๐ย from @nvidia
> A 14B โWorld Action Modelโ that achieves zero-shot generalization to unseen tasks & few-shot adaptation to new robots
> The key? Jointly predicting video & actions in the same diffusion forward pass
Project Page: https://t.co/qhygDzu6NY
๐งต (1/10)
Excited to introduce PolaRiS, a real-to-sim recipe for turning short real-world videos into high fidelity simulation environments for scalable and reliable zeroshot generalist policy evaluation.
https://t.co/nWcR6YuPf4
(1/N ๐งต)
World models โ action-conditioned predictive models of the environment โ are an exciting are of research for robots that can be useful both for training and for test-time compute. But video-based world models waste a lot of predictive power on reconstructing pixels, which makes model and data requirements much higher and limits how far out into the future their predictions remain viable.
Instead, what if we learned a purely semantic world model, one which predicts which properties will be true about the world after a sequence of actions, without reconstructing the whole images? Jacob Berg tells us more.
Watch Episode #53 of RoboPapers now, with @micoolcho and @chris_j_paxton!
Punchline: World models == VQA (about the future)!
Planning with world models can be powerful for robotics/control. But most world models are video generators trained to predict everything, including irrelevant pixels and distractions. We ask - what if a world model only predicted the semantic information necessary for decision-making?
Introducing Semantic World Models (SWM). Given an observation and an action sequence, SWMs cast modeling as answering textual questions about the future outcome resulting from the actions. Recasting world modeling as a VQA problem lets us directly leverage the pretrained knowledge and machinery of VLMs for generalizable modeling. We had a lot of fun thinking about how this work helps connect these two seemingly very different fields of study - VLMs and world models! ๐งต(1/6)
Paper: https://t.co/KIrRG2JO1a
Fun demo: https://t.co/leogQBvcO0
I'm sadly unable to be at #RSS2025 this year, but my students @prodarhan, @chuning_zhu and @marceltornev will be! Find them presenting some exciting work today, 6/21:
1) @chuning_zhu will present Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Spotlight talk: 4:30-5:30 pm (Bovard auditorium)
Poster: 6:30-8:00pm, poster #50 (Associates park)
Paper: https://t.co/OkNiOHowNS
Website: https://t.co/p6GAyLR3d5
2) @prodarhan and @marceltornev will present Robot Learning with Super-Linear Scaling
Spotlight talk: 5:30-6:30 pm (Bovard auditorium)
Poster: 6:30-8:00pm, poster #58 (Associates park)
Paper: https://t.co/b1ODW6QgjO
Website: https://t.co/fnvrg0YR3n
Hope y'all can make it!
Counterintuitive as it may seem, a lot can be done for RL without rewards! Examples: goal-conditioned RL, unsupervised RL, exploration, world modeling... If you are working on these directions, consider submitting to the RLBrew workshop at RLC!
โ ๏ธ Reminder! Submissions for @RL_Conference's RL beyond Reward Workshop are due May 30 (AoE)!
We are brewing an interesting program and seeking innovative research work in reward-free RL. All papers are welcome, from exploratory abstracts to complete research papers.
@KnightNemo_ Currently weโre using the same noise scale for all frames. We havenโt tried UWM with dense image prediction and different image noise scales. That would be a cool future direction!
Scaling imitation learning has been bottlenecked by the need for high-quality robot data, which are expensive to collect. But are we utilizing existing data to the fullest extent? A thread (1/11)
With UWM, we hope to unify the often disparate paradigms of policy learning and world modeling for large-scale robot learning. Fun collaboration with @yu_raymond5, Siyuan Feng, @Ben_Burchfiel, Paarth Shah, and @abhishekunique7! (11/11)
Project website: https://t.co/T2cnQaok8j
We acknowledge a concurrent work, UVA, which is motivated by the same principles but differs in instantiation. UVA predicts shared latents for videos and actions and uses tokens for masking. UWM is an end-to-end diffusion transformer using noise for masking. (10/11)