🚀 #ICLR2026 Oral 💥
How can we design world models that capture object interactions directly from pixels?
Introducing Latent Particle World Models (LPWM): the first end-to-end self-supervised, object-centric world model, trained from videos, supporting action/img/lang conditioning.
1/n
A VLA is 95% certain about its current action. Will it succeed at the task 95% of the time?
Obviously, not necessarily. But if you’re clever, you can *calibrate* action probability to task success.
Our #ICML2026 paper formalizes this and introduces SOTA algorithms based on a new connection to RL temporal differences.
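To make "calibrating action probability to task success" concrete, here is a minimal sketch using plain histogram binning. This is a standard calibration baseline swapped in for illustration; it is not the paper's temporal-difference method. The function names, the bin count, and the toy data are all assumptions.

```python
def fit_histogram_binning(confidences, successes, n_bins=5):
    """Map each confidence bin to the empirical task success rate seen in it.

    Illustrative baseline only (hypothetical helper, not the paper's algorithm).
    Returns a list with one success rate per bin, or None for empty bins.
    """
    totals = [0] * n_bins
    hits = [0] * n_bins
    for conf, success in zip(confidences, successes):
        b = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the last bin
        totals[b] += 1
        hits[b] += int(success)
    return [hits[b] / totals[b] if totals[b] else None for b in range(n_bins)]


def calibrated_success(conf, table):
    """Look up the calibrated success estimate; fall back to the raw confidence."""
    b = min(int(conf * len(table)), len(table) - 1)
    return table[b] if table[b] is not None else conf


# Toy data (made up): the model says 0.95 every time, but only half the episodes succeed.
table = fit_histogram_binning([0.95, 0.95, 0.95, 0.95], [1, 1, 0, 0])
estimate = calibrated_success(0.95, table)
```

On this toy data the raw action confidence of 0.95 maps to an estimated task success rate of 0.5, which is the gap the post is pointing at.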
I’m attending #ICLR2026 in Rio this week to present LPWM! Friday April 24 Poster Session 3 10:30, Oral Session 4B 3:15p. Happy to chat about self-sup object-centric learning and world models. I’ll be on the job market soon and looking for exciting opportunities!
#ICLR @iclr_conf
I will be at #ICLR2026 this week to present our work on Hierarchical Entity-centric Reinforcement Learning!
Come by our poster (Thursday Poster Session 2 P4-#4712) and reach out anytime to talk about #ReinforcementLearning #WorldModels #HierarchicalRL
🚀 Excited to share ViPRA: Video Prediction for Robot Actions
📍 Accepted to #ICLR2026 @iclr_conf
🏆 Best Paper — #NeurIPS2025 Embodied World Models Workshop
Robot learning today still needs millions of action-labeled videos.
Yet videos from humans and the web are abundant but lack action labels. Meanwhile, pretrained video models already learn rich dynamics.
ViPRA is a recipe for turning pretrained video models into robot policies while enabling robot learning to scale with actionless videos.
🧵 Thread ↓
🎤 Dr. @TalDaniel8 (@CarnegieMellon) in a deep dive w/ @ceciletamura of @ploutosai
How can AI discover objects, model uncertainty, and predict the future from raw video alone?
🔴 [https://t.co/yimiKyJeYV](https://t.co/yimiKyJeYV)
During training, the posterior latent actions condition the dynamics module that predicts the next-frame prior.
A KL regularization term aligns this prediction with the latent policy’s output, forming a VAE-style objective over particle transitions.
7/n
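A minimal sketch of the KL term described in 7/n, assuming each particle's latent action is a diagonal Gaussian given as (mean, log-variance). The Gaussian form, the KL direction (posterior against prior, as in a standard VAE), and the function names are assumptions for illustration, not details confirmed by the thread.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl


def particle_kl(posterior, prior):
    """Sum the KL over particles.

    posterior: per-particle (mean, logvar) from the inverse dynamics head (assumed form).
    prior:     per-particle (mean, logvar) from the latent policy head (assumed form).
    """
    return sum(
        gaussian_kl(mq, lq, mp, lp)
        for (mq, lq), (mp, lp) in zip(posterior, prior)
    )


# Two particles with 1-D latent actions (toy numbers).
posterior = [([1.0], [0.0]), ([0.0], [0.0])]
prior = [([0.0], [0.0]), ([0.0], [0.0])]
kl_total = particle_kl(posterior, prior)
```

In a VAE-style objective this term would be added to a reconstruction loss on the predicted next-frame particles; the weighting between the two is not specified in the thread.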
The inverse dynamics observes particles at t and t+1, inferring the latent actions that caused the change.
The latent policy sees only particles at t and outputs a distribution over possible latent actions from the current state.
6/n
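The difference between the two heads in 6/n is mostly in what each one gets to see. A toy sketch of the interfaces, assuming particles are plain feature vectors and latent actions are diagonal Gaussians; the function bodies are hand-written stand-ins for what would be learned networks, and all names are hypothetical.

```python
from typing import List, Tuple

Vec = List[float]
Gaussian = Tuple[Vec, Vec]  # (mean, log-variance); diagonal Gaussian is an assumption


def inverse_dynamics(z_t: List[Vec], z_next: List[Vec]) -> List[Gaussian]:
    """Posterior head: sees particles at t and t+1, returns one latent action per particle.

    Toy stand-in: the mean is the observed per-particle change.
    """
    out = []
    for p_t, p_next in zip(z_t, z_next):
        mean = [b - a for a, b in zip(p_t, p_next)]
        out.append((mean, [-2.0] * len(mean)))  # low variance: the change was observed
    return out


def latent_policy(z_t: List[Vec]) -> List[Gaussian]:
    """Prior head: sees only particles at t, returns a distribution per particle.

    Toy stand-in: zero mean and unit variance for every particle.
    """
    return [([0.0] * len(p), [0.0] * len(p)) for p in z_t]


# Two particles with 2-D features (toy numbers).
z_t = [[0.0, 0.0], [1.0, 1.0]]
z_next = [[0.5, 0.0], [1.0, 2.0]]
posterior = inverse_dynamics(z_t, z_next)
prior = latent_policy(z_t)
```

Both heads return one distribution per particle, which is what allows different entities in the same scene to receive different latent actions.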
To address this, we introduce a context module that predicts latent actions per particle, enabling fine-grained, multi-entity dynamics.
It has two heads: (1) an inverse dynamics (posterior) and (2) a latent policy (prior).
5/n
Building a world model means capturing stochastic particle dynamics.
Existing “latent action” models help, but (1) need strong regularization (e.g., VQ) and (2) rely on a single global latent—missing interactions among multiple entities.
4/n
DLP (Deep Latent Particles) decomposes scenes into particles with several attributes (keypoints, bounding boxes, masks), fully unsupervised.
These act as visual “tokens,” making cross-modal long-horizon reasoning (vision ↔ language) far more natural than with standard pixel patches.
3/n