Video diffusion models learn motion indirectly through pixels.
But motion itself is much lower-dimensional.
We introduce 64× temporally compressed motion embeddings that directly capture scene dynamics.
This enables efficient planning → 10,000× faster than video models.
🧵👇
⚠️ Standard first stages are not sufficient for safety-critical applications!
The most extreme weather events are often the hardest to decode.
One latent → many plausible reconstructions
Deterministic decoders hide that uncertainty.
Meet FREUD 🧵👇
Check out our work on how to scale NVS on internet-scale data! We provide fixes to the unsupervised NVS pipeline (RayZer) and also obtain more interpretable pose estimates while simplifying the overall setup.
The internet is full of video. So why can't novel view synthesis just scale on it?
Real-world video is simultaneously unposed, messy, and dynamic, breaking self-supervised NVS.
We fixed that. RayDer learns static-scene NVS from dynamic internet video, scaling like an LLM. A🧵
💡 Training with differently noised patches increases overall image gen performance, as the model learns a better underlying representation.
This holds even for plain Euler sampling, but their sampler increases the gap even more!
Diffusion models treat every part of an image equally.
→ Same number of steps. Same compute.
But images aren’t uniform. 🤔
Some regions are easy, others are hard.
So why force the model to treat them the same? 🧵
@nurvai_ai You need periodic regrounding, and that's also what we do for LIBERO. You usually also have a translation error from converting tracks into actions that a robot can actually execute, which you also have to compensate for.
Stop predicting motion step-by-step. Model the whole motion in a compact representation for efficient planning.
📄 Paper: https://t.co/S51t6kxMqY
💻 Models: https://t.co/MohDxwjpz0
Joint work with @KoljaBauer, @StefanABaumann, @itsbautistam, Josh Susskind, and Björn Ommer.
Amazing work led by @rmsnorm @KoljaBauer and our collaborators at LMU, to be presented at @CVPR! Personally, I find this question of "what's the right level of abstraction for planning in physical space?" to be very intriguing. Pixels over time are very low SNR (i.e., the argument behind JEPA), but motion/trajectories carry a lot of information while being extremely compressible. I believe there's a lot more to uncover from this direction. Very glad to be part of this one!
@KoljaBauer @StefanABaumann @itsbautistam 1️⃣ https://t.co/2wWMnibvRa
Also, shoutout to two other recent works that explore how to use point tracks for world modeling.
👇...
You don't imagine the future by mentally rendering a movie. You trace how things move -- abstractly, sparsely, step by step.
We built a model that does exactly this. It predicts motion, not pixels -- and it's 3,000× faster than video world models.
Myriad, accepted at @CVPR 2026
What’s the right representation for a world model? 3D, pixels, or something else?
Excited to release our new paper “Forecasting Motion in the Wild” where we propose point tracks as tokens for generating complex non-rigid motion and behavior
From @GoogleDeepmind @Berkeley_AI @TTIC_Connect
Do we really need pixel generation to model motion? 🤔
We show how directly representing motion in a compact space enables efficient, scalable planning.
10,000× faster than video models, enabling planning and reasoning in open-world and robotics settings.
Check it out ⬇️