To bring generalist intelligent robots to the real world, we have to overcome the data scarcity problem.
At Rhoda, we are solving it by reformulating robot policies as video generation.
Today, we introduce the Direct Video-Action Model (DVA).
I'll be at CVPR next week (6/3–6/7). If you’re working on or exploring opportunities in video models for robotics (research or engineering), happy to chat 🤖
We’re also hosting a Rhoda party Thursday night with many of our technical team in town. DM me for an invite 🍻
How? Existing video models aren't optimized for real-time inference.
Instead of fine-tuning off-the-shelf video models, we co-design inference-aware model architectures and model-aware inference optimizations from the ground up.
Can a large foundation video model run as a real-time robot policy at the edge, on a single RTX 5090?
• ✅ No quantization
• ✅ No distillation
• ✅ Full denoising (all the way from noise to clean video)
We just proved it's possible. 👇🎬
The future we're building toward is one where robots adapt to new tasks in seconds.
At Rhoda, we tackle real-world problems through fundamental research.
Full story + technical deep-dive: https://t.co/WA9oO65qzE
Teaching a robot a new task typically means stopping operations, collecting teleoperated demonstrations, and retraining. That process takes hours at a minimum. We wanted to know if we could collapse it to seconds — from a single human demo, on the fly, no retraining required.
Early research preview: we can.
How it works: we train on paired human demo and robot execution data. Because our DVA, FutureVision, has long-context visual memory built in (https://t.co/J3veqMf4Kp), we prepend the full human video into the model's context and predict robot actions closed-loop. The model watches a human do something once and understands what to do next.
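A minimal sketch of that loop, for intuition only. It assumes a hypothetical `policy` callable that maps the whole context (human demo frames first, then robot observations) to the next action. The names `run_closed_loop`, `toy_policy`, and the string stand-ins for frames and actions are illustrative assumptions, not Rhoda's implementation or API.

```python
from typing import Callable, List, Sequence

Frame = str   # stand-in for an image; a real system would use tensors
Action = str  # stand-in for a robot command


def run_closed_loop(
    policy: Callable[[Sequence[Frame]], Action],
    demo_frames: Sequence[Frame],
    get_observation: Callable[[], Frame],
    apply_action: Callable[[Action], None],
    max_steps: int,
) -> List[Action]:
    # The human demo is placed at the start of the context exactly once.
    context: List[Frame] = list(demo_frames)
    executed: List[Action] = []
    for _ in range(max_steps):
        # Closed loop: every step appends a fresh robot observation,
        # so the policy always conditions on demo + everything seen so far.
        context.append(get_observation())
        action = policy(context)
        apply_action(action)
        executed.append(action)
    return executed


def toy_policy(context: Sequence[Frame]) -> Action:
    # Hypothetical stand-in for the model: mirror the demo step that
    # matches how many robot observations have arrived so far.
    demo = [f for f in context if f.startswith("human:")]
    robot_steps = len(context) - len(demo)
    index = min(robot_steps, len(demo)) - 1
    return demo[index].replace("human:", "robot:")


demo = ["human:reach", "human:grasp", "human:place"]
log: List[Action] = []
counter = iter(range(100))
actions = run_closed_loop(
    toy_policy, demo, lambda: f"obs:{next(counter)}", log.append, max_steps=3
)
print(actions)
```

The point of the sketch is the data flow: nothing is retrained between tasks, and only the contents of the context change.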
Here’s something we’ve never seen done before.
Real-world tasks are long and ambiguous. Solving them requires visual memory and state tracking. Most robot policies only see the last few frames. Ours doesn't.
We put our DVA, FutureVision, to the test on the perfect testbed: the shell game 🐚. The DVA nails it.
How? Our DVA implements robot policy as future video generation.
Given the context, the model generates future videos (bottom left) predicting not just the correct cup to pick up, but even the appearance of the hidden object.
Native training on long, continuous videos gives the model built-in long-context memory.
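Why memory matters here can be shown with a toy, symbolic version of the shell game. This is an illustrative assumption, not how the DVA works internally: the real model operates on video, while this sketch uses hypothetical `("reveal", cup)` and `("swap", i, j)` events. It only shows that a policy limited to the last few frames loses the state it needs.

```python
from typing import Optional, Sequence, Tuple

Event = Tuple  # ("reveal", cup_index) or ("swap", i, j)


def track_object(events: Sequence[Event]) -> Optional[int]:
    """Return the cup hiding the object, or None if it was never observed."""
    position: Optional[int] = None
    for event in events:
        if event[0] == "reveal":
            position = event[1]
        elif event[0] == "swap" and position is not None:
            _, i, j = event
            if position == i:
                position = j
            elif position == j:
                position = i
    return position


episode = [("reveal", 0), ("swap", 0, 2), ("swap", 1, 2), ("swap", 0, 1)]

full_context = track_object(episode)        # sees the reveal and every swap
short_window = track_object(episode[-2:])   # only the last two events

print(full_context, short_window)
```

With the full episode the tracker follows the object through every swap. With only the last two events it never saw where the object started, so it cannot answer.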
"I don't think the world is going back to non video based pretraining." Our CEO @startupjag spoke with @bheater at @a3automate on why video is the foundation for robots that actually work in production. https://t.co/qmBWXnWEAj
1/ We are speedrunning industrial robotics.
It took us just 19 days from the first day of data collection to filming a 2.5-hour continuous run of our model autonomously breaking down industrial containers — zero human intervention.
The data efficiency of our DVA model is fundamentally changing how fast we bring robots out of the lab and into the factory.
Autonomous operation with 3 hours of data collection at a customer factory.
3/ Achieving a 100% autonomous rate in a 2.5-hour continuous run means the model needs to handle all kinds of edge cases. Whether it's pulling a drifted box back into range or re-attempting a failed flip, the model self-corrects in real time.
-> The trash is out of reach. The robot must reposition the box before attempting another grab.
-> The door won't fall open. The robot recognizes a latch probably wasn't fully released and goes back to fix it.
-> The first flip fails. The robot doesn't hesitate — it goes for a second attempt.
-> The box has drifted too far to reach the latch. The robot pulls it back into range.
Most robot demos are “golden runs”: a perfect take selected from many attempts.
But real-world deployment is about Continuous Operation.
Watch our DVA model tackle a real-world decanting task for 1.5 hours straight: uncut, zero human intervention.
🧵👇
Trained on just 11 hours of robot data, our model is surprisingly robust, thanks to web-scale pre-training.
It doesn't just avoid errors; it handles them. If the lid tears off, it finds a new way to grip. If a bearing is stuck, it shakes the bag loose.
Watch our robot navigate through these corner cases: 👇