AllenAI just released the MolmoAct2 FAST action tokenizer on Hugging Face
Turns continuous robot actions into discrete tokens for training vision-language-action models.
Fully open and trained on millions of trajectories across five embodiments.
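For readers new to action tokenization, here is a minimal sketch of the general idea of turning continuous actions into discrete tokens. It uses plain uniform binning, which is an assumption made for illustration and not the method the released tokenizer uses (the posts above do not describe that). The names `tokenize`, `detokenize`, `low`, `high`, and `n_bins` are all hypothetical.

```python
# Illustrative only: naive uniform binning of continuous action values.
# This is NOT the released tokenizer's algorithm; all names are hypothetical.

def tokenize(actions, low=-1.0, high=1.0, n_bins=256):
    """Map each continuous value to an integer bin index in [0, n_bins - 1]."""
    tokens = []
    for a in actions:
        clipped = min(max(a, low), high)           # clamp to the assumed action range
        frac = (clipped - low) / (high - low)      # normalize to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def detokenize(tokens, low=-1.0, high=1.0, n_bins=256):
    """Map each bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


actions = [-1.0, -0.37, 0.0, 0.52, 1.0]
tokens = tokenize(actions)
recovered = detokenize(tokens)
```

The round trip is lossy: each recovered value can differ from the original by up to half a bin width, so more bins mean finer control at the cost of a larger vocabulary.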
@GPTJustin If you have observations/tasks that look very different to the pretraining mix (you do), you'll probably need to train the whole model rather than just the action expert. Very little representation learning happens inside the action expert itself.
I love this @Ultraroboticsco OP1 design, super pragmatic choices. Tbh, the robot-on-a-robot setup maps to humans' dynamic range and usable workspace size better than a lot of full humanoids.
Robots will have many shapes.
OP1 has all the moves. It was designed to be safe, productive, and easy to deploy.
Many companies we've spoken with are concerned about legged and wheeled robots being tipped over; OP1's not getting knocked down.
Many battery-powered robots need downtime to recharge or swap batteries; OP1 plugs directly into a standard power outlet and runs continuously, never needing to recharge.
Many robots require difficult integrations, such as being bolted down to the floor; OP1 is on locking wheels and can be moved around easily when you need to move it.
At Ultra, we're on a mission to make the world's most useful and deployable robot. OP1 is a big step (or in our case, extension) in that direction.
@HaoruXue It's not really a "rivalry" between VLA vs. WAM (vs. from-scratch) - when you have enough data, the thing that matters the most is how efficiently you learn from it, not what your model used to be before it was a policy. Most likely the best recipe will use parts of both.
Once you have an architecture that is able to soak up lots of data, the two things that determine downstream capabilities are (1) the data, and (2) your scaling constant. Lots of the tricks that would make for good academic papers just stop mattering as much at large data scales.
Really great thread from @lucy_x_shi about the origins of π0.7. We went into the project originally thinking that a hierarchical "world model"-style policy would be a great way to make the model better at generalization, but as we've scaled our data that gap largely disappeared.
1/ We just released π0.7 — a steerable generalist robot model with emergent capabilities.
I want to share a bit of the backstory, because π0.7 taught me something surprising about where robot learning is heading. A thread on bittersweet lessons 🧵
For details on how we trained π0.7 (and some more videos of robots doing cool things - we're especially excited about the cross-embodiment transfer results), take a look at our tech report and blog post https://t.co/7egbOXjfut
Excited to share the latest model we've been training: π0.7: a highly steerable model that can be prompted to do almost any task, out of the box!
This robot has never seen this air fryer - in fact, it's never seen *any* air fryer - but with some prompting it can use it perfectly!
It's not only a capable generalist policy, but it can also perform highly dexterous tasks right after pretraining! Here are some videos of the model cutting and peeling vegetables
@julianboolean_ @ylecun Not sure what you mean - autoregressive is just a parameterization; it's independent of the training method (teacher-forced vs. on-policy)?
@ylecun @julianboolean_ My pet peeve is posts using this slide to dunk on @ylecun. The (1-e)^n bound is about training LLMs off-policy; RL fixes it (for reasoning) by training on-policy
Kinda lame for a Turing award winner to be snarkposting instead of taking 15s to write a serious explanation though
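For context on the (1-e)^n bound mentioned above, here is a quick sketch of the arithmetic. It assumes, as that style of argument typically does, that each of n generated tokens independently goes wrong with probability e; the function name and the example numbers are hypothetical and chosen only for illustration.

```python
# Illustrative arithmetic only: chance that all n tokens are correct when each
# token independently goes wrong with probability e (an assumed error model).

def p_all_correct(e, n):
    """Probability that n independent steps all succeed, each failing with probability e."""
    return (1.0 - e) ** n


short = p_all_correct(0.01, 100)    # roughly 0.366
long = p_all_correct(0.01, 1000)    # far smaller: the value decays exponentially in n

print(f"n=100:  {short:.3f}")
print(f"n=1000: {long:.2e}")
```

The decay depends entirely on the independence assumption and on e staying fixed along the sequence, which is the part the on-policy vs. off-policy discussion above is about.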
Partial observability means that a robot policy - even with infinitely many demonstrations - will still be worse than the demonstrator. With MEM, we built a recipe to close this gap.
Fantastic work led by @KarlPertsch @marceltornev @DannyDriess.
We’ve developed a memory system for our models that provides both short-term visual memory and long-term semantic memory.
Our approach allows us to train robots to perform long and complex tasks, like cleaning up a kitchen or preparing a grilled cheese sandwich from scratch 👇