π0.7 handles diverse prompts that don't just say what to do, but also how to do it, including rich language and multimodal information, such as visual subgoal images. At test time, these images can be produced by a lightweight world model.
@Miles_Brundage @davidshustin @jasminewsun no ill will towards the analysis itself, but just to correct the record, it's not affiliated with Physical Intelligence in any way :)
oh you're using VLAs? everyone's using GRPs now. just kidding we're all on LBMs. world models are the future so we developed our own WAM. we're using DVAs. we were using UWMs but our robot caught on fire so we switched to DreamUMVLAPs. we're shipping a robot that passes butter.
@chris_j_paxton @notmahi what evidence is there that the aux loss stuff made a huge difference? from my reading of the paper, there are no ablations that test a Wan backbone with no video prediction loss
@chris_j_paxton the sheer number of hours is still absolutely minuscule compared to language, which is more significant I think
https://t.co/77plg82xiL
My favorite slide that I made for my talk last weekend -- a very silly thought experiment in which we compare language datasets to robotics datasets (in the most shallow way possible). Yes it is to scale; I learned that the maximum shape size in Keynote is 20,000pts
General-purpose AI models are behind some of the most exciting applications we now can't live without. We envision that an analogous “physical intelligence layer” built with models like π0.6 will similarly spur a new wave of applications for the physical world.
We’ve recently begun working with a handful of companies that have deployed their robots to do real-world, useful things.
https://t.co/udVO9fV0PH
Robots have a "latency" problem. 🤖 💨
@kvablack explains how to use diffusion models and "Action Chunking" to make robot movements seamless—even when the AI is still "thinking."
Watch the full clip on YT! Link in replies.
@kenbwork sure, I mean that "the literal error bar is symmetric when it consists of +-1 SEM". I think most would know that's what I (or Generalist) mean when we say "plotting the standard error".
I know I'm the only robot learning researcher to ever care about statistical rigor, but technically you shouldn't use standard error for a binary success rate. The binomial distribution isn't symmetrical 😅
More pretraining improves GEN-0 real-robot performance (via blind A/B evals with closed-loop rollouts).
Improvements are significant in the low-data regime, but the best models thrive with both pretraining and ample post-training.
See blog addendum: https://t.co/LVBdzMxn0f
@kenbwork you're right that it depends on how they're pooling though. if they're averaging multiple proportions then it's no longer binomial. not sure what you can do then besides do a lot more trials. or maybe just presenting the data per-task (unpooled) is better.
@kenbwork the SE is symmetric bc it relies on the CLT, which is fine for arbitrary distributions and a large enough sample size. but if you have a smaller sample size and you know the distribution is binomial, you can do better (e.g., Wilson score interval)
@aliuahma if you look it up it seems like the rule of thumb is np>10, which it doesn't seem like they have. but in practice I don't see a reason to ever use the normal approximation, especially with proportions near 0 or 1
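To make the point in this thread concrete, here is a minimal sketch comparing the normal-approximation interval (estimate ± z·SE, often called the Wald interval) with the Wilson score interval for a binary success rate. The function names and the 9-out-of-10 example are illustrative assumptions, not taken from any of the evaluations discussed above, and z = 1.96 assumes a roughly 95% interval.

```python
import math


def wald_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval: p_hat +/- z * SE. Always symmetric around p_hat."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / trials)
    # Deliberately not clamped, to show that it can leave [0, 1].
    return p_hat - half_width, p_hat + half_width


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z2 / (4.0 * trials**2)) / denom
    )
    # Clamp only to guard against floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical example: 9 successes out of 10 rollouts.
    for name, fn in [("Wald", wald_interval), ("Wilson", wilson_interval)]:
        lo, hi = fn(9, 10)
        print(f"{name:>6}: [{lo:.3f}, {hi:.3f}]")
```

With a small number of trials and a success rate near 1, the Wald interval extends above 1, while the Wilson interval stays inside [0, 1] and is asymmetric around the observed rate. At 10 out of 10, the Wald interval collapses to zero width, which is the kind of failure that motivates avoiding the normal approximation here.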
@Christian061145 it all depends on your constraints. inference-time RTC is still more convenient. however, we already do a lot of post-training so we may as well add something there, and this simple method seems to work well enough. I'm sure ppl will find other methods that work better.
Last week I presented real-time chunking (RTC) at NeurIPS, and we did a live coffee demo the very same evening. To celebrate, we're releasing a (very short) follow-up paper describing a training-time variant of RTC, which is what we've actually been using in our demos!