Agreed. What we ultimately want isn't superintelligence for robots, nor VLAs, nor world models — it's something like the human brain: compact, flexible, fundamental, and running on 20 watts.
Great observation — and yes, that's exactly the tension. The Compression Gap paper identifies what the encoder needs to capture (contact geometry, affordances, spatial dynamics) but deliberately stops short of prescribing how. You're right that world models with video backbones are one compelling answer — DreamZero's results are consistent with our framework precisely because their video diffusion backbone acts as a continuous-pathway encoder that has internalized spatiotemporal priors from web-scale data. But I'd push back slightly on collapsing "better encoder" into "world model." There's a spectrum: you could have a purpose-built vision encoder trained with physics-aware objectives that doesn't do full next-frame prediction, or you could go all the way to a WAM.
@observie Thanks for the reply! I’m assuming you’re using Robstride, and I think understanding the capacitance of these motors is important for designing a pre-charge circuit, but I don’t think Robstride’s datasheet actually specifies that capacitance. How do you think about this problem?
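For context on why the capacitance matters: a rough sketch of the standard RC charging relations a pre-charge design usually starts from. The resistor, capacitance, and bus voltage below are placeholder assumptions for illustration, not Robstride figures (the datasheet does not give the capacitance), and the function names are hypothetical.

```python
import math


def precharge_time(r_ohms: float, c_farads: float, fraction: float = 0.95) -> float:
    """Time for an ideal RC pre-charge to reach `fraction` of the supply voltage.

    From v(t) = V * (1 - exp(-t / (R * C))), solved for t.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return -r_ohms * c_farads * math.log(1.0 - fraction)


def peak_inrush_current(v_supply: float, r_ohms: float) -> float:
    """Initial current through the pre-charge resistor into an empty capacitor."""
    return v_supply / r_ohms


def resistor_energy(c_farads: float, v_supply: float) -> float:
    """Energy dissipated in the resistor over a full charge from 0 V: 0.5 * C * V^2."""
    return 0.5 * c_farads * v_supply ** 2


if __name__ == "__main__":
    # Placeholder values (assumptions): 48 V bus, 10 ohm resistor,
    # 1000 uF total input capacitance across the motor drivers.
    V, R, C = 48.0, 10.0, 1000e-6
    print(f"time to 95%: {precharge_time(R, C) * 1e3:.1f} ms")
    print(f"peak inrush: {peak_inrush_current(V, R):.1f} A")
    print(f"resistor energy per charge: {resistor_energy(C, V):.3f} J")
```

Without a datasheet value, C has to be measured or estimated, and every number above scales with it, which is why the missing spec is a problem.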
Japan has the unique foundation to build anything, but we’re barely tapping into it. For the scale of our supply chain, we have far too few manufacturers, and our industrial focus is shrinking. In the last ten years, almost no hardware startups have dared to take on frontier challenges. Now, we’re at a point where even vital humanoid parts are hard to find at home.
These problems are solvable if we face them head-on—which is exactly what I’m doing. For geopolitical reasons, we must not let our supply chain rot. In the next decade, I will make Japan the world’s ultimate supply chain: the only country on earth that can build everything from the tiniest component to the final product entirely in-house.
Going global from Day 1 is essential. But 'distribution hacks' aren't the answer—they only get you slightly better versions of the same old results. We must build something fundamentally great and prove its value to the entire world.
Prove it. Throw the proof in their faces.
More research is needed on post-training for VLA, as its importance is currently underestimated. BC-centric VLA is robust to distribution shift only when combined with well-known approaches such as offline-to-online RL; outside these regimes, it tends to break down. What is needed is an evaluator or reward model that can adapt to a much broader range of distributions.
The reason real-world RL appears to work so effectively is that imitation learning (IL), which provides the initialization, already places the policy inside the basin of attraction of a high-performing solution. In other words, this kind of RL fundamentally depends on IL.
Therefore, if there is anything we are overlooking, it lies in the algorithms, architectures, and ultimately the data-design trade-offs of IL itself.
Is real-world RL becoming a cheat code for robot tasks?
If we take a task A, run imitation learning, then fine-tune with real-world RL, the task will almost certainly work. So what am I missing?
(We’re not talking about generalization to task B here.)
You might find it interesting that increasing the amount of training data for a VLA leads to faster action speeds. This is thought to result from reduced variance error in the learned behavior mapping, which in turn reduces replanning. Maybe I’ll write a blog post about this.
One of the people who has influenced me the most appeared in this interview. After reading Casey’s blog, I started to think deeply about things that at first seemed difficult, and he’s the one who led me into the world of hardware.
Casey Handmer (@CJHandmer) on Cheeky Pint discusses his solar-maximalist worldview, Henry Kaiser, hard tech, why Hyperloop was doomed by physics from the start, and his plan to refill the Salton Sea.
Timestamps:
00:00 Intro
02:28 Henry Kaiser
08:49 Introducing Terraform
13:08 Where electricity won’t work
16:50 The solar maximalist perspective
22:57 Terraformer Mark One
27:49 The role of intervention
37:30 American dynamism
47:36 The Origins of Efficiency, by Brian Potter
48:33 Children and education
55:15 Desalination
01:08:16 Lessons from leadership