We trained a humanoid with 22-DoF dexterous hands to assemble model cars, operate syringes, sort poker cards, and fold/roll shirts, all learned primarily from 20,000+ hours of egocentric human video with no robot in the loop.
Humans are the most scalable embodiment on the planet. We discovered a near-perfect log-linear scaling law (R² = 0.998) between human video volume and action prediction loss, and this loss directly predicts real-robot success rate.
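A log-linear relationship like this can be checked with an ordinary least-squares fit of loss against the logarithm of data volume. The sketch below is only an illustration, not the authors' analysis code: the function name `fit_log_linear` is hypothetical and the data points are made up for demonstration.

```python
import math


def fit_log_linear(hours, losses):
    """Fit losses ~ a + b * log10(hours) by least squares.

    Returns (a, b, r_squared). Hypothetical helper for illustration.
    """
    xs = [math.log10(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    # Closed-form simple linear regression on (log10(hours), loss).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Coefficient of determination (R^2) of the fitted line.
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, losses))
    ss_tot = sum((y - mean_y) ** 2 for y in losses)
    return a, b, 1.0 - ss_res / ss_tot


# Made-up numbers for illustration only, NOT results from the post.
demo_hours = [1_000, 2_000, 5_000, 10_000, 20_000]
demo_losses = [0.0300, 0.0271, 0.0229, 0.0201, 0.0169]
a, b, r2 = fit_log_linear(demo_hours, demo_losses)
print(f"loss ~ {a:.4f} + {b:.4f} * log10(hours), R^2 = {r2:.4f}")
```

A negative slope `b` means each tenfold increase in video hours lowers the loss by a roughly constant amount, which is what "log-linear" means here.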
Humanoid robots will be the end game, because they are the practical form factor with minimal embodiment gap from humans. Call it the Bitter Lesson of robot hardware: the kinematic similarity lets us simply retarget human finger motion onto dexterous robot hand joints. No learned embeddings, no fancy transfer algorithms needed. Relative wrist motion + retargeted 22-DoF finger actions serve as a unified action space that carries through from pre-training to robot execution.
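One way to picture the unified action space: express the wrist's next pose in its current frame, then append the 22 retargeted finger joint targets. The sketch below assumes a specific encoding (relative translation, flattened 3x3 relative rotation, then joint angles) that the post does not specify. All function names are hypothetical and this is not the EgoScale implementation.

```python
import math


def transpose(R):
    """Transpose a 3x3 matrix (the inverse, for a rotation matrix)."""
    return [[R[j][i] for j in range(3)] for i in range(3)]


def matmul(A, B):
    """Multiply two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(R, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]


def relative_wrist_motion(R_t, p_t, R_next, p_next):
    """Wrist pose at the next step, expressed in the current wrist frame."""
    R_inv = transpose(R_t)
    R_rel = matmul(R_inv, R_next)
    p_rel = matvec(R_inv, [p_next[i] - p_t[i] for i in range(3)])
    return R_rel, p_rel


def build_action(R_t, p_t, R_next, p_next, finger_joints):
    """Assumed layout: 3 translation + 9 rotation + 22 finger values."""
    if len(finger_joints) != 22:
        raise ValueError("expected 22 finger joint values")
    R_rel, p_rel = relative_wrist_motion(R_t, p_t, R_next, p_next)
    return p_rel + [x for row in R_rel for x in row] + list(finger_joints)


def rot_z(theta):
    """Rotation about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Illustrative values only: wrist yawed 90 degrees, moving 2 units along world +y.
R = rot_z(math.pi / 2)
action = build_action(R, [1.0, 0.0, 0.0], R, [1.0, 2.0, 0.0], [0.0] * 22)
print(len(action), action[:3])
```

Because the wrist motion is relative to the wrist's own frame, the same action vector describes the same motion regardless of where the camera or the robot base sits, which is one plausible reason such a representation carries over from human video to robot execution.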
Our recipe is called "EgoScale":
- Pre-train GR00T N1.5 on 20K hours of human video, mid-train with only 4 hours (!) of robot play data with Sharpa hands. 54% gains over training from scratch across 5 highly dexterous tasks.
- Most surprising result: a *single* teleop demo is sufficient to learn a never-before-seen task. Our recipe enables extreme data efficiency.
- Although we pre-train in 22-DoF hand joint space, the policy transfers to a Unitree G1 with 7-DoF tri-finger hands. 30%+ gains over training on G1 data alone.
The scalable path to robot dexterity was never more robots. It was always us.
Deep dives in thread:
We’ve reached 25K hours of real-world egocentric (POV) human activity data.
Covering multiple agents × environments × strategies: the same goal, different paths; the same scene, different decisions.
If your model must generalize, diversity is essential.
@niccruzpatane A big part of why robots can achieve this now is the data.
When training includes large-scale, real-world human behavior with natural corrections, noise, and long-horizon structure, these kinds of behaviors stop being “impressive demos” and start being learnable.
@saranormous A huge part of the “divergence” is data + distribution: what real-world behavior looks like vs. what we train on. The gap is still massively underpriced.
@CyberRobooo DualWorld looks amazing for whole-body control. Massive unscripted human POV data could take its predictive power to the next level. Loving the progress in this space!
We recently assembled a real-world POV manipulation demo (home and kitchen) for quick sanity checks.
- Continuous, unscripted human behavior.
- Task-level and action-level temporal annotations.
- Diverse users and environments.
If you’re working on embodied or manipulation models, this is the kind of data you want to look at.
Happy to share the demo if useful.
For teams looking to go deeper, we can keep expanding diverse, high-value data and tailor the processing pipeline to specific training needs, efficiently at scale.
We’re nearing 5,000 hours of real-world egocentric POV manipulation data.
Collected across different people and real-world environments, with varied object layouts and execution styles, all from natural, unscripted first-person behavior.
Human data matters when it is captured at scale across diverse real-world scenes and behaviors.
At EgoScale, we deliver large-scale real-world egocentric (POV) data across users, scenes, and behaviors. It is already used by multiple teams to train and evaluate real-world policies.
@pathak2206 @EdLudlow @CarolineHydeTV @SkildAI This is exactly the area we’ve been working on recently.
We actually put together a small real-world POV demo that made some of these gaps very obvious.
@BrianRoemmele Bimanual fine-motor skills like this need real-world human demos.
EgoScale collects unscripted egocentric POV workflows for embodied robots.