A personal update: After two years at 1X, I’m moving on to something new.
I joined 1X to solve general-purpose robotics, through the lens of evaluation.
We bet on humanoid world models early in 2024. I’m proud of our work showing how the 1X World Model can solve the offline evaluation problem: judging policy quality by accurately predicting expected state and reward within the test-time distribution.
We then showcased how these same world models can leverage their understanding of robot manipulation to act as policies, generalizing far beyond the tasks in training data.
Scaled deployments of robots in homes require confidence in policy performance in unknown environments, and generalization across environments and skills.
To my colleagues at 1X: it's been an honor working with you all. I'm inspired by the world-class team and humanoid that we’ve assembled and continue to assemble. In California, we’ve grown from a few robotics researchers in a one-room office to a campus for large-scale manufacturing and research engineering.
I’ve now joined the founding team at Project Prometheus as a member of technical staff.
I've also moved up to San Francisco! Reach out if you'd like to grab coffee and chat AI in the physical world.
Tuned into @itsdanielho (@1x_tech) on @RoboPapers podcast geeking out over 1XWM—inspiring! "Dream success first, then reverse-engineer the actions" paradigm is 🔥 and lol it applies to non-robots too! My takes↓
1️⃣ The World Model's predicted future was extremely close to reality, thanks to action-conditioned video generation (on precise low-level action sequences). At execution time, the Inverse Dynamics Model (IDM) back-infers actions so that the “dreamed perfect trajectory” can be grounded in reality. Controllable + grounded + zero-shot.
2️⃣ Egocentric large-scale mid-training is useful because diversified data expands distribution coverage. Scalable and low-cost.
3️⃣ Granular training (second-by-second). Use VLM for caption upsampling, from coarse task to second-by-second play-by-play. Similar to Sora fine-grained prompt engineering, but more applicable to robot control. Granularity makes the world model capture causal chains, not just spectacle.
4️⃣ Both success and failure videos are used to train the world model. Success videos reinforce correct physics, failure videos provide negative examples. This makes imagination robust: the model can generate diverse futures (including bad ones), and a value function selects the best.
5️⃣ World model evaluating world model (recursive eval) is interesting. The current 1XWM can do self-eval (model-evaluating-model): generate multiple rollout videos; use an internal value function or visual signals to estimate success probability; execute the highest-scoring trajectory. A more advanced loop might be: use WM rollouts as synthetic data to predict success rate when ablating training data; retrain/improve the WM; run offline policy optimization (Dreamer-style, millions of dream iterations). Instead of directly learning a policy and relying on real rollouts for eval (expensive), using the World Model for dream-time eval / in-simulation assessment can scale to break through the data wall and generalize exponentially.
6️⃣ The Inverse Dynamics Model (IDM) is the bridging component that translates World Model video sequences into executable low-level robot actions. It's the cerebellum/translator. Given adjacent generated frames, it infers the action commands required to transition from frame A to frame B. The World Model generates multiple rollouts with stochastic sampling, then the IDM performs frame-to-frame inversion to recover the action sequences and candidate trajectories, applying rejection sampling to discard dreams whose inferred actions violate kinematic constraints and asking the WM to regenerate. Training the IDM separately is more efficient (on smaller, precise data), while the WM is pretrained on massive data (strong generalization). This architecture enables video prior + grounded embodiment. Instead of direct end-to-end VLA action prediction, the WM + IDM "imagine-then-invert" paradigm is like "dream success first, then reverse-engineer actions", with higher visual alignment in zero-shot long-horizon tasks and easier offline evals.
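To make takes 5 and 6 concrete, here is a minimal Python sketch of the "imagine-then-invert" loop as described above: sample several rollouts, invert each frame pair into an action, reject rollouts with infeasible actions, and keep the highest-scoring one. This is not 1X's implementation. Every name here (`plan`, `invert_rollout`, the `toy_*` helpers) is hypothetical, frames are stand-in vectors rather than images, and the toy world model, IDM, feasibility limit, and value function are assumptions chosen only so the example runs.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional

Frame = List[float]   # stand-in for a video frame (assumption: a joint-position vector)
Action = List[float]  # stand-in for a low-level command (assumption: a joint delta)


@dataclass
class Candidate:
    frames: List[Frame]
    actions: List[Action]
    score: float


def invert_rollout(frames: List[Frame], idm: Callable[[Frame, Frame], Action]) -> List[Action]:
    """Frame-to-frame inversion: one inferred action per adjacent frame pair."""
    return [idm(a, b) for a, b in zip(frames, frames[1:])]


def plan(
    start: Frame,
    sample_rollout: Callable[[Frame], List[Frame]],
    idm: Callable[[Frame, Frame], Action],
    is_feasible: Callable[[Action], bool],
    value_fn: Callable[[List[Frame]], float],
    num_samples: int = 8,
    max_attempts: int = 32,
) -> Optional[Candidate]:
    """Sample rollouts, reject infeasible ones, return the best-scoring survivor."""
    candidates: List[Candidate] = []
    attempts = 0
    while len(candidates) < num_samples and attempts < max_attempts:
        attempts += 1
        frames = sample_rollout(start)
        actions = invert_rollout(frames, idm)
        if not all(is_feasible(a) for a in actions):
            continue  # rejection sampling: discard this dream and regenerate
        candidates.append(Candidate(frames, actions, value_fn(frames)))
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.score)


# --- Toy stand-ins (all hypothetical) so the sketch can be run end to end ---

def toy_rollout(start: Frame, rng: random.Random, horizon: int = 5) -> List[Frame]:
    """Random-walk 'world model': each frame perturbs the previous one."""
    frames = [list(start)]
    for _ in range(horizon):
        frames.append([x + rng.uniform(-0.3, 0.3) for x in frames[-1]])
    return frames


def toy_idm(a: Frame, b: Frame) -> Action:
    """Toy inverse dynamics: the action is the per-joint difference."""
    return [y - x for x, y in zip(a, b)]


def toy_feasible(action: Action, limit: float = 0.25) -> bool:
    """Toy kinematic constraint: no joint may move more than `limit` per step."""
    return all(abs(d) <= limit for d in action)


def toy_value(frames: List[Frame], goal: Frame = (1.0, 1.0)) -> float:
    """Toy value function: negative squared distance of the final frame to a goal."""
    return -sum((x - g) ** 2 for x, g in zip(frames[-1], goal))
```

A call such as `plan([0.0, 0.0], lambda s: toy_rollout(s, random.Random(0)), toy_idm, toy_feasible, toy_value)` returns either the best feasible candidate or `None` when every sampled dream is rejected. Keeping the IDM, feasibility check, and value function as separate callables mirrors the point above that these pieces can be trained and swapped independently of the world model.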
👉🏻https://t.co/1VUvkAIPoY
Check out this @RoboPapers pod for an overview of the past year of our world model research @1x_tech! We're very excited about world model architectures to achieve truly generalizable robot policies and evaluators.
NEO will be able to zero-shot tasks in homes, learn rapidly with autonomy data, and predict how freshly baked models perform. This will usher in the era of home robots.
Every home is different. That means that to build a useful home robot, we must be able to perform zero-shot generalization on a wide range of tasks. Humanoid company @1x_tech has a solution: world models.
1X Director of Evaluations @itsdanielho joins us on RoboPapers to talk about:
- why world models are the future for scaling robot learning
- how to use world models for robot control
- what world models unlock for evaluating robot model performance
- how we can hill-climb from here to general purpose robots
Watch Episode #61 of RoboPapers, with @micoolcho and @chris_j_paxton, now!
World-model-based policies like the 1XWM we shared yesterday enable preference feedback during post-training and also test-time compute, because the model generates interpretable state
One of the unlocks from this new type of architecture below the headlines
One of many next steps at @1x_tech: preference learning for world-model-based policies.
Given a generated starting frame, we can sample multiple video rollouts from our WM and use preference feedback to steer the model toward higher-quality behavior.
This lets us fix policy failures in synthetic worlds—resolving bad NEO behaviors with generated dogs before we ever meet real ones.
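The post does not say how the preference feedback is turned into a training signal. One common formulation, used here purely as an assumption, is a Bradley-Terry pairwise loss: fit a scorer so that preferred rollouts score higher than rejected ones. The sketch below fits a linear scorer over hand-made rollout features; the function names and the idea of summarising a rollout as a small feature vector are hypothetical, not 1X's method.

```python
import math
from typing import List, Sequence, Tuple

Features = Sequence[float]              # hypothetical summary of one video rollout
Pair = Tuple[Features, Features]        # (preferred rollout, rejected rollout)


def score(weights: Sequence[float], features: Features) -> float:
    """Linear score of a rollout under the current weights."""
    return sum(w * f for w, f in zip(weights, features))


def preference_loss(weights: Sequence[float], preferred: Features, rejected: Features) -> float:
    """Bradley-Terry negative log-likelihood that `preferred` beats `rejected`."""
    margin = score(weights, preferred) - score(weights, rejected)
    return math.log1p(math.exp(-margin))


def fit_scorer(pairs: List[Pair], dim: int, lr: float = 0.5, steps: int = 200) -> List[float]:
    """Plain gradient descent on the mean pairwise loss."""
    weights = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for preferred, rejected in pairs:
            margin = score(weights, preferred) - score(weights, rejected)
            coeff = -1.0 / (1.0 + math.exp(margin))  # derivative of the loss w.r.t. margin
            for i in range(dim):
                grad[i] += coeff * (preferred[i] - rejected[i])
        weights = [w - lr * g / len(pairs) for w, g in zip(weights, grad)]
    return weights


# Made-up illustrative data: each rollout is [distance kept from the dog, task progress].
example_pairs: List[Pair] = [
    ([0.9, 0.5], [0.1, 0.6]),
    ([0.8, 0.4], [0.2, 0.4]),
    ([0.7, 0.9], [0.3, 0.2]),
]
learned = fit_scorer(example_pairs, dim=2)
```

A scorer fitted this way could then rank freshly sampled rollouts from the same starting frame, or serve as a signal for fine-tuning; which of these (if either) matches the actual pipeline is not stated in the post.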
@_joe_harris_ in our blog post we show side-by-side comparisons between generations and real rollouts for a bunch of tasks: https://t.co/uAGWCKShhS
Next up: we will speed up model inference, minimize latency, and re-plan when conditions drift
Excited to share our latest work on world models as robot policies!
NEO executes novel manipulation tasks from text prompts, deriving actions from text-conditioned video generation.
We found strong alignment between world model generations and real rollouts, and sufficient controllability to control NEO accurately.
1/n
One of the coolest examples we found is NEO holding up a peace sign
WM both understands what a peace sign is and is self aware (no hands in starting frame) + the IDM extracts finger level actions :)
@christyjestin @ridcursion Because NEO’s embodiment is so close to human form, we found promising zero-shot transfer even without overlap on the task-specific data. For example, we have 98.5% pick and place and tested transfer, which worked well
@PotEl0000 @btfdNOID @1x_tech Good question, you’re correct that our current world model work doesn’t solve these delayed and higher level tasks. Stay tuned for orchestration work where we solve things like this!