The most important point: XHugWBC controls arbitrary humanoids, not just existing ones. We show that it generalizes to existing embodiments zero-shot, so no additional effort is required to build a controller for a new humanoid.
Build your own humanoid robot, and XHugWBC can take control.
From H-Zero (https://t.co/zHuO4Di6GN) to XHugWBC (https://t.co/TDLktrPVtS), we find that the policies of all humanoids, although their structures vary widely, can be learned in a single network.
The key is: 1) Physics-Consistent Morphological Randomization + 2) Universal Cross-Embodiment Representation
Check our project page to see more!
Not saying that's what they see, but if you have data and compute, in the long run, from scratch always wins. Here is one striking example from our distillation paper; blue is random init, yellow is init with sota model *on the same target task*. And I've seen more variants of this over the years.
Can we build generalist robots with zero teleoperation? Come participate in the discussion and weigh in at our ICRA'26 workshop, BeyondTeleop, starting at 8.45 am CEST today (June 5th)!
📍 Strauss 3
Introducing NVIDIA Cosmos 3
We released NVIDIA Cosmos 3 last night.
And today, seeing it take the top spots across 8+ open model leaderboards feels surreal. We spent months working towards this moment.
Here’s the breakdown:
The Leaderboard Wins
World Reasoning
🏆 #1 open model on VANTAGE-Bench for vision AI
🏆 #1 overall on Traffic Anomaly Reasoning (TAR)
World Generation
🏆 #1 open model on Artificial Analysis Image-to-Video leaderboard
🏆 #1 open model on Artificial Analysis Text-to-Image leaderboard
🏆 #1 open model on PAI-Bench for physical AI synthetic data generation
🏆 #1 open model on Physics-IQ, which measures accuracy on physical laws
🏆 #1 open model on R-Bench for world generation quality
World Action
🏆 #1 on RoboArena for specialized policy
🏆 #1 on RoboLab for action generation
But the leaderboards are only part of the story. The real story is why we built Cosmos 3 in the first place.
The Problem
Training robots and autonomous systems in the real world is painfully hard.
Robots need to try the same thing numerous times before they succeed reliably. Self-driving cars need rare edge cases that may never happen naturally. Smart machines need to understand physics, motion, contact, failure, and surprise.
And real-world data is slow, expensive, and sometimes dangerous to collect. At some point, the answer cannot just be “collect more data.”
You can’t collect your way out of an infinite physical world. You have to generate it.
That… was the question behind Cosmos: Can one model understand the physical world deeply enough to reason about it, simulate it, and generate actions inside it?
What We Built
Cosmos 3 is the first omni-model for physical AI. It can understand and generate across: language · images · video · audio · action sequences
It is not just a VLM.
Not just a video generator.
Not just a robot policy model.
It is all of them, in one single model.
That matters because physical AI has been fragmented for a long time. Cosmos 3 is our attempt to collapse that fragmentation.
Depending on how you configure the inputs and outputs, the same model can act as a vision-language model, a video/world generator, a world simulator, or a world-action model.
No separate architecture required.
The Architecture
Under the hood, Cosmos 3 uses a dual-tower Mixture-of-Transformers architecture.
One tower is autoregressive for reasoning. It handles next-token prediction for language and discrete understanding.
The other tower is diffusion-based, for generation. It denoises images, video, audio, and action trajectories.
Two towers. Dual-stream joint attention. One shared world representation.
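For readers who want to see what "dual-stream joint attention" could mean concretely, here is a toy sketch. It is not Cosmos 3's implementation: the dimensions, the random weights, and names such as `make_tower` and `joint_attention` are assumptions made up for illustration. The idea shown is that each tower keeps its own Q/K/V weights, while attention runs over the concatenation of both token streams.

```python
import math
import random

def matmul(a, b):
    # Plain-Python matrix product: a is (n x k), b is (k x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def make_tower(dim, rng):
    # Each tower owns separate Q/K/V weights (hypothetical toy parameters).
    return {name: random_matrix(dim, dim, rng) for name in ("q", "k", "v")}

def joint_attention(ar_tokens, diff_tokens, ar_tower, diff_tower):
    dim = len(ar_tokens[0])
    # Project each stream with its own tower's weights, then concatenate rows.
    q = matmul(ar_tokens, ar_tower["q"]) + matmul(diff_tokens, diff_tower["q"])
    k = matmul(ar_tokens, ar_tower["k"]) + matmul(diff_tokens, diff_tower["k"])
    v = matmul(ar_tokens, ar_tower["v"]) + matmul(diff_tokens, diff_tower["v"])
    # Attend over the joint sequence so the two streams can see each other.
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    out = matmul(weights, v)
    n_ar = len(ar_tokens)
    return out[:n_ar], out[n_ar:], weights

rng = random.Random(0)
dim = 4
ar_tokens = random_matrix(3, dim, rng)    # e.g. 3 text tokens
diff_tokens = random_matrix(2, dim, rng)  # e.g. 2 noisy video latents
ar_out, diff_out, weights = joint_attention(
    ar_tokens, diff_tokens, make_tower(dim, rng), make_tower(dim, rng)
)
```

This sketch leaves out causal masking, multiple heads, and the per-tower feed-forward layers, all of which a real model would need.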
Each modality gets its own tools: visual encoders, video VAEs, audio VAEs, and action projectors that can map different embodiments into a unified action space.
Action is a first-class modality in Cosmos 3.
That’s what makes it more than a video model. It doesn’t just predict and generate what the world might look like. It can connect reasoning and world modeling to physically grounded action.
Why This Matters
One of the most interesting findings from the ablation work is that training action domains together creates positive transfer.
That means adding more embodiments does not just add more use cases. It can actually make the model better.
This is the heart of why omnimodal training matters.
A shared world representation is not just convenient. It can make each individual task stronger. That’s the part that feels like the beginning of something much bigger.
The part I’m most excited about is that Cosmos 3 is fully open.
Developers get the models, scripts, optimization, inference endpoints, post-training recipes, datasets, and benchmarks.
Everything is available under the Linux Foundation’s OpenMDW 1.1 License.
You can use Cosmos 3 out of the box. You can use the VLM, world model, or world-action pieces separately.
You can post-train it for your own domain, embodiment, or accuracy target.
That’s what makes this feel different.
Cosmos 3 is not just a model release. It is the foundation for building intelligence for autonomous machines.
For me, Cosmos 3 feels like a step toward a world where physical AI development becomes much more scalable and accessible - to a new age of developers and agents.
That’s what we built Cosmos 3 for. I cannot wait to see what you build with it.
Download Models on Hugging Face
https://t.co/LAZoVygeim
Customize Models on GitHub
https://t.co/ZVQBNdqXDD
Read the Tech Blog to Learn More
https://t.co/Hn6Op9YeG1
Humanoids need data. Lots and lots of data.
Introducing HumanoidMimicGen: a method that automatically generates 1000s of humanoid loco-manipulation demonstrations from a single teleoperated demonstration.
We are back again :) After three weeks of quiet building.
Introducing Genesis World 1.0, our latest simulation platform, the second release in our full-stack suite. Open-sourced.
Robotics is still bottlenecked by the 1× speed of the physical world. Every model, checkpoint, and data recipe eventually needs to be tested on physical hardware, slowly, expensively, and with limited coverage.
One hour in reality can become 100 days in simulation. That is how robotics model iteration moves from a wall-clock bottleneck to a compute problem.
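As a quick sanity check on that ratio, here is the arithmetic as a small sketch. The per-environment speed of 10x real time is a made-up assumption to show how the total could split between faster-than-real-time stepping and parallel environments; it is not a Genesis figure.

```python
import math

HOURS_PER_DAY = 24

def sim_speedup(sim_days, real_hours):
    # Ratio of simulated experience to wall-clock time.
    return sim_days * HOURS_PER_DAY / real_hours

def envs_needed(speedup, per_env_speed):
    # Parallel environments required if each one runs at per_env_speed x real time.
    return math.ceil(speedup / per_env_speed)

speedup = sim_speedup(sim_days=100, real_hours=1)
parallel_envs = envs_needed(speedup, per_env_speed=10.0)  # hypothetical 10x per env
print(speedup, parallel_envs)  # 2400.0 240
```

So "100 days per hour" means an aggregate 2400x over real time, whichever way it is reached.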
To make this work, simulation has to be both fast and trustworthy.
Over the past year, we rebuilt the entire stack: a GPU-accelerated cross-platform compiler, penetration-free multi-physics contact solvers, unified rigid and deformable physics, and a photo-realistic renderer purpose-built for physical AI applications.
We built Nyx, a high-performance path-traced rendering engine for robotics applications.
Genesis World 1.0 achieves near-real-time performance with our latest penetration-free IPC solver, which supports various types of deformables in addition to rigid bodies. It supports contact-rich, dexterous manipulation simulation across different embodiments: Unitree, Sharpa, Wuji, Genesis Hand, and various types of grippers.
Under the hood is Quadrants, our effort to push forward cross-platform GPU-accelerated computation. Quadrants started as a fork of Taichi, and we rebuilt most of the critical parts to optimize simulation workloads, giving 10x faster launch time and up to 4.6x faster runtime performance compared to the initial Genesis release.
Together, they bring us to an unprecedentedly low sim-to-real gap, enabling zero-shot real-to-sim model evaluation and much faster iteration of GENE.
All available today.
Genesis World 1.0: https://t.co/aknCM3eqws
Quadrants: https://t.co/uXqPNI4cb6
Nyx: https://t.co/R8j0djqGnV
AMI Labs just raised $1.03B. World Labs raised $1B a few weeks earlier. Both are betting on world models.
But almost nobody means the same thing by that term.
Here are, in my view, five categories of world models.
---
1. Joint Embedding Predictive Architecture (JEPA)
Representatives: AMI Labs (@ylecun), V-JEPA 2
The central bet here is that pixel reconstruction alone is an inefficient objective for learning the abstractions needed for physical understanding. LeCun has been saying this for years — predicting every pixel of the future is intractable in any stochastic environment. JEPA sidesteps this by predicting in a learned latent space instead.
Concretely, JEPA trains an encoder that maps video patches to representations, then a predictor that forecasts masked regions in that representation space — not in pixel space.
This is a crucial design choice.
A generative model that reconstructs pixels is forced to commit to low-level details (exact texture, lighting, leaf position) that are inherently unpredictable. By operating on abstract embeddings, JEPA can capture "the ball will fall off the table" without having to hallucinate every frame of it falling.
V-JEPA 2 is the clearest large-scale proof point so far. It's a 1.2B-parameter model pre-trained on 1M+ hours of video via self-supervised masked prediction — no labels, no text. The second training stage is where it gets interesting: just 62 hours of robot data from the DROID dataset is enough to produce an action-conditioned world model that supports zero-shot planning. The robot generates candidate action sequences, rolls them forward through the world model, and picks the one whose predicted outcome best matches a goal image. This works on objects and environments never seen during training.
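The planning loop described above (sample candidate action sequences, roll them through the world model, keep the one closest to the goal) can be sketched in a few lines. This is a toy under loud assumptions: `predict_next` is a stand-in with made-up additive dynamics rather than a learned predictor, the latents are 2-D points, and the search is plain random shooting, which is not necessarily the optimizer V-JEPA 2 uses.

```python
import random

def predict_next(latent, action):
    # Stand-in for a learned action-conditioned predictor (toy additive dynamics).
    return tuple(z + a for z, a in zip(latent, action))

def rollout(latent, actions):
    # Roll a candidate action sequence forward entirely in latent space.
    for action in actions:
        latent = predict_next(latent, action)
    return latent

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def plan(start, goal, horizon=5, num_candidates=256, seed=0):
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(num_candidates):
        actions = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(horizon)]
        # Score a candidate by how close its predicted outcome is to the goal.
        cost = distance(rollout(start, actions), goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

start_latent = (0.0, 0.0)
goal_latent = (2.0, -1.0)  # in practice: the encoder's embedding of a goal image
actions, cost = plan(start_latent, goal_latent)
```

In a real setup the robot would typically execute only the first action and re-plan from the new observation.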
The data efficiency is the real technical headline. 62 hours is almost nothing. It suggests that self-supervised pre-training on diverse video can bootstrap enough physical prior knowledge that very little domain-specific data is needed downstream. That's a strong argument for the JEPA design — if your representations are good enough, you don't need to brute-force every task from scratch.
AMI Labs is LeCun's effort to push this beyond research. They're targeting healthcare and robotics first, which makes sense given JEPA's strength in physical reasoning with limited data. But this is a long-horizon bet — their CEO has openly said commercial products could be years away.
---
2. Spatial Intelligence (3D World Models)
Representative: World Labs (@drfeifei)
Where JEPA asks "what will happen next," Fei-Fei Li's approach asks "what does the world look like in 3D, and how can I build it?"
The thesis is that true understanding requires explicit spatial structure — geometry, depth, persistence, and the ability to re-observe a scene from novel viewpoints — not just temporal prediction.
This is a different bet from JEPA: rather than learning abstract dynamics, you learn a structured 3D representation of the environment that you can manipulate directly.
Their product Marble generates persistent 3D environments from images, text, video, or 3D layouts. "Persistent" is the key word — unlike a video generation model that produces a linear sequence of frames, Marble's outputs are actual 3D scenes with spatial coherence. You can orbit the camera, edit objects, export meshes. This puts it closer to a 3D creation tool than to a predictive model, which is deliberate.
For context, this builds on a lineage of neural 3D representation work (NeRFs, 3D Gaussian Splatting) but pushes toward generation rather than reconstruction. Instead of capturing a real scene from multi-view photos, Marble synthesizes plausible new scenes from sparse inputs. The challenge is maintaining physical plausibility — consistent geometry, reasonable lighting, sensible occlusion — across a generated world that never existed.
---
3. Learned Simulation (Generative Video + Latent-Space RL)
Representatives: Google DeepMind (Genie 3, Dreamer V3/V4), Runway GWM-1
This category groups two lineages that are rapidly converging: generative video models that learn to simulate interactive worlds, and RL agents that learn world models to train policies in imagination.
The video generation lineage. DeepMind's Genie 3 is the purest version: text prompt in, navigable environment out, 24 fps at 720p, with consistency for a few minutes. Rather than relying on an explicit hand-built simulator, it learns interactive dynamics from data. The key architectural property is autoregressive generation conditioned on user actions: each frame is generated based on all previous frames plus the current input (move left, look up, etc.). This means the model must maintain an implicit spatial memory: turn away from a tree and turn back, and it needs to still be there. DeepMind reports visual memory reaching back about a minute, which is impressive but still far from what you'd need for sustained agent training.
Runway's GWM-1 takes a similar foundation, autoregressive frame prediction built on Gen-4.5, but splits into three products: Worlds, Robotics, and Avatars. That split suggests the practical generality problem is still being decomposed by action space and use case.
The RL lineage. The Dreamer series has the longer intellectual history. The core idea is clean: learn a latent dynamics model from observations, then roll out imagined trajectories in latent space and optimize a policy via backpropagation through the model's predictions. The agent never needs to interact with the real environment during policy learning.
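To make "optimize a policy through the model's predictions" concrete, here is a deliberately tiny sketch. Everything in it is an assumption for illustration: a 1-D latent, known toy dynamics z' = z + a standing in for a learned model, a one-parameter policy a = -k * z, and a hand-derived chain rule in place of the automatic differentiation a real agent would use.

```python
def imagine(k, z0=1.0, horizon=5):
    """Roll out the toy model in imagination; return (return, d_return/d_k)."""
    z, dz_dk = z0, 0.0
    total, grad = 0.0, 0.0
    for _ in range(horizon):
        action = -k * z                        # policy: a = -k * z
        da_dk = -z - k * dz_dk                 # chain rule through the policy
        z, dz_dk = z + action, dz_dk + da_dk   # toy latent dynamics: z' = z + a
        total += -z * z                        # reward: stay near the origin
        grad += -2.0 * z * dz_dk               # chain rule through reward and model
    return total, grad

k = 0.1
initial_return, _ = imagine(k)
for _ in range(200):
    _, grad = imagine(k)
    k += 0.05 * grad  # gradient ascent on the imagined return
final_return, _ = imagine(k)
```

No environment step happens anywhere in the loop: the policy parameter improves using only rollouts of the model, which is the property the paragraph above describes.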
Dreamer V3 was the first AI to get diamonds in Minecraft without human data. Dreamer 4 did the same purely offline, with no environment interaction at all. Architecturally, Dreamer 4 moves from the series' earlier recurrent-style lineage to a more scalable transformer-based world-model recipe and introduces "shortcut forcing," a training objective that lets the model jump from noisy to clean predictions in just 4 steps instead of the 64 typical in diffusion models. This is what makes real-time inference on a single H100 possible.
These two sub-lineages used to feel distinct: video generation produces visual environments, while RL world models produce trained policies.
But Dreamer 4 blurred the line — humans can now play inside its world model interactively, and Genie 3 is being used to train DeepMind's SIMA agents.
The convergence point is that both need the same thing: a model that can accurately simulate how actions affect environments over extended horizons.
The open question for this whole category is one LeCun keeps raising: does learning to generate pixels that look physically correct actually mean the model understands physics? Or is it pattern-matching appearance? Dreamer 4's ability to get diamonds in Minecraft from pure imagination is a strong empirical counterpoint, but it's also a game with discrete, learnable mechanics — the real world is messier.
---
4. Physical AI Infrastructure (Simulation Platform)
Representative: NVIDIA Cosmos
NVIDIA's play is: don't build the world model, build the platform everyone else uses to build theirs.
Cosmos launched at CES in January 2025 and covers the full stack: a data curation pipeline (process 20M hours of video in 14 days on Blackwell, vs. 3+ years on CPU), a visual tokenizer with 8x better compression than prior SOTA, model training via NeMo, and deployment through NIM microservices.
The pre-trained world foundation models are trained on 9,000 trillion tokens from 20M hours of real-world video spanning driving, industrial, robotics, and human activity data.
They come in two architecture families: diffusion-based (operating on continuous latent tokens) and autoregressive transformer-based (next-token prediction on discretized tokens). Both can be fine-tuned for specific domains.
Three model families sit on top of this.
Predict generates future video states from text, image, or video inputs — essentially video forecasting that can be post-trained for specific robot or driving scenarios.
Transfer handles sim-to-real domain adaptation, which is one of the persistent headaches in physical AI — your model works great in simulation but breaks in the real world due to visual and dynamics gaps.
Reason (added at GTC 2025) brings chain-of-thought reasoning over physical scenes — spatiotemporal awareness, causal understanding of interactions, video Q&A.
---
5. Active Inference
Representative: VERSES AI (Karl Friston)
This is the outlier on the list — not from the deep learning tradition at all, but from computational neuroscience.
Karl Friston's Free Energy Principle says intelligent systems continuously generate predictions about their environment and act to minimize surprise (technically: variational free energy, an upper bound on surprise).
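For anyone who wants the "upper bound on surprise" claim spelled out, here is the standard identity, written with assumed notation: o for observations, s for hidden states, p for the agent's generative model, and q for its approximate posterior over states.

```latex
\begin{aligned}
F[q] &= \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big] \\
     &= D_{\mathrm{KL}}\big[q(s)\,\|\,p(s \mid o)\big] - \ln p(o) \\
     &\ge -\ln p(o)
\end{aligned}
```

Because the KL term is never negative, F sits above the surprise -ln p(o), and the bound is tight exactly when q(s) matches the true posterior p(s | o).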
Where standard RL is usually framed around reward maximization, active inference frames behavior as minimizing variational / expected free energy, which blends goal-directed preferences with epistemic value. This leads to natural exploration behavior: the agent is drawn to situations where it's uncertain, because resolving uncertainty reduces free energy.
VERSES built AXIOM (Active eXpanding Inference with Object-centric Models) on this foundation.
The architecture is fundamentally different from neural network world models. Instead of learning a monolithic function approximator, AXIOM maintains a structured generative model where each entity in the environment is a discrete object with typed attributes and relations.
Inference is Bayesian — beliefs are probability distributions that get updated via message passing, not gradient descent. This makes it interpretable (you can inspect what the agent believes about each object), compositional (add a new object type without retraining), and extremely data-efficient.
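A minimal sketch of what "beliefs are probability distributions that get updated" looks like, assuming a single categorical belief about one object. This is not AXIOM's model or its message-passing scheme; the hypotheses, the likelihood numbers, and the `bayes_update` helper are all made up for illustration.

```python
def bayes_update(prior, likelihood):
    # Posterior is proportional to prior times likelihood, normalized to sum to 1.
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: p / evidence for h, p in unnormalized.items()}

# Belief about what kind of object we are looking at (hypothetical types).
belief = {"ball": 0.5, "wall": 0.5}

# Observation: the object moved. Assumed likelihoods of that under each type.
likelihood_moved = {"ball": 0.9, "wall": 0.1}

belief = bayes_update(belief, likelihood_moved)
```

The updated belief stays a readable distribution over named hypotheses, which is the interpretability point made above, and the update needs no gradient steps.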
In their robotics work, they've shown a hierarchical multi-agent setup where each joint of a robot arm is its own active inference agent. The joint-level agents handle local motor control while higher-level agents handle task planning, all coordinating through shared beliefs in a hierarchy. The whole system adapts in real time to unfamiliar environments without retraining — you move the target object and the agent re-plans immediately, because it's doing online inference, not executing a fixed policy.
They shipped a commercial product (Genius) in April 2025, and the AXIOM benchmarks against RL baselines are competitive on standard control tasks while using orders of magnitude less data.
---
imo, these five categories aren't really competing — they're solving different sub-problems.
JEPA compresses physical understanding.
Spatial intelligence reconstructs 3D structure.
Learned simulation trains agents through generated experience.
NVIDIA provides the picks and shovels.
Active inference offers a fundamentally different computational theory of intelligence.
My guess is the lines between them blur fast.
@LightwheelAI closed $100M in Q1 2026 orders.
This marks the start of Physical AI at scale.
Two forces are converging: frontier model teams need high-quality data at scale, and industrial companies need systems built for deployment.
Both point to the same requirement: a continuous infrastructure loop across simulation, data, evaluation, and deployment.
Lightwheel marks this shift by turning Physical AI infrastructure into a deployment engine.
Read more: https://t.co/ErblYTlqAm
#Robotics #PhysicalAI #EmbodiedAI #IndustrialAI #Automation
Reviewer designations are also being finalized. The top 25% of reviewers are "gold," and receive free registration to ICML 2026. The next 25% of reviewers are "silver." These designations will be taken into account in financial aid applications. 3/3
First blog post up on Robotics Simulation Infrastructure! I give a high-level overview, followed by an elementary example of better infrastructure for pose management. Link in thread below
We started Thinking Machines to advance human-AI collaboration, and this is our first bet on what that looks like. Most labs treat autonomy as the goal and interactivity as scaffolding around a turn-based core. We think the way we work with AI matters as much as how smart it is. Interactivity has to be in the model, and it has to scale with intelligence rather than trail behind it.
https://t.co/U4c0uC7tnT
Unitree Unveils: GD01, A Manned Transformable Mecha, from $650,000 👏
The world's first production-ready manned mecha. It can transform. It's a civilian vehicle. It weighs ~500kg with you inside.
Please everyone be sure to use the robot in a Friendly and Safe manner.