Introducing SERF: a spatiotemporal environment and robot feature map for long-horizon mobile manipulation.
We demonstrate that conditioning a mobile manipulation policy on a SERF map enables long-horizon reasoning.
New work with @nvidia: evaluating robot policies entirely inside a world model. The policy acts, the model imagines the consequences, and the imagined evals predict real-world results. 🧵
real vs world-model rollout side by side📷
Hot take: robots should not dream in pixels.
Pixels are too low-level.
Latents are too opaque.
μ₀ predicts a third thing:
3D motion traces.
On real robots, it beats π₀.₅ — with ~1/100 the data scale and no action labels for world-model pretraining. 🧵
https://t.co/UfmrqNlBtw
(this video features voiceover narration)
I am very happy to share the result of my internship at FAIR (Meta): V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning with @ylecun@AdrienBardes
Our approach learns dense, spatially coherent features from video while preserving strong global understanding
Is pixel prediction the best way to build a world model?
Check out VDAWorld, an alternative path to building interpretable, editable, and physically grounded world models.
We use a VLM to build a simulation of the scene with the help of a computer vision toolbox.
1/ General-purpose robotics is the rare technological frontier where the US / China started at roughly the same time and there's no clear winner yet.
To better understand the landscape, @zoeytang_1007, @intelchentwo, @vishnuman0 and I spent the last ~8 weeks creating a deep dive on humanoid robotics hardware and flew to China to see the supply chain firsthand.
Here's everything we've created + our takeaways about the components, humanoid comparisons, supply chains, and geopolitics👇
🚀 AGIBOT Genie Sim 3.0: First LLM-Driven Open-Source Simulation Platform for Embodied AI @ CES 2026
Integrated with NVIDIA Isaac Sim, AGIBOT’s all-new Genie Sim 3.0 (https://t.co/LcGXCa8168) debuts at CES - delivering a unified toolchain for digital asset generation, scene generalization, data collection, and automated evaluation.
Key highlights:
✅ 10,000+ hours of synthetic dataset (real-world robot operation scenarios) — the largest open-source dataset for embodied AI.
✅ High-fidelity environments via 3D reconstruction + visual generation fusion.
✅ LLM-driven tech: Generate massive simulation scenes & metrics in minutes.
✅ 200+ tasks across 100k+ scenarios for comprehensive model evaluation.
Fully open-source (code, data, assets: https://t.co/CmKe4FEAUb) - accelerating model development, cutting physical hardware reliance, and fueling embodied AI innovation!
#AGIBOT #GenieSim3 #AGIBOTG2 #EmbodiedAI #OpenSource #CES2026 #AGIBOTatCES2026 #NVIDIA #HumanoidRobot #Robotics #Innovation
Long horizon robotics is hard.
Robots drift, forget the goal, and fall apart when a task takes hundreds of steps.
This small independent team hit rank 1 on Stanford’s Behavior1K benchmark by Fei Fei Li’s lab…
Ahead of teams from NVIDIA, CMU, and several big labs.
What they found useful:
✅ Split each task into stages so the policy always knows where it is
Keeps the robot on track for long sequences
✅ Predict a chunk of actions instead of step by step
Reduces jitter and creates smoother motion
✅ Add correlated noise during training
Matches real robot behavior better and makes the policy more stable
✅ Use many samples per forward pass
Lowers variance and helps when trajectories are long
✅ Keep high level context isolated from noisy signals
Prevents the model from getting confused mid task
These ideas transfer well to real robots:
Stage awareness, chunked execution, realistic noise, and clean attention flow all reduce drift and make long tasks possible.
All built by a small independent group:
@IliaLarchenko@akashkarnatak, and Gleb Zarin. Thanks for pointing out, @abhishekX697.
I have no scientific verdict on the full method, but the ideas are interesting and worth reading.
—-
Weekly robotics and AI insights. Subscribe free: https://t.co/dsa6wcvq6n
Latent Theory of Mind:
Meet LatentToM, a powerful framework for robots to understand and predict human intentions in real-time. - - learning compressed latent representations of human mental states
- seamless collaboration and adaptive behavior,
- 3x more sample-efficient.
Fei-Fei Li (@drfeifei) on limitations of LLMs.
"There's no language out there in nature. You don't go out in nature and there's words written in the sky for you.. There is a 3D world that follows laws of physics."
Language is purely generated signal.
Check out our study in @NatMachIntell, where we use AI copilots for brain-computer interfaces (BCIs): https://t.co/IWhprtemFR. We enable a paralyzed participant to neurally control a computer cursor and a robotic arm (non-invasive, no surgery). Video here: https://t.co/dZG8h6xb9C
🛠️ What if a robot could invent its own tools. And teach itself how to use them?
That’s exactly what VLMgineer does:
a new framework that lets Vision Language Models (VLMs) design physical tools and the actions to use them, entirely on their own.
No templates.
No human demonstrations.
Just raw, AI-driven creativity.
Why it matters
✅ Co-designs tools and actions together using VLMs, ensuring tight coupling between form and function
✅ Uses VLM-guided evolution (not random search) to refine designs intelligently
✅ Outperforms human-designed tools by +64.7% in task success across 12 RoboToolBench challenges
✅ Produces better-than-everyday tools for real manipulation tasks—measured in success rate and elegance
It builds on the emerging trend of large-model-guided evolutionary design (like Eureka and AlphaEvolve) and brings it into physical robotics.
It opens the door to general-purpose, automated hardware design, no strong priors needed.
Code & paper: https://t.co/3pXxpQR4E1
Can AI help understand how the brain learns to see the world?
Our latest study, led by @JRaugel from FAIR at @AIatMeta and @ENS_ULM, is now out!
📄 https://t.co/y2Y3GP3bI5
🧵 A thread:
The team at @NVIDIAAIDev really nailed it with ViPE!! This SLAM system generated pose, depth, point clouds, etc, in a few seconds! Now to get the output to work with GEN3C...
If you are interested in running this yourself, it took about 5 minutes to get up and running from scratch. https://t.co/BXMVnyguR8
Our paper on learning controllable 3D robot models from vision is published in Nature! Huge congrats to Lester and the team, @annan__zhang, @BoyuanChen0, Hanna Matusik, Chao Liu, and Daniela Rus!!
Learning joint world models for the environment & the agent is super exciting :)
A video generator must satisfy 3 criteria to be a world model:
1️⃣ Causality: Past affects future, not vice versa.
2️⃣ Persistence: The world shouldn't change because you looked away.
3️⃣ Constant Speed: Simulation shouldn't slow down over time.
We believe SSMs are a natural fit: inherent alignment with the causal structure of time, supporting long-context, and constant per-frame generation speed.