@iyanmoonyang I think vision is mainly used for trajectory planning. Your brain can still adjust your movements using only sensory feedback from the ground
Robot learning is moving beyond policies built for one robot, one scene, one task.
At MIT, we’re exploring a different path: turning video world models into embodiment-agnostic robot policies.
Introducing VERA: a 14B video-to-action system that controls robots across embodiments, skills, and environments.
From zero-shot pick-and-place on a real Panda arm to contact-rich cube reorientation with a 16-DoF robotic hand.
Different robots. Different environments. Different tasks.
Same video planner. Same weights.
We’re open-sourcing everything so you can fine-tune VERA for your own robot setup too. Deep dive in the thread:
🔗 https://t.co/hzuYZ2m5lS
🧵 (1/7)
We recently wrote a short blog on the mathematical essence behind three common World Model paradigms in Robot Learning.
It looks at Future-conditioned / IDM-style, Single-backbone, and MoT-style models from the lens of probabilistic modeling and structured optimization.
Working in robotics right now is what I imagine working with language models felt like in 2023. Everyone throwing things at the wall to see what sticks
Pixel prediction (Cosmos), action prediction (VLA), reward prediction (TD-MPC), and representation prediction (JEPA). Different paths for the same problem
The recipe that won in language was self-supervised pretraining at internet scale, then a light finetune on top. Only representation prediction runs that playbook. It learns from action-free video data, so you can pretrain on YouTube and egocentric data, then add a control layer. Everything else needs action-labeled data that doesn't scale
As an RL maximalist, I used to hate LeCun's cake. Turns out he was right all along which is how I ended up a JEPA truther
Can a model trained purely on video — with zero action labels — match VLAs trained on massive action-labeled datasets?
Meet µ0 (Mew-Zero): a world model that learns a "physical language" for robots.
Here's why we're excited 🧵
Is it possible to use the same model to do this and do laundry, for example? One of the main problems, I think, is how we can achieve really high-frequency policies
Researchers from The University of Hong Kong and Kinetix AI have developed a humanoid robot system called SMASH that can play real table tennis using only onboard cameras.
The robot tracks the ball in real time without using external cameras or motion-capture systems.
It can perform powerful smashes, quick side movements, and low crouching saves using full-body coordination.
We are super excited to share with you our initial release of Lucky Engine. We are building a robotics engine from the ground up to be what we always wished we could find in a simulator
💥Introducing FACTR 2, learning external force sensing on commodity robot arms without needing dedicated sensors.
We show that learned force signals enable force-feedback teleop on low-cost arms and improve BC policies.
FACTR 2 consists of:
1. Neural External Torque (NEXT): learns external forces without needing dedicated force sensors.
2. Force-Informed Re-Sampling Training (FIRST): uses the learned force signal to identify task-critical regions and upsample them during training.
w/ @StevenOh_ @_tonytao_
🧵(1/N)
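A rough sketch of the idea behind FIRST as described above: use a per-timestep force estimate to decide which samples to draw more often during training. This is not the FACTR 2 implementation. The function names, the threshold rule, and the `boost` factor are all hypothetical, chosen only to illustrate force-weighted upsampling.

```python
import random

def resampling_weights(force_norms, threshold, boost):
    """Hypothetical rule: timesteps whose estimated external-force
    magnitude exceeds `threshold` get weight `boost`; others get 1.0."""
    return [boost if f > threshold else 1.0 for f in force_norms]

def sample_indices(weights, k, seed=0):
    """Draw k training indices with replacement, proportional to weights."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=k)

# Toy trajectory: estimated force magnitudes per timestep (made-up values).
force_norms = [0.1, 0.2, 3.5, 4.0, 0.3]
weights = resampling_weights(force_norms, threshold=1.0, boost=4.0)
batch = sample_indices(weights, k=1000)
```

With these made-up numbers, the two high-force timesteps carry 8 of the 11 total weight units, so they should make up most of the sampled batch.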
We ran 300 fully autonomous live demonstrations over 3 days at ICRA 2026.
The task: a humanoid navigating stairs, picking up a box from the floor and placing it on a table. Simple to describe, but hard to execute reliably when your robot is making every decision on its own at a conference with new surroundings and a crowd watching live.
This is just a glimpse. We've been pushing our stack much further and we'll be sharing more very soon.
More information in the thread.
#HumanoidRobots #ICRA2026 #Flexion
Excited to share our recent work on whole-body humanoid locomotion for challenging terrain traversal!
Diffusion-based planner + RL WBC = general purpose locomotion controller
Led by @ctki49 @mxu_cg @KehanWen170077 at @leggedrobotics and @xbpeng4.