Check out my most recent project that I've been working on for more than half a year -- co-led with Tim @TimSong52005757!
We showed a flexible and effective recipe for improving generalist VLAs by leveraging non-robotic foundation models! The key is guidance in 3D!
More details @ https://t.co/aBauZS5TiT
VLA policies learn generalist robot behaviors from massive teleoperation datasets, hoping that the right behavior emerges. But they rarely use perception during training or inference: powerful foundation models of 3D geometry, semantics, or human motion are ignored.
@TimSong52005757, @LongLeRobot, and our @GRASPlab team introduce OmniGuide, based on a simple idea: Instead of retraining policies, let perception guide action at inference time.
We express diverse guidance sources as attractive and repulsive energy fields in 3D space and inject their gradients into the generative process of VLA policies.
This lets perception modules steer actions without retraining the policy: A modular path toward composable robot intelligence.
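To make the idea concrete, here is a minimal toy sketch of energy-field guidance, not the OmniGuide implementation. It assumes the sample being generated is a single 3D end-effector position, uses a quadratic attractive energy and a short-range quadratic repulsive energy, and subtracts their gradient from the velocity of a flow-style sampler at each Euler step. All names (`attractive_grad`, `repulsive_grad`, `guided_sample`, `scale`) and the toy base velocity are hypothetical; a real VLA produces action chunks, and mapping those to 3D points would need extra machinery such as forward kinematics.

```python
import numpy as np

def attractive_grad(p, goal, k_att=1.0):
    # Gradient of E_att = 0.5 * k_att * ||p - goal||^2
    return k_att * (p - goal)

def repulsive_grad(p, obstacle, radius=0.2, k_rep=1.0):
    # Gradient of E_rep = 0.5 * k_rep * (radius - d)^2 for d < radius, else 0,
    # where d is the distance from p to the obstacle center.
    diff = p - obstacle
    d = np.linalg.norm(diff)
    if d >= radius or d == 0.0:
        return np.zeros_like(p)
    return -k_rep * (radius - d) * diff / d

def guided_sample(velocity_fn, x0, goal, obstacle, steps=50, scale=1.0):
    # Euler integration of the base velocity field from t=0 to t=1,
    # with the energy gradient subtracted at every step (assumed guidance rule).
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        grad = attractive_grad(x, goal) + repulsive_grad(x, obstacle)
        x = x + dt * (velocity_fn(x, t) - scale * grad)
    return x

if __name__ == "__main__":
    # Toy stand-in for a frozen policy: it always flows toward a nominal point.
    nominal = np.array([0.5, 0.0, 0.2])
    base_velocity = lambda x, t: nominal - x
    goal = np.array([0.5, 0.3, 0.2])       # e.g. a target from a perception model
    obstacle = np.array([2.0, 2.0, 2.0])   # far away, so repulsion is inactive here
    print(guided_sample(base_velocity, np.zeros(3), goal, obstacle, scale=5.0))
```

With `scale=0.0` the sampler follows the base policy unchanged. Raising `scale` pulls the final sample toward the goal without touching the policy itself, which is the modularity the thread describes.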
Project: https://t.co/5Vqvz1t8TR
Paper: https://t.co/IoYzKrOzKM
Personal news: I’m joining @GoogleDeepMind NYC this summer as a Student Researcher on the Robotics team!
4 years ago, I left Google to start my PhD. Now I get to come back and work on humanoid robot learning.
Full loop closure, as the SLAM nerds would say. NYC folks—say hi!
Really excited to release mjviser, a web-based MuJoCo viewer, powered by Viser. It has almost all the features of the native MuJoCo viewer, but runs in your browser. Load and simulate any MuJoCo model with a single uv command 👇
uvx mjviser <model.xml>
Excited about this project: we showed how, for challenging tasks, a VLA can "get by with a little help from its friends"😉 --- powerful perception models that infer geometry and more, by constructing 3D guidance fields for diffusion and flow policies. @LongLeRobot @TimSong52005757
It was a lot of fun to see this in person. We see so much online, but there's nothing nearly as convincing as just doing a demo, first try, right in front of someone.
And it's cool work too, addressing a serious shortcoming of current policies.
For this project, we’ve also optimized the speed significantly and shown real-time demos to several visitors to Penn, including @chris_j_paxton and @DJiafei. Here’s a video of the robot autonomously finishing a task in 20 seconds.
Scaling VLAs with more robot data is just not enough.
OmniGuide shows you can fix generalist policies at inference time.
Add guidance fields in 3D space that attract toward goals and repel from obstacles, and steer the policy without retraining.
So honored to have the support of my professors and peers, including the omnipotent Long @LongLeRobot, on my first PhD project. 3D space is the bridge between the task and action space, where the guidance from foundation knowledge flows.
Here is our recent project to enhance generalist policies with auxiliary information guidance!
Steering your base policy, so it’s more performant and effective
TAMP vs End2End, which one is better?
Check out our latest research ablating these two on a tabletop pick-and-place setup. It turns out that SOTA foundation models provide a very good prior that solves this task family.
Please enjoy the download-and-play TipTop from MIT folks!
Introducing Tether 🪢, a fun little idea to scale data by having our robot “play” in the real world for over 24 hours, throughout the day and overnight—improving policies from zero to mastery with minimal supervision!
But play is messy, with out-of-distribution scenarios that are hard to anticipate. To perform autonomous functional play in the real world, from just a handful of demos, we propose a highly robust few-shot imitation method that warps demo trajectories using visual correspondences. Then, continuously running it within a multi-task VLM-guided cycle, we generate a data stream that produces 1000+ expert-level demos. This generated data is finally funneled downstream to train imitation learning policies, which improve from zero to near-perfect success rates.
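As a rough illustration of what "warping a demo trajectory using correspondences" can look like, here is a toy sketch, not Tether's actual method. It assumes the correspondences are already available as matched 3D keypoints between the demo scene and the new scene, fits a single rigid transform with the Kabsch algorithm, and applies it to the demo waypoints. The function names `fit_rigid_transform` and `warp_trajectory` are hypothetical, and the real system may use a different warp entirely.

```python
import numpy as np

def fit_rigid_transform(src, dst):
    # Kabsch algorithm: find rotation R and translation t minimizing
    # sum_i ||R @ src[i] + t - dst[i]||^2 for matched (N, 3) point sets.
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def warp_trajectory(waypoints, R, t):
    # Apply the rigid transform to every (x, y, z) waypoint of the demo.
    return waypoints @ R.T + t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_keypoints = rng.normal(size=(8, 3))     # keypoints seen in the demo
    angle = np.deg2rad(30.0)
    R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                       [np.sin(angle),  np.cos(angle), 0.0],
                       [0.0,            0.0,           1.0]])
    t_true = np.array([0.1, -0.05, 0.0])
    new_keypoints = demo_keypoints @ R_true.T + t_true   # same object, moved
    R, t = fit_rigid_transform(demo_keypoints, new_keypoints)
    demo_traj = np.linspace([0.0, 0.0, 0.3], [0.2, 0.1, 0.05], num=5)
    print(warp_trajectory(demo_traj, R, t))
```

A single rigid transform is the simplest possible choice. It only makes sense when the object moved as a whole and the correspondences are reasonably clean.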
We’ll be presenting Tether at #ICLR2026 in just a few weeks! But before that, deep dive with me… 🧵
Why do generalist robotic models fail when a cup is moved just two inches to the left? It’s not a lack of motor skill, it’s an alignment problem. Today, we introduce VLS: Vision-Language Steering of Pretrained Robot Policies, a training-free framework that guides robot behavior in real time.
Check out the project: https://t.co/9xE68JPLUv
👇🧵 (Watch till the end: VLS runs uncut, steering pretrained policies across long-horizon tasks.)
Happy to announce our NeurIPS'25 paper, real-world RL of active perception behaviors!
I am pretty excited about this project - I learned that real-world robot RL is actually quite straightforward. Details below: