So honored to have the support of my professors and peers, including the omnipotent Long @LongLeRobot, on my first PhD project. 3D space is the bridge between the task and action space, where the guidance from foundation knowledge flows.
VLA policies learn generalist robot behaviors from massive teleoperation datasets, hoping that the right behavior emerges. But they rarely use perception during training or inference: powerful foundation models of 3D geometry, semantics, or human motion are ignored.
@TimSong52005757, @LongLeRobot, and our @GRASPlab team introduce OmniGuide, based on a simple idea: instead of retraining policies, let perception guide action at inference time.
We express diverse guidance sources as attractive and repulsive energy fields in 3D space and inject their gradients into the generative process of VLA policies.
This lets perception modules steer actions without retraining the policy: A modular path toward composable robot intelligence.
Project: https://t.co/5Vqvz1t8TR
Paper: https://t.co/IoYzKrOzKM
🌟Your static 3D world models are now alive and interactable!
🚀Introducing NeuROK, a neural simulation framework that turns any static 3D object into an interactive 4D asset — no per-category physics, no physical annotations for training.
📄 https://t.co/PSAILjHmZb
🧵 1/n
Excited to share that our work NeuralActuator: Neural Actuation Modeling for Robot Dynamics and External Force Perception has been accepted to #RSS2026!
Your robot — even a low-cost one — can feel external forces without torque or tactile sensors.
TL;DR: NeuralActuator is a neural actuator model that jointly predicts 1️⃣ torque, capturing the nonlinear and time-varying current-to-torque relationship of low-cost servos, 2️⃣ external contact forces (and force detection gates) for sensorless force perception, and 3️⃣ motor conditions that indicate each motor’s operating regime.
Here is a fast-forward video clip ⬇️ We are also covering more robots like LeRobot-S101 and Franka Panda.
More details coming soon.
I’m so tired of writing rebuttals to this kind of “lack of novelty” review: “This paper trivially combines A, B, and C, so the algorithmic novelty is limited.”
Technically, most (if not all) robotics papers are convex combinations of existing ideas.
I still deeply appreciate A+B+C papers—especially when they deliver:
- New capabilities: the “trivial combination” unlocks behaviors we simply couldn’t achieve before
- Sensible & organic design: A+B+C is clearly the right composition—not some arbitrary A′+B+C′
- Nontrivial interactions: careful analysis of the dynamics, coupling, or failure modes between A, B, C
- Rehabilitating old ideas: A was dismissed for years, but paired with modern B/C, it suddenly works—and teaches us why
- System-level & "interface" insight: the contribution is not any single piece, but how the pieces talk to each other
- Scaling laws or regimes: identifying when/why A+B+C works (and when it doesn’t)
- Engineering clarity: making something actually work robustly in the real world is not “trivial”
- New problem formulations: sometimes the real novelty is in the reformulation—only under this view does A+B+C make sense.
Maybe worth keeping these in mind when reviewing the next A+B+C paper : )
Excited about this project: we showed how, for challenging tasks, a VLA can "get by with a little help from its friends" 😉 (powerful perception models that infer geometry and more), by constructing 3D guidance fields for diffusion and flow policies. @LongLeRobot @TimSong52005757
It was a lot of fun to see this in person. We see so much online, but there's nothing nearly as convincing as just doing a demo, first try, right in front of someone.
And it's cool work too, addressing a serious shortcoming of current policies.
For this project, we’ve also optimized the speed significantly and shown real-time demos to several visitors to Penn, including @chris_j_paxton and @DJiafei. Here’s a video of the robot finishing a task autonomously in 20 seconds.
Scaling VLAs with more robot data is just not enough.
OmniGuide shows you can fix generalist policies at inference time.
Add guidance fields in 3D space that attract toward goals and repel from obstacles, and steer the policy without retraining.
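A toy sketch of the attract/repel idea, not the OmniGuide implementation: `toy_policy_step` is a hypothetical stand-in for one denoising/flow step of a VLA, the quadratic energies and every constant here are assumptions for illustration, and the "action" is just a 3D point. After each policy step, the sample is nudged down the gradient of the combined energy.

```python
import math

def attractive_grad(x, goal, k=1.0):
    # Gradient of the assumed energy 0.5 * k * ||x - goal||^2.
    return [k * (xi - gi) for xi, gi in zip(x, goal)]

def repulsive_grad(x, obstacle, radius=0.3, k=1.0):
    # Gradient of the assumed energy 0.5 * k * (radius - d)^2 for d < radius,
    # and zero outside the radius (d = distance to the obstacle center).
    diff = [xi - oi for xi, oi in zip(x, obstacle)]
    d = math.sqrt(sum(c * c for c in diff))
    if d >= radius or d == 0.0:
        return [0.0] * len(x)
    scale = -k * (radius - d) / d
    return [scale * c for c in diff]

def toy_policy_step(x, t):
    # Hypothetical placeholder for the frozen policy: drift toward a nominal action.
    nominal = [0.5, 0.0, 0.2]
    return [xi + 0.2 * (ni - xi) for xi, ni in zip(x, nominal)]

def guided_sample(policy_step, x0, goal, obstacle, steps=100, guidance_scale=0.1):
    # Alternate one policy step with one gradient step on the guidance energy.
    x = list(x0)
    for t in range(steps):
        x = policy_step(x, t)
        g_att = attractive_grad(x, goal)
        g_rep = repulsive_grad(x, obstacle)
        x = [xi - guidance_scale * (a + r) for xi, a, r in zip(x, g_att, g_rep)]
    return x

goal = [0.5, 0.3, 0.2]
obstacle = [0.5, -0.05, 0.2]
start = [0.0, 0.0, 0.0]

unguided = guided_sample(toy_policy_step, start, goal, obstacle, guidance_scale=0.0)
guided = guided_sample(toy_policy_step, start, goal, obstacle, guidance_scale=0.1)

print("unguided:", unguided, "distance to goal:", math.dist(unguided, goal))
print("guided:  ", guided, "distance to goal:", math.dist(guided, goal))
```

The policy itself is never modified; only the sample is steered, which is the "without retraining" part. In a real system the energies would come from perception modules rather than hand-placed points.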
Check out my most recent project that I've been working on for more than half a year -- co-led with Tim @TimSong52005757!
We showed a flexible and effective recipe for improving generalist VLAs by leveraging non-robotic foundation models! The key is guidance in 3D!
More details @ https://t.co/aBauZS5TiT
@20Kamio @JiahuiLei1998 @CrossEntropi @LingjieLiu1 @KostasPenn Hi! We compared our method with GSTex on the NeRF synthetic dataset. I checked with the author: their method cannot run on all the scenes of the Mip-NeRF 360 dataset, so those results are not reported in either their paper or ours.
Capture a scene with 2DGS and view it from extremely close or far-away viewpoints, without aliasing and with detailed texture preserved. Check out our new work https://t.co/VP3MasjC7J, led by @TimSong52005757 and @CrossEntropi and advised by me, @LingjieLiu1, and @KostasPenn.
Introducing #ECCV2024 work FiT3D: Improving 2D Feature Representations by 3D-Aware Fine-Tuning.
2D foundation models are awesome - but we live in a 3D world. How can we inject 3D awareness into 2D foundation models? 🤔 In FiT3D, we first lift 2D foundation features (e.g. DINOv2) into a 3D Gaussian representation for each scene. Then we use the rendered 3D-aware features of multiple scenes to finetune the 2D foundation model. We show that semantic features fused into 3D representations can in turn effectively improve 2D foundation models.
💻 Code: https://t.co/2xrkQToNmt
🚀 Project: https://t.co/HIwQlEb7ez
🤗 Demo: https://t.co/JG2JOqVJjR
With @_anurag_das, @FrancisEngelman, @SiyuTang3, and @janericlenssen
#ETHZurich #MPI_INF #GoogleAI
Fasten your seat belts. Michael from @radiancefields and I will host a weekly X Space discussing exciting developments with researchers and creators in the radiance field and GenAI community, starting next Thursday.
Our first guest will be @scannerian1, one of the world's leading experts in capturing 3D scenes from images. We plan to create an interactive community where we can share knowledge and grow together.
We hope you will join and participate to make this space a success. We have some exciting guests lined up. Stay tuned and let's have fun together sharing the latest insights.