Julen Urain

@robotgradient

Robotics Tinkerer. RS @Amazon FAR. Prev: @META (FAIR), @DFKI, @TUDarmstadt

Joined November 2017

1.4K Following

1.4K Followers

287 Posts

robotgradient retweeted

RSS Pioneers @RSSPioneers

about 1 month ago

We are excited to announce the 2026 cohort of RSS Pioneers! This year’s cohort brings together an outstanding group of early-career researchers whose work spans the breadth of robotics. A heartfelt thank you to all the organizers who made this year’s program possible.

Julen Urain

@robotgradient

2 months ago

@abhishekunique7 Impressive!!

255

robotgradient retweeted

Yixuan Wang

@YXWangBot

3 months ago

1/ World models are getting popular in robotics 🤖✨ But there’s a big problem: most are slow and break physical consistency over long horizons. 2/ Today we’re releasing Interactive World Simulator: An action-conditioned world model that supports stable long-horizon interaction. 3/ Key result: ✅ 10+ minutes of interactive prediction ✅ 15 FPS ✅ on a single RTX 4090🔥 4/ Why this matters: it unlocks two critical robotics applications: 🚀 Scalable data generation for policy training 🧪 Faithful policy evaluation 5/ You can play with our world model NOW at https://t.co/SBqVDzYn86. NO git clone, NO pip install, NO python. Just click and play! NOTE ⚠️ ALL videos here are generated purely by our model in pixel space! They are **NOT** from a real camera More details coming 👇 (1/9) #Robotics #AI #MachineLearning #WorldModels #RobotLearning #ImitationLearning

500

327

125K

robotgradient retweeted

Jitendra MALIK

@JitendraMalikCV

4 months ago

Pretraining with dynamics models of motor behavior (aka world models) from video will be much more central to robotics than VLMs. There are multiple choices of representations (e.g. 3D? JEPA?) but we will figure this out by and by. Exciting times!

364

188

83K

robotgradient retweeted

Wenlong Huang @ CVPR

@wenlong_huang

4 months ago

Fully agreed with the sentiment that much of computer vision research (concretely, those not for “human consumption”) should be grounded in robotics. But as a robotics researcher, I think the more nuanced question is: how can we *rethink* these intermediate representations for embodied intelligence rather than discarding them? Why? The challenge, as also pointed out in Vincent’s article, is precisely the lack of perception-action data at scale. This is why intermediate representations IMO are *preferable rather than obsolete* because they open up training from scalable data sources. This can include even the vision/language encoders people love and use in robot learning — it’s hard to imagine training low-level visual representation or high-level language understanding purely from limited robot data. The same goes for intermediate representations at the structure level — world modeling, learning from Internet videos, learning from humans, and simulation — many of which still rely on 3D representations too.

11K

robotgradient retweeted

Jitendra MALIK

@JitendraMalikCV

4 months ago

At the RI seminar at CMU yesterday, I presented a 3 level analysis of robot skills & discussed the pros and cons of teleoperation, simulation, and learning from videos, before presenting our research. Enjoy! https://t.co/wUPe3QlTRk

355

236

102K

Julen Urain

@robotgradient

4 months ago

@artemZholus It reminds me of TD learning or even GAIL. I am not convinced by bootstrapping for generative models.

Julen Urain

@robotgradient

4 months ago

@chris_j_paxton @notmahi Mahi is the mastermind changing the robotics paradigm!

105

robotgradient retweeted

Hao Zhang

@HaoZhang623

4 months ago

As video world models become increasingly powerful, do we still need explicit 3D? A commonly misunderstood point is this: video world models are not “just 2D.” Their ability to maintain multi-view consistency, temporal stability, and realistic interaction necessarily implies that their latent knowledge encodes 3D world structure. Without some notion of 3D, consistency itself would not be possible. The real distinction, therefore, is not whether a model has 3D but whether that 3D exists implicitly or explicitly. Implicit 3D lives inside latent spaces and network weights. It supports generation, but it is difficult to localize, edit, constrain, or reason about. It allows the world to exist, but not to be used. Explicit 3D, in contrast, exists as structure and state: it is addressable, editable, composable, and transferable. Its purpose is not better visual fidelity, but operability: allowing the world to be manipulated, controlled, and executed. From this perspective, video and 3D are not competing paradigms but a layered system: 2D/video is the interface to human perception; 3D is the interface to the physical world. They can reinforce each other, but neither forms a closed loop on its own. In practice, data, not model architecture, sets the upper bound of world models. Explicit 3D may not be the final user-facing representation, but it is likely the most effective pathway toward scalable, high-quality, and controllable data. Through explicit 3D/4D representations, worlds can be constructed systematically: interactions can be programmatically sampled, states and actions can be composed, rendered into images and videos, and fed back to train video world models. Seen this way, 3D is not the destination; it is the starting point for scaling. What truly drives progress forward is never the model itself.
Whether we capture the world or imagine new ones, whether data comes from observation or intent, whether we model what is or what should be, the direction of the world is ultimately determined by human choice and purpose. Models may extend the world, but humans decide where it goes. #Genie3 #worldmodel

258

135

44K

Julen Urain

@robotgradient

4 months ago

There has been a clear trend in recent months of moving from VLA-type approaches to Video Generative Models + Inverse Dynamics Models (VAM). While the most likely reason for this recent growth is the latest improvements in video generative models, I believe this shift is relevant for robotics. While VLAs distill the foundation model's knowledge through latent representations that intertwine semantic and spatial information, VAMs distill this knowledge in a more explicit way, representing it spatially. I believe this spatial grounding of VAMs might lead to much larger generalization capabilities than VLAs, and I am optimistic about even more 3D spatially grounded foundation models in the direction of the recent work by @wenlong_huang https://t.co/PAsihtci7Y

141

121

20K

Julen Urain

@robotgradient

4 months ago

@junjungoal Very cool! Happy to see the value-function-based approach work that well! A very refreshing alternative to end-to-end generative model approaches :)

116

Julen Urain

@robotgradient

4 months ago

@DrJimFan Dwarf Fortress is the perfect fit for this 🥹🥹

Julen Urain

@robotgradient

4 months ago

While I really liked the article, it feels to me that this physical commonsense can be better captured by predicting next observations (i.e. world models) and planning with them, rather than by training a policy to predict the next action (i.e. behavioral cloning).

Andy Zeng

@andyzengineer

4 months ago

https://t.co/6a1IWWbhoo

457

464

83K

510

Julen Urain

@robotgradient

4 months ago

@jparkerholder The spatial-temporal consistency looks superb! If solid, the implications for robotics are huge! Imagine generating training environments on-the-fly from natural language. Excited to see how this evolves toward embodied agent training.

Julen Urain

@robotgradient

4 months ago

@drfeifei 100% on board with 3D/4D world models! Generative 3D environments could unlock much broader domain randomization and edge case coverage. Still, I am curious how the physics fidelity compares to hand-crafted sims for contact-rich manipulation tasks.

150

Julen Urain

@robotgradient

5 months ago

@wenlong_huang @Stanford @nvidia Woow! This is soo cool! Congrats! 3D world Models are 🔥

331

robotgradient retweeted

Irmak Guzey @irmakkguzey

5 months ago

We just released AINA, a framework for learning robot policies from Aria 2 demos, and are now open-sourcing the code: https://t.co/HSHrtUrt11. It includes: ✅ Aria 2 data processing into 3D observations as shown ✅ Training of point-based policies ✅ Calibration. Give it a try!

141

28K

Julen Urain

@robotgradient

7 months ago

This was very challenging and very cool to see evolve! I personally was not sure if it would work, but @irmakkguzey pushed so hard to show it does. Learning dexterous robot policies with only human video data, using the egocentric view from Aria 2 glasses, chill and easy 😁

Irmak Guzey @irmakkguzey

7 months ago

Dexterous manipulation by directly observing humans - a dream in AI for decades - is hard due to visual and embodiment gaps. With simple yet powerful hardware - Aria 2 glasses 👓 - and our new work AINA 🪞, we are now one significant step closer to achieving this dream.

153

41K

897

robotgradient retweeted

Bingyi Kang

@bingyikang

7 months ago

After a year of team work, we're thrilled to introduce Depth Anything 3 (DA3)! 🚀 Aiming for human-like spatial perception, DA3 extends monocular depth estimation to any-view scenarios, including single images, multi-view images, and video. In pursuit of minimal modeling, DA3 reveals two key insights: 💎 A plain transformer (e.g., vanilla DINO) is enough. No specialized architecture. ✨ A single depth-ray representation is enough. No complex 3D tasks. Three series of models have been released: the main DA3 series, a monocular metric estimation series, and a monocular depth estimation series. The core team members, aside from me: @HaotongLin, Sili Chen, Jun Hao Liew, @donydchen. 👇(1/n) #DepthAnything3

493

514K

Julen Urain

@robotgradient

7 months ago

@Ed__Johns Super impressive, many congratulations 😊

414
