Thanks for sharing our work 🙏
We definitely have a polished teaser video coming…just currently too tired and still in the “diffusion model sampling 1000 steps” phase…
Fast Spatial Memory with Elastic Test-Time Training
@ziqiao_ma, @Xueyang_Y, Haoyu Zhen, @YuncongYY, Joyce Chai, @gan_chuang
tl;dr: fast-weight module keeps anchor parameters and estimates their importance through an online Fisher-style statistic
https://t.co/l2afVEIQ6N
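To make the tl;dr concrete, here is a minimal sketch of what an elastic fast-weight update could look like. It is not the paper's implementation. It assumes an EWC-style rule: the Fisher-style statistic is a running average of squared gradients, and it scales a pull back toward the anchor. The function name `elastic_ttt_step` and the parameters `lr`, `lam`, and `decay` are all hypothetical.

```python
def elastic_ttt_step(w, anchor, fisher, grad, lr=0.1, lam=1.0, decay=0.9):
    """One hypothetical elastic update of the fast weights `w` (lists of floats)."""
    new_w, new_fisher = [], []
    for w_i, a_i, f_i, g_i in zip(w, anchor, fisher, grad):
        # Online Fisher-style statistic (assumed form): running average
        # of squared gradients.
        f_i = decay * f_i + (1.0 - decay) * g_i * g_i
        # Elastic pull toward the anchor, scaled by estimated importance.
        pull = lam * f_i * (w_i - a_i)
        new_w.append(w_i - lr * (g_i + pull))
        new_fisher.append(f_i)
    return new_w, new_fisher


# Toy check with zero task gradient: only the parameter with nonzero
# importance is pulled back toward its anchor.
w, fisher = elastic_ttt_step(
    w=[1.0, 1.0], anchor=[0.0, 0.0], fisher=[1.0, 0.0], grad=[0.0, 0.0]
)
```

In this sketch, a parameter with a large importance estimate stays close to its anchor, while one with near-zero importance is free to adapt at test time.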
A 7-hour podcast with Saining Xie. He has just begun a new journey on world models with Yann LeCun at AMI Labs.
This was his first podcast appearance and his first long-form interview.
A day after the snowfall in February 2026, in Brooklyn, New York, we started recording at 2 p.m. What followed became an unexpected marathon conversation that lasted until the early hours of the morning.
The Chinese title of the interview is “Escaping Silicon Valley.” Yet throughout the conversation, he patiently listed the people who shaped his academic life, repeatedly sketching their personalities in vivid detail: Hou Xiaodi, Kaiming He, Yann LeCun, Fei-Fei Li, and others. These portraits are what give this “escape from Silicon Valley” conversation its human warmth.
By the way, the YouTube version of the interview is below, with Chinese and English subtitles.
And yes, we are using podcasts to model the world 😎
A 7-hour marathon interview with Saining Xie: World Models, AMI Labs, Ya... https://t.co/3rTwdTGkJI via @YouTube
Very much agree. Citing @TairanHe99, plus some additional cases:
Tesla FSD:
• Scale: 4.3M hours of driving data.
• Cost: new data every day for free.
LLMs:
• Scale: ~15 trillion text tokens (roughly 10+ million books' worth of text).
• Cost: Scraped from the public internet over decades for essentially "free", plus millions of daily interactions that provide RLHF data.
Image Generation:
• Scale: 5.85 billion image-text pairs (the size of the open-source LAION-5B dataset, supposedly much more in SOTA industrial models).
• Cost: Scraped continuously from alt-text and images across the public web for free.
Video Generation:
• Scale: Millions of hours of video data uploaded to platforms like YouTube.
• Cost: Scraped for free from the daily uploads of billions of creators worldwide.
Each of them has a scalable data ecosystem.
So what do we have for training a 3D foundation model? By that I mean a large-scale, pre-trained AI model designed to understand, generate, and interact with three-dimensional spatial data across a wide variety of downstream tasks.
The Genie/VAM approach might be promising. Start with a strong base model pretrained on scalable data resources, then fine-tune on smaller expert data. I actually think many are already doing this.
In my recent blog post, I argue that "vision" is only well-defined as part of perception-action loops, and that the conventional view of computer vision (mapping imagery to intermediate representations such as 3D, flow, or segmentation) is about to go away.
https://t.co/aFmE9CHHau
🧐Applying world models to improve real-world policy on challenging manipulation tasks used to be considered out of reach.
😌After sustained effort, we’re now seeing encouraging progress.
🚀Thrilled to introduce RISE: Self-Improving Robot Policy with Compositional World Model
https://t.co/eP1EOmk2X1
https://t.co/mEa2SsZAZ1
RISE is, to our knowledge, the first work to use a world model as an effective learning environment for challenging real-world manipulation, enabling policy improvement on tasks that demand high dynamics, dexterity, and precision.
Incredible teamwork with @lin_kunyang111 @francislee2020 @YueXiangyu @HaoZhao_AIRSUN @smch_1127
Genie 3 🤝 @Waymo
The Waymo World Model generates photorealistic, interactive environments to train autonomous vehicles.
This helps the cars navigate rare, unpredictable events before encountering them in reality. 🧵
We’re excited to introduce the Waymo World Model, a frontier generative model for large-scale, hyper-realistic autonomous driving simulation built on @GoogleDeepMind’s Genie 3.
By simulating the “impossible”, we proactively prepare the Waymo Driver for some of the most rare and complex scenarios—from tornadoes to planes landing on freeways—long before it encounters them in the real world.
https://t.co/EbMut47ZEY
Thrilled to launch Project Genie, an experimental prototype of the world's most advanced world model. Create entire playable worlds to explore in real-time just from a simple text prompt - kind of mindblowing really! Available to Ultra subs in the US for now - have fun exploring!
🔥 Very excited to share that we’re releasing LingBot-World 🌍 @robbyant_brain — an open-source frontier world model!
We’re pushing the limits of:
🔹 High-Fidelity Simulation & Precise Control
🔹 Long-Horizon Consistency & Memory
🔹 Modeling Physical & Game Worlds
The most surprising part? The emergence of sophisticated behaviors that go beyond simple video generation.
👇I’m obsessed with this dragon demo 🐉. It can roll out for 1 min while maintaining crisp visual dynamics and consistent memory!
Introducing RTFM (Real-Time Frame Model): a highly efficient World Model that generates video frames in real time as you interact with it, powered by a single H100 GPU.
RTFM renders persistent, 3D-consistent worlds, both real and imaginary.
Try our demo of RTFM today!
What if you could not only watch a generated video, but explore it too? 🌐
Genie 3 is our groundbreaking world model that creates interactive, playable environments from a single text prompt.
From photorealistic landscapes to fantasy realms, the possibilities are endless. 🧵
"Generate in Parts" just got better with "Split in Parts"! 🥹😇
@hervenivon added a toggle in the @Scenario_gg 3D viewer that instantly splits generated meshes into an exploded view.
See how PartCrafted breaks down your model - no need to download and check in Blender anymore.
Preprint of (not) today: Lin and Lin et al., "MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second" -- https://t.co/Zenwqtq1V4
Feed-forward VGGT + Splats/Motion estimation heads, also trained with rendering & motion estimation losses. Multitask training improves all tasks.