Last week I had the opportunity to visit NVIDIA and Netflix to share our work in a talk titled “𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝘃𝗲 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴 𝗕𝗲𝘆𝗼𝗻𝗱 𝗦𝗰𝗮𝗹𝗲: 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆, 𝗚𝘂𝗶𝗱𝗮𝗻𝗰𝗲, 𝗮𝗻𝗱 𝗠𝘂𝗹𝘁𝗶𝗺𝗼𝗱𝗮𝗹𝗶𝘁𝘆”.
Check out the slides:
https://t.co/6G0ai0rbio
We’re open-sourcing the models, weights and training code for PERSIST, our 1.7B parameter 3D world model!
By modelling a dynamic 3D world state instead of relying on pixel histories, PERSIST generates experiences that remain spatially and temporally coherent over thousands of steps.
Tom, @kaixin20578389 and I will present PERSIST at @icmlconf in Seoul next month. See you there!
Links in thread⬇️
#WorldModels #ICML2026
Based on Marc Andreessen's & a16z's views on what built Silicon Valley, the half dozen key things are typically:
1. Elite research universities producing talent & breakthroughs (Stanford model).
2. Deep pools of risk-tolerant venture capital funding bold ideas.
3. Culture of ambition, rapid iteration & tolerance for failure.
4. Open immigration attracting top global talent.
5. Light, predictable regulation enabling fast experimentation.
6. Strong rule of law, property rights & contracts.
Culture & institutional shifts are usually the hardest for governments to implement.
Most robotics RL papers are just imitation learning in disguise. The "human expert" transfers the task through extensive reward shaping, curricula, initialization strategies, environment design, and various tricks. You are providing demonstrations--just indirectly.
A reward function is just a demonstration written in a different language.
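A toy sketch of that point, under loud assumptions: a hypothetical 2-D gridworld where the designer "shapes" the reward with hand-picked waypoints. Every name here (`shaped_reward`, `greedy_rollout`, the waypoints) is made up for illustration and is not from any paper. An agent that greedily follows the shaped reward simply replays the designer's path, which is the demonstration in another form.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def shaped_reward(state: Point, waypoints: List[Point], next_idx: int) -> float:
    """Negative Manhattan distance to the next designer-chosen waypoint."""
    wx, wy = waypoints[next_idx]
    return -float(abs(state[0] - wx) + abs(state[1] - wy))


def greedy_rollout(start: Point, waypoints: List[Point], max_steps: int = 100) -> List[Point]:
    """Follow the shaped reward greedily and record the waypoints reached, in order."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    state, idx, visited = start, 0, []
    for _ in range(max_steps):
        if idx == len(waypoints):
            break  # every waypoint has been reached
        # Take the single step that maximizes the shaped reward.
        state = max(
            ((state[0] + dx, state[1] + dy) for dx, dy in moves),
            key=lambda s: shaped_reward(s, waypoints, idx),
        )
        if state == waypoints[idx]:
            visited.append(state)
            idx += 1
    return visited


# The "reward designer" picks the path; the agent just reproduces it.
designer_path = [(2, 0), (2, 3), (0, 3)]
print(greedy_rollout((0, 0), designer_path))
```

The agent never sees a demonstration, yet its trajectory is fully determined by the waypoints the human wrote into the reward.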
EDIT MOTION IN VIDEOS!!! Quit prompting and start directing
I've been shouting for YEARS about 3D as the control layer. Here it is, signs of life of our Universal Video Editor!!!
The workflow is: take your video, capture with comic 4, edit with the motion editor, re-render with your favorite video-to-video model (e.g., @runwayml and @GeminiApp have good ones)
Here we have a fashion clip where we wanted our actress to high step and show a bit more pizzazz - but shoot day is over, and traditionally it would cost many thousands to reshoot.
Watch the video to see how it works instead in Cartwheel.
cc @OfficialLoganK @c_valenzuelab 👀
8/ With that, we reframed multimodal generation as structured text/code generation. Diffusion just renders pixels. Planning, logic, reasoning all live in the LLM — so training looks like normal LLM training, and inherits all benefits of it: data + model scaling, reasoning, RL, tool use.
@Yuchenj_UW Google’s biggest strength is also its biggest risk. The sprawl buys diversification and a compounding ecosystem, but the cost is focus, execution, and product coherence.
Going to @CVPR? Join our tutorial on “Accelerated Diffusion Models: From Theory to Interactive World Models”!
Learn how to make diffusion and flow models fast enough for real-time applications. Our practice-oriented sessions are designed to bridge the gap between theory and real-world deployment, supported by our open-source NVIDIA FastGen library.
Topics:
🔹 General acceleration paradigms (@ArashVahdat)
🔹 Step-Distillation (@julberner)
🔹 Interactive World Models (@wn8_nie)
🔗 https://t.co/Iu5ncUZZtc
📅 June 3 | 9 AM - 12 PM MDT 📍 Room 201
#CVPR2026
Most researchers agree that autoregression is best when memory bandwidth is cheap and diffusion is best when FLOPS are cheap. They also admit the future of compute is all FLOPS because memory scaling is hard and scaling FLOPS is easy. So why not go all in on diffusion????
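A back-of-envelope sketch of the trade-off, not a benchmark. Assumptions (all mine): a dense model costing about 2 FLOPs per parameter per token, 2-byte weights, weights re-read from memory on every forward pass, batch size 1, and KV cache, activations and attention cost ignored. The function names and the example numbers are hypothetical.

```python
def autoregressive_cost(params: int, seq_len: int, bytes_per_param: int = 2):
    """One forward pass per token: weights are re-read seq_len times."""
    flops = 2 * params * seq_len
    bytes_read = params * bytes_per_param * seq_len
    return flops, bytes_read


def diffusion_cost(params: int, seq_len: int, steps: int, bytes_per_param: int = 2):
    """One forward pass per denoising step, each processing all tokens at once."""
    flops = 2 * params * seq_len * steps
    bytes_read = params * bytes_per_param * steps
    return flops, bytes_read


def arithmetic_intensity(flops: int, bytes_read: int) -> float:
    """FLOPs performed per byte of weights moved."""
    return flops / bytes_read


params, seq_len, steps = 1_000_000_000, 1024, 32  # illustrative values only
ar = autoregressive_cost(params, seq_len)
diff = diffusion_cost(params, seq_len, steps)
print("AR intensity:", arithmetic_intensity(*ar))
print("Diffusion intensity:", arithmetic_intensity(*diff))
print("Diffusion / AR FLOPs:", diff[0] / ar[0])
print("AR / diffusion bytes read:", ar[1] / diff[1])
```

Under these assumptions diffusion spends `steps` times more FLOPs but reads the weights `seq_len / steps` times less often, so which one wins depends on whether compute or memory bandwidth is the scarcer resource.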
I literally can't log in to my website jesus christ. When I reset my password I get this error:
Let’s try that again. Looks like you're using that ID to sign in to more than one AT&T service with different passwords. This screen is for signing in to myAT&T. Try using the password for that service here.
My first blog post in over a year is a deep dive on flow maps🗺️, or how to learn the integral of a diffusion model to enable faster sampling and several other cool tricks.
It's the longest one yet👀 Let me know what you think!
https://t.co/O8bBGZ9qjC
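For anyone new to the idea, here is a toy sketch of what "the integral" buys you. It is not the method from the post: the ODE dx/dt = -x is an assumption chosen because its flow map has a closed form, x_t = x_s * exp(-(t - s)), which stands in for a learned network. The names `velocity`, `euler_sample` and `flow_map` are hypothetical.

```python
import math


def velocity(x: float, t: float) -> float:
    """Toy velocity field dx/dt = -x (stand-in for a learned diffusion/flow model)."""
    return -x


def euler_sample(x: float, s: float, t: float, n_steps: int) -> float:
    """Integrate the ODE from time s to time t with many small Euler steps."""
    h = (t - s) / n_steps
    for i in range(n_steps):
        x = x + h * velocity(x, s + i * h)
    return x


def flow_map(x: float, s: float, t: float) -> float:
    """Jump straight from time s to time t using the integrated solution."""
    return x * math.exp(-(t - s))


x0 = 1.0
print("1000 Euler steps:", euler_sample(x0, 0.0, 1.0, 1000))
print("one flow-map jump:", flow_map(x0, 0.0, 1.0))
# Flow maps compose: jumping 0 -> 0.5 -> 1 matches jumping 0 -> 1.
print("two half jumps:", flow_map(flow_map(x0, 0.0, 0.5), 0.5, 1.0))
```

The sampler needs many velocity evaluations to get close to the answer the flow map returns in a single call, and the composition property is what lets you trade one big jump for a few smaller ones.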
Roblox is going full photoreal soon.
the game engine handles the logic (physics, multiplayer sync), while an AI video model handles the looks (photorealism, lighting).
cool, it gives you movie-quality graphics on a basic phone without needing a massive dev budget.
Bet we’ll have fully interactive AI worlds by late 2026
https://t.co/12vhtIHLrL
What if you could reshoot a video after it has been shot? Move the camera, or even change the scene itself? Announcing Vista4D 🎥, a video model that reshoots high-quality videos from new camera trajectories, plus cool things like pasting new objects into your videos! 🧵 1/7
Feynman: "We know a lot more than we can prove" 🤔
Deng: "For me personally, an important reason is that I don't particularly like pure theory. I feel that in this world, the truths that can be rigorously proven are actually very limited, but the truths you can feel are very numerous. Many principles are more like a feeling or an energy, and it's hard to express them completely in mathematical form. But in mathematics and theoretical computer science, your conclusion must be rigorously provable, written on paper, before people will accept it. In AI, as long as you observe certain phenomena through experiments and intuitively feel they are correct, even if you can't fully convince everyone, you can gradually build your own system of understanding. This approach to understanding the world through intuition and experimentation appeals to me greatly. And this approach lets you discover patterns much faster than disciplines that rely on rigorous proofs."