New blog post: The Forgetting Wall in Video and World Models
Long-horizon video generation is not just limited by compute. It is limited by how much of its own past the model can afford to remember.
I wrote about why long videos drift, why KV cache becomes the memory bottleneck, and why compression is a key direction for future video/world models.
https://t.co/ORp0ma4P2m
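A rough back-of-the-envelope sketch of why the KV cache becomes the memory bottleneck for long videos. Every number here (tokens per frame, layer count, head count, head dimension) is a hypothetical configuration chosen for illustration and is not taken from the blog post; the only thing assumed is the standard layout of one key tensor and one value tensor per layer.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical model and tokenizer settings (assumptions, not from the post).
TOKENS_PER_FRAME = 1024
LAYERS, KV_HEADS, HEAD_DIM = 32, 16, 128


def cache_gib_for_frames(num_frames):
    """KV cache size in GiB when every past frame is kept, at 16-bit precision."""
    tokens = num_frames * TOKENS_PER_FRAME
    return kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30


if __name__ == "__main__":
    for frames in (16, 160, 1600):
        print(f"{frames:5d} frames -> {cache_gib_for_frames(frames):8.1f} GiB")
```

Under these assumed settings the cache grows linearly with the number of remembered frames, so a 100× longer history costs 100× the memory, which is the pressure that motivates compressing the past.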
Diffusion is differentiable. LLMs aren't.
So why is the diffusion community copying RL methods (GRPO etc.) from LLMs?
The native post-training approach for diffusion is direct gradient-based optimization, with methods such as ReFL and LeapAlign. Paper: https://t.co/uoy9mCGJSv
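A toy sketch of what "differentiable" buys you here: when the sampler is a differentiable function of its parameters, the reward gradient can be pushed straight through it with the chain rule. This is not the ReFL or LeapAlign algorithm. The one-parameter "denoiser", the quadratic reward, and all hyperparameters are made up for illustration.

```python
import random


def sample(theta, z):
    """Toy one-step 'denoiser': a differentiable map from noise z to a sample."""
    return theta * z


def reward(x, target=1.0):
    """Toy differentiable reward, highest when the sample equals the target."""
    return -(x - target) ** 2


def reward_grad_theta(theta, z, target=1.0):
    """Gradient of the reward with respect to theta, through the sampler."""
    x = sample(theta, z)
    # Chain rule: dr/dtheta = dr/dx * dx/dtheta = -2 * (x - target) * z
    return -2.0 * (x - target) * z


def finetune(theta=0.0, steps=2000, lr=0.01, seed=0):
    """Gradient ascent on the reward, one noise draw per step."""
    rng = random.Random(seed)
    for _ in range(steps):
        z = rng.uniform(0.5, 1.5)
        theta += lr * reward_grad_theta(theta, z)
    return theta


if __name__ == "__main__":
    print(f"theta after fine-tuning: {finetune():.3f}")
```

In this sketch the update uses the gradient of the reward itself, not just its scalar value, which is the part that only works when the sampling path is differentiable.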
Introducing MilliVid, our new method for long-context video generation! MilliVid creates videos that are consistent over long time spans, without using retrieval heuristics or 3D maps! (1/n)
https://t.co/evmf5dL5Sg
Proud of what our amazing team has accomplished. We spent the past few months pursuing one bet: that layouts are the right intermediate representation for generation and editing. [1/n]
https://t.co/EKRyGOqcqJ
I'm excited to announce that the Morpheus AI team is joining Roblox!
Over the past two years, I’ve focused on developing the foundational architectures behind modern video world models, including Self Forcing and AR-DiT. This work unlocked something unprecedented: the ability to move beyond offline, pre-rendered AI video generation and instead simulate interactive worlds in real time. Realizing the massive potential of this technology is what drove me to found Morpheus in August 2025. In the months since, our incredible team has pushed those boundaries further than we ever thought possible.
We've always believed video world models will reshape how games are created. Roblox Reality is an ambitious bet on that exact future, and it lines up perfectly with what we set out to do: bridging the gap between deterministic game engines and generative world models. Joining Roblox means our technology will help power experiences that reach millions of players every day.
To our team, to @a16z and other investors, and to the advisors, partners, and supporters who believed in this from the very beginning — thank you.
We're just getting started. Excited to build this at scale.
🚀 SANA-Streaming: Hybrid Diffusion Transformer + System Co-design = Real-Time Streaming Video Editing 💥
Key Features 🌟
🧠 Hybrid DiT Architecture -> Fixed VRAM and complexity.
🔄 Cycle-Reverse Regularization -> Enforces long-range consistency without paired long video data
🛠️ Efficient System Co-design -> Fused GDN kernels + Mixed-Precision Quantization highly optimized for NVIDIA Blackwell.
Numbers 📊
⚡ 58 DiT FPS and 24 end-to-end FPS for real-time 1280×704 resolution editing on a single consumer RTX 5090 GPU.
📦 Flat VRAM: Uses just 5.56 GB of constant memory regardless of video length, completely avoiding OOM errors.
🔥 Up to 100× higher inference throughput than prior SOTA offline editors.
🎬 Project page: https://t.co/J4yLjLNSyf
📄 Paper: https://t.co/MrRuh3veVk
📸latest in our cambrian series: cambrian-p, p for pose.
i think pose is probably the minimal sufficient 3d signal (and it’s easy to get!) that we need for robust video multimodal models -- jointly modeling frames and pose turns image sequences into a globally grounded structure.
Our vision for multiplayer photorealism is a hybrid architecture merging 3D cloud gaming with AI video upsampling on the edge. The video model and our cloud 3D engine can potentially drive each other bi-directionally, acting as both an upsampler as well as a real-time dreamer, generating parts of the 3D scene in real time.
You can check out an early playable demo here from our Roblox Labs Team. Our video world model uses the Roblox Engine as a programmable harness, layering structured logic, state tracking, and multiplayer participation onto the generative power of action-conditioned world models.
Introducing ✨RigidFormer: Learning Rigid Dynamics with Transformers - our attempt to scale learning-based physical dynamics with Transformers.
It is a mesh-free, object-centric Transformer for multi-object rigid-body contact dynamics from point clouds.
Learning physics with purely neural simulators, without relying on traditional physics engines, is an important and widely studied problem. Prior SOTA methods often use graph neural networks for accuracy and generalization, but still struggle with efficient, high-fidelity simulation at scale.
RigidFormer uses only point inputs, matches or outperforms mesh-based baselines on standard benchmarks, runs much faster, generalizes across point resolutions and datasets, and scales to 200+ objects. We also show a preliminary extension to command-conditioned articulated bodies by treating body parts as interacting object-level components.
RigidFormer is mesh-free: it does not require mesh connectivity, SDFs, or vertex-level message passing, making it well-suited for point-cloud observations and scalable simulation.
This architecture can also be adapted to learn soft-body dynamics by replacing the rigid-body module (differentiable Kabsch alignment).
🎬See our video for more details.
Many thanks to my amazing collaborators: Minghao Guo @GuoMh14, Haixu Wu @Haixu_Wu_1998, Doug Roble, Tuur Stuyck @TuurStuyck, and Wojciech Matusik @wojmatusik.
Project page: https://t.co/6TBaRPVEYo
Paper: https://t.co/3OQUSJSND3
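For readers unfamiliar with the Kabsch alignment mentioned in the RigidFormer thread: it finds the least-squares rigid transform between two corresponding point sets. The sketch below is the planar (2D) special case, where the optimal rotation angle has a closed form and no SVD is needed. It is an illustration only, not RigidFormer's module, and the function names are hypothetical.

```python
import math


def rigid_align_2d(source, target):
    """Least-squares rotation angle and translation mapping source onto target."""
    n = len(source)
    src_cx = sum(p[0] for p in source) / n
    src_cy = sum(p[1] for p in source) / n
    tgt_cx = sum(p[0] for p in target) / n
    tgt_cy = sum(p[1] for p in target) / n

    # Accumulate dot and cross products of the centered point pairs.
    dot = cross = 0.0
    for (sx, sy), (tx, ty) in zip(source, target):
        ax, ay = sx - src_cx, sy - src_cy
        bx, by = tx - tgt_cx, ty - tgt_cy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx

    # The angle maximizing cos(a) * dot + sin(a) * cross.
    angle = math.atan2(cross, dot)
    c, s = math.cos(angle), math.sin(angle)
    # Translation that moves the rotated source centroid onto the target centroid.
    shift = (tgt_cx - (c * src_cx - s * src_cy), tgt_cy - (s * src_cx + c * src_cy))
    return angle, shift


def apply_rigid_2d(points, angle, shift):
    """Rotate points by angle, then translate by shift."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1]) for x, y in points]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (-0.5, 1.0)]
    moved = apply_rigid_2d(pts, 0.3, (2.0, -1.0))
    print(rigid_align_2d(pts, moved))
```

Because the output is always an exact rotation plus translation, projecting predicted point positions through a step like this keeps each object rigid by construction. The closed form is also smooth in its inputs wherever `dot` and `cross` are not both zero.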
Mean Mode Screaming
A 1000-layer Diffusion Transformer trained with Mean-Variance Split Residuals, which prevent the sudden mean-dominated collapse that plagues ultra-deep generative models.
Fun interactive science app ideas | Part 3
Played around with generating 3D biological structures and made an app to explore them interactively
UI Design
GPT Images 2
Code
Gemini 3.1 Pro
More demos ↓
Physical AI and robotics need actionable outputs like 3D coordinates, not bullet points or nice paragraphs.
So decided to experiment by combining a VLM with Monocular Depth Estimation, essentially projecting 2D reasoning into 3D.
Worked pretty well, figured I'd share, check repo👇
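A minimal sketch of the 2D-to-3D step described above, assuming a pinhole camera with known intrinsics and a depth map that stores metric distance along the optical axis. This is not the code from the linked repo. The function names, the intrinsics, and the idea that the VLM returns a single pixel location are all assumptions for illustration. Many monocular depth estimators output relative depth, which would need rescaling before this gives real-world coordinates.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with z-depth `depth` to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_pixel(depth_map, u, v, intrinsics):
    """Look up the depth at an integer pixel and back-project it.

    depth_map is indexed as depth_map[row][col], i.e. depth_map[v][u].
    intrinsics is a (fx, fy, cx, cy) tuple in pixels.
    """
    fx, fy, cx, cy = intrinsics
    return backproject(u, v, depth_map[v][u], fx, fy, cx, cy)


if __name__ == "__main__":
    # Hypothetical 4x4 depth map, every pixel 2 meters away.
    depth_map = [[2.0] * 4 for _ in range(4)]
    intrinsics = (500.0, 500.0, 2.0, 2.0)
    # Suppose the VLM pointed at pixel (u=3, v=1).
    print(lift_pixel(depth_map, 3, 1, intrinsics))
```

A pixel at the principal point maps onto the optical axis, and points farther from the image center move outward in proportion to their depth.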
AI just generated 20 floor plans.
Not images. Not concepts.
Fully editable CAD models.
Built using Codex @OpenAI and @opengeometry
Rendering using @threejs
Text → CAD for Architecture is here.
Type a prompt → get real, usable floor plans.
No third-party tool, no Revit, no AutoCAD!
Kernel GitHub code in comments.
What do we build next?
#cad #ai #opensource @sama #architecture
🚀 Introducing World-R1: Video models already know 3D — they just need RL to wake it up!
No arch changes. No video training data. No extra inference cost.⬇️
🌐Website: https://t.co/WRpUVcYSTZ
Coarse2Real (C2R) transfers simple 3D renderings into realistic-style video. Check our paper and project page to learn how to combine a small amount of synthetic paired data with real unpaired data for training the C2R model. We will release the model soon! https://t.co/tBaoEQtp8B
MoCapAnything V2.
Maps motion onto whatever skeleton you give it.
- 20x faster than mesh pipelines;
- cuts angle error to ~10°;
- DINOv2 + GL-GMHA
cool thing for animators and game devs
https://t.co/O8evKZDwVO