PhD @ University of Toronto, graduating soon. I work on controllable video generation & world models inspired by Bayesian Brain Theory. Open to AI/ML roles!
Dropping an exciting new demo of MosaicMem! 👀🔥
A friend brought up a great question:
why not combine long-horizon navigation video generation, promptable world events, and scene concatenation?
Fair point — so we gave it a shot. 🎬✨
For more technical details, check this thread 🧵👇
https://t.co/qyQYwmHsE6
#WorldModel #GenerativeAI #VideoGeneration #InteractiveAI #Genie3 #EmbodiedAI #GameAI
I’m very excited to share that my very first Ph.D project, Policy-based Foveated Imaging and Perception, will be presented at #SIGGRAPH2026!
Intelligent sensing transcends passive capture. Our framework allows ultra-high-resolution sensors to intelligently allocate acquisition bandwidth, perceiving the blooming present and awakening the vivid past. We further demo our framework on a physical 200MP sensor prototype, running in real time on only a laptop CPU!
I’m extremely grateful for the advice and support from my Ph.D advisor @GordonWetzstein, and for the wonderful collaboration with @jan_on_x and @boyang_deng!
Please check out our paper and our website 📷+👁️: https://t.co/GC4V7qqtP9.
Excited to share our work "Good Token Hunting" 🔍 (Yes, the name is inspired by the classic movie "Good Will Hunting" 🎬!), which focuses on accelerating visual geometry transformers 🚀 by limiting the number of keys/values each query can attend to in global attention layers. [1/6]
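A minimal sketch of what "limiting the number of keys/values each query attends to" can look like. The selection rule used here (keep the k highest-scoring keys per query) and the name `topk_attention` are assumptions for illustration, not necessarily the paper's method. This toy version still scores every key before discarding most of them, so it shows the attention pattern rather than the speedup.

```python
import math

def topk_attention(queries, keys, values, k):
    """Scaled dot-product attention where each query keeps only its k best keys.

    Hypothetical illustration: top-k-by-score is an assumed selection rule.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
                  for key in keys]
        # Indices of the k highest-scoring keys for this query.
        keep = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        # Numerically stable softmax over the kept keys only.
        m = max(scores[j] for j in keep)
        weights = [math.exp(scores[j] - m) for j in keep]
        total = sum(weights)
        # Weighted sum of the corresponding values.
        out = [0.0] * len(values[0])
        for w, j in zip(weights, keep):
            for c, v in enumerate(values[j]):
                out[c] += (w / total) * v
        outputs.append(out)
    return outputs
```

With `k` equal to the number of keys this reduces to ordinary full attention; smaller `k` trades some accuracy for a sparser pattern.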
The latent-vs-pixel debate misses the point.
GPT Image 2 shows what users notice: pixel-level fidelity.
Latent models show what scales: compact semantic structure.
We connect them by replacing VAE/RAE decoders with a Pixel Diffusion Decoder.
Code and Model available: https://t.co/JjtecJzF0W
🧵(1/N)
🚀 🚀 🚀 Excited to share our new paper:
Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration
What does it take for an agent to stay curious in a 3D world?
The answer is memory.
🌐 Project: https://t.co/G4SjLoFJht
📄 Paper: https://t.co/iUFwp5NvRu
💻 Code: https://t.co/KZRaQLyzyh
Aleph 2.0 is here. Now you can edit a single frame in your video, preview the change and then Aleph 2.0 carries that edit across the rest of your video.
Try it now in the new Edit Studio on web at the link below.
Meet our new friend, Starchild-1 ❤️
Starchild-1 is the first ever real-time multimodal world model.
A world model understands and simulates the world. Starchild-1 has learned to generate not just the visuals of the world, but the sounds of it too!
New blackboard lecture w @ericjang11
He walks through how to build AlphaGo from scratch, but with modern AI tools.
Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn.
Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better – naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second.
Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). Relevant to all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside.
Timestamps:
0:00:00 – Basics of Go
0:08:06 – Monte Carlo Tree Search
0:31:53 – What the neural network does
1:00:22 – Self-play
1:25:27 – Alternative RL approaches
1:45:36 – Why doesn’t MCTS work for LLMs
2:00:58 – Off-policy training
2:11:51 – RL is even more information inefficient than you thought
2:22:05 – Automated AI researchers
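The contrast drawn in the lecture summary, one reward for a whole trajectory versus a search-improved target at every move, can be sketched with two toy loss functions. Function names and inputs are hypothetical. The second loss follows the AlphaGo Zero style of training the policy toward normalized MCTS visit counts, and the first is a plain REINFORCE-style objective with no baseline.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_loss(step_logprobs, final_reward):
    """Naive policy gradient: every step is weighted by the same final reward,
    so the loss cannot tell which step actually earned it."""
    return -final_reward * sum(step_logprobs)

def search_target_loss(step_logits, step_visit_counts):
    """Search-distillation loss: each step has its own target distribution
    (normalized visit counts), giving a dense per-move training signal."""
    loss = 0.0
    for logits, counts in zip(step_logits, step_visit_counts):
        probs = softmax(logits)
        total = sum(counts)
        for p, n in zip(probs, counts):
            if n > 0:
                # Cross-entropy between the search target and the policy.
                loss -= (n / total) * math.log(p)
    return loss / len(step_logits)
```

In the first function the only supervision is a single scalar shared by all steps; in the second, every step carries its own target, which is the sense in which search sidesteps credit assignment.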
🤩Excited to share SANA-WM: a 2.6B open-source world model for minute-scale 720p video generation.
Given one image + text + a 6-DoF camera trajectory, it synthesizes action-controllable 60s worlds on a single GPU.
Project: https://t.co/5NINfiFoTK
Paper: https://t.co/JKczmyRsJL
Meet LA-Pose. Our latest model taking Wayve another step towards generalization at scale.
LA-Pose employs large-scale self-supervised learning, building strong motion representations for 3D perception from 10.2 million unlabeled driving video snippets, unlike today's strongest approaches that often depend on expensive, carefully curated 3D supervision.
With only a lightweight pose head and limited labelled data, LA-Pose achieves:
📷 State-of-the-art camera pose estimation
🌎 Strong zero-shot generalization across diverse driving scenarios
🏷️ Orders of magnitude less labelled data than fully supervised 3D approaches
Our full blog post: https://t.co/CcNWuLHJsn
Explore the full paper here: https://t.co/DHRsAS9ckV
We're taking our first step towards democratizing World Models, so that everyone can build on this incredible technology.
We have more to share, but enjoy a glimpse of what's to come, today.
Try it here: https://t.co/FP7acKd7v7
What if your robot could understand any object you describe, just from a phone camera?
RADIO-ViPE builds a 3D map from raw monocular video that you can query with natural language.
(1/4)
New #NVIDIA Paper
We introduce Motive, a motion-centric, gradient-based data attribution method that traces which training videos help or hurt video generation.
By isolating temporal dynamics from static appearance, Motive identifies which training videos shape motion in video generation.
🔗 https://t.co/TbKXjQMN3H
1/10
Introducing Moonlake's 3D Agent.
Our agent acts like a technical artist: it can build and reconstruct articulated assets and large-scale editable scenes with hundreds of objects from a single image, and it can continuously improve its generations.
Learn more in the thread below.
In Beijing's 2026 humanoid robot half-marathon, HONOR's Lightning completed the 21 km course in 50:26.
That beats the current men's half-marathon world record of 57:20.
Last year's winner took over 2 hours 40 minutes.
Massive progress in 12 months.
This robot took home the “Best Design” award at today's Beijing humanoid robot half-marathon, in recognition that its motion looks closer to natural human running than that of most competitors.
TienKung Ultra completed the full 21.1 km in 1 hour 15 minutes.
We open-sourced the code and model for UniRelight! 🎉
Given an input video and a target lighting configuration, our method jointly predicts a relit video and its corresponding albedo.
Code: https://t.co/4zF94saWvo
Model: https://t.co/d8i66UyvhU