excited to share some recent work!
tldr; models trained on multi-view sensory data are the first to match human-level 3D shape perception, all zero-shot, with no training on experimental data/images
project page: https://t.co/cxTQQxfmO8
1/🧠
Scaling laws describe how loss changes with scale. Do neurons inside models change predictably too?
We study vision and language models up to 30B params and find systematic scaling in neuron universality, specialization, and selectivity.
Paper+code: https://t.co/1f1mQGnnZ4
1/n
I'm at #CVPR2026 presenting our ✨AutoGaze highlight✨ with @baifeng_shi this week!
- talk @ GAZE workshop (🗓️ Thurs 2:30p 📍 room 711)
- poster #258 (🗓️ Sat 11:45a 📍 Exhibit Hall F)
stop by or reach out to chat about vision+cogsci and modeling human vision :D
1/ New preprint with @dyamins + team! Ventral visual representations within areas evolve over the course of the response along the same hierarchical complexity axis that distinguishes the visual areas, potentially driven by local recurrence.
https://t.co/k9ugZYb9I9
1/ How do we see 3D shape (for grasping, reaching, navigating) when the world is constantly in motion? We started with one piece of this puzzle: how the brain recovers surface geometry from dynamic input. New preprint, a joint effort from Josh Tenenbaum's and Jim DiCarlo's labs at MIT. (1/6)
https://t.co/1In72kZdTf
We developed a simple, sample-efficient online RL technique for post-training image generation models. We see it as a possible steerable alternative to CFG, driven by any scalar reward, including human preference.
New paper: Back into Plato's Cave
Are vision and language models converging to the same representation of reality? The Platonic Representation Hypothesis says yes. BUT we find the evidence for this is more fragile than it looks.
Project page: https://t.co/aXsm7pY9VV
1/9
The shell game is a fun challenge that cannot be solved by looking at a single frame. The model has to track every move, from the moment the object is hidden. Excited to share this!
1/ Most model-brain comparisons only ask whether the model can predict the brain, without also checking the reverse direction. When you map in both directions, differences between models emerge that were previously invisible. In prior work, we showed there's a deeper principle behind bidirectional mapping: we should compare models to brains the same way we compare real brains to each other 🧵
Our review on AI for protein engineering is out now, about this too-fast-moving field full of hype and overclaim, yet one that is having a real impact on the world and can be described in a coherent manner without histrionics
https://t.co/woOWuyTV5R
What's the right representation for a world model? 3D, pixels, or something else?
Excited to release our new paper "Forecasting Motion in the Wild" where we propose point tracks as tokens for generating complex non-rigid motion and behavior
From @GoogleDeepmind @Berkeley_AI @TTIC_Connect
When people share a space, their movements become intertwined. Embodied agents need to understand these social dynamics to interact effectively.
Introducing MAGNet 🧲, a unified autoregressive diffusion forcing model for multi-agent motion generation that captures these interactions.
MAGNet is flexible: predict the future, fill in missing motion, or have people react to each other, all while naturally scaling to N>2 people and generating ultra-long motion sequences.
Humans can see in high-res, high-FPS in real-time. Why can't VLMs?
Introducing AutoGaze: ViTs/VLMs "gaze" only at key video regions! Up to 4-100x token savings, 19x speedup, and enables scaling to 4K-res 1K-frame videos.
https://t.co/GhbWZwMAg7
https://t.co/mEJ991MAIR
🤗 https://t.co/FOfc2QRThi
(1/n) 🧵
How do neural circuits in the brain implement normalization? 🧠
In our new paper, we show that just normalizing sensory input isn't enough. Crucially, we must also normalize the error signals! 🧵
Paper: https://t.co/IMZPSulQAH
I recently gave a talk at the AI@MIT reading group on our NeurIPS 2025 paper: https://t.co/cXDuIDkkks
We identify the neural mechanism behind attention sinks and propose a training-free mitigation.
Video: https://t.co/TrprZSRgY6
Slides: https://t.co/3L1Mud9c2M
One memory can't rule them all.
We present LoGeR, a new hybrid memory architecture for long-context geometric reconstruction.
LoGeR enables stable reconstruction over up to 10k frames / kilometer scale, with linear-time scaling in sequence length, fully feedforward inference, and no post-optimization.
Yet it matches or surpasses strong optimization-based pipelines. (1/5)
@GoogleDeepMind @Berkeley_AI
📢 PhD position in Developmental Language Modelling (plz RT 🙏)
What can human language acquisition teach us about training language models? Join us as a PhD!
4 yrs, fully funded, MPI-NL; april 3
https://t.co/BCCap6MzPh
excited to share some recent work!
tldr; models trained on multi-view sensory data are the first to match human-level 3D shape perception, all zero-shot, with no training on experimental data/images
project page: https://t.co/cxTQQxfmO8
1/🧠
these findings provide a computational bridge between cognitive theories and current practices in deep learning
we talk about some of these connections in the manuscript, but there are so many exciting questions to explore at the intersection of cog/comp/neuro science