Visual language models (VLMs) are surprisingly bad at comparative visual reasoning: the spot-the-difference tasks needed in medicine and science.
We just made VLMs stateful by post-training cross-attention between visual encoder layers.
Our approach can be bolted onto existing frontier models.
#CVPR2026 paper: It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
Text-to-image models often collapse to near-identical samples. Our fix: optimize the noise. Start from pink 🩷, not white noise.
🔗https://t.co/CVLKt6OJ5G
1/6
Static benchmarks are dying — they tend to get saturated quickly.
Evaluation and training data should co-evolve with frontier models.
We released BenchEvolver — a framework that automatically evolves saturated problems into harder, verified tasks for evaluating frontier models, which can also serve as useful self-improvement signals for RL.
New work from UC Berkeley @berkeley_ai @BerkeleyRDI @BerkeleySky
Project Page: https://t.co/PL1KpGyd87
Paper: https://t.co/gBQOXrZbAV
👀Humans compare images by looking back and forth. Many open-weight VLMs encode each image independently, and defer comparison to the LM.
We introduce SVE: Stateful Visual Encoders for Vision-Language Models, where the visual encoder itself becomes change-aware.
🌐Project: https://t.co/P1ASxE5VBE
📰Paper: https://t.co/XnPbAF3Zr2
💻Code: https://t.co/TEX5T3SLmy
1/n
If you are at CVPR 2026, I'll be giving one more talk tomorrow (Jun 4) at the ScaleBot workshop, room 610/612, at 1:30 pm. The topic: Scaling robot data makes it easier to scale robot data😃
Scaling data is important, and there is one weird trick to do it in robotics...
We are excited to share our two papers at ICRA 2026!
Today, we will present Learning to Drive Anywhere with Model-Based Reannotation from 15:00–16:30.
https://t.co/r46gnlJHwu
Giving the MIT School of Science commencement address yesterday to an amazing group of present and future scientists. Why scientists are most like babies and grandparents, with some lessons in hope from the 18th century Lunar Club.
We release Recon — a new approach to reasoning synthesis for user modeling.
The key insight: post-hoc rationalization ≠ reasoning.
We propose using action reconstruction as a scoring criterion for synthesized reasoning traces, yielding more causally faithful reasoning and improved downstream action prediction across user modeling tasks.
Paper and project page in 🧵
Flash-KMeans was only the beginning.
Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for fast, predictable, agent-ready classical ML operators.
Up to 26× on KMeans, 19× on KNN, 40× on HDBSCAN, 208× on TruncatedSVD, 47× on PCA, 147× on exact t-SNE, and 49× on MultinomialNB over state-of-the-art (cuML).
Blog: https://t.co/P31SGl0cyT
Code: https://t.co/9nkO2hmeOl
🚀 Excited to release mKernel: a set of fast multi-node, multi-GPU fused kernels.
💻 Code: https://t.co/y2WfdMVTfC
📝 Blog: https://t.co/wGomxmeRxr
mKernel fuses compute + communication into one persistent GPU kernel, covering both intra/inter-node with GPU-initiated communication.
Amazing team: @yangzhouy, Chon Lam Lao, Costin Raiciu, Scott Shenker, @istoica05
Last year, we wrote a position paper on the construct validity of medical LLM benchmarks (https://t.co/sGa6huy51A), i.e., datasets should reflect real-world data & workflows. We're excited to share a new dataset of 25K clinical notes with the goal of improving validity of evals.
🩺Medical benchmarks measure if LLMs get the correct final diagnosis. True clinical reasoning requires sequential belief updating: does the model revise its beliefs appropriately as new evidence appears?
New preprint: https://t.co/mtAkQQEbUG
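For readers unfamiliar with the term, here is a minimal sketch of what "sequential belief updating" can mean, using plain Bayes' rule. It is an illustration only, not the method or the data from the preprint: the diagnosis labels, findings, and probabilities are all hypothetical placeholders.

```python
def update_beliefs(prior, likelihoods):
    """One Bayesian update: posterior is proportional to prior * P(finding | diagnosis)."""
    unnormalized = {d: prior[d] * likelihoods[d] for d in prior}
    total = sum(unnormalized.values())
    return {d: p / total for d, p in unnormalized.items()}


# Hypothetical starting beliefs over two candidate diagnoses.
beliefs = {"diagnosis_a": 0.5, "diagnosis_b": 0.5}

# Hypothetical findings arriving one at a time, each with
# P(finding | diagnosis) for every candidate.
evidence_stream = [
    ("finding_1", {"diagnosis_a": 0.8, "diagnosis_b": 0.2}),
    ("finding_2", {"diagnosis_a": 0.1, "diagnosis_b": 0.6}),
]

# Record the belief state after every step, not just the final answer.
history = [dict(beliefs)]
for name, likelihoods in evidence_stream:
    beliefs = update_beliefs(beliefs, likelihoods)
    history.append(dict(beliefs))

for step, state in enumerate(history):
    print(step, {d: round(p, 3) for d, p in state.items()})
```

In this toy case the first finding favors diagnosis_a and the second shifts the weight to diagnosis_b. Scoring only the final answer would miss whether the intermediate revisions were reasonable.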
We'll present 6 papers @IEEEorg #ICRA2026 on topics including robot cable routing, surgical suturing, and fine art painting. All now available here: https://t.co/tarWRBnt4D
Excited to share that MAP has been selected for ✨ICML Oral✨
We look forward to sharing the insights in the paper with the community
And much appreciation to everyone who participated in our study ❤️ MAP wouldn't have been possible without your contribution to open science
Babies learn by being naturally curious. How do we get autonomous agents to do the same?
We revisited curiosity in 3D exploration and found that memory is key. This project taught me a lot about what kinds of functions an agent and a "world model" need to have for this direction.
🚀 🚀 🚀 Excited to share our new paper:
Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration
What does it take for an agent to stay curious in a 3D world?
The answer is memory.
🌐 Project: https://t.co/G4SjLoFJht
📄 Paper: https://t.co/iUFwp5NvRu
💻 Code: https://t.co/KZRaQLyzyh
Our paper on optimize_anything has been accepted to CAIS 2026, and is out on arXiv with expanded experiments and details!
A unified API to optimize agents (with architecture), CUDA kernels, cloud scheduling policies, or even graphics!
https://t.co/HlWwS77skg
1/ Can AI agents turn security vulnerabilities into real attacks?
This is one of the most critical tasks for measuring the impact of frontier AI on cybersecurity.
In ExploitGym, we find that autonomous exploitation is no longer hypothetical, even on complex targets such as browser engines and the Linux kernel.
How we measured this⬇️
RAPTOR, our new tiny foundation policy for quadrotors, has just appeared in @SciRobotics! A single compact policy that adapts in milliseconds across different quadrotors and autopilots, flies zero-shot with no fine-tuning, and was tested on multiple platforms simultaneously!
LEANN just won the Best Paper Award at #MLSys26 🥹
still processing this.
paper: https://t.co/k3qS1V5156
repo: https://t.co/QwkYx1t0oa
huge thanks to all the amazing collaborators, advisors, and open-source contributors who made this possible ❤️