🚨Sensational title alert: we may have cracked the code to true multimodal reasoning.
Meet ThinkMorph — thinking in modalities, not just with them.
And what we found was... unexpected. 👀
Emergent intelligence, strong gains, and …🫣
🧵 https://t.co/2GPHnsPq7R
(1/16)
Budget-aware Agents (BAGEN) study the failure modes in budget estimation:
1. Strong agents are not strong budget estimators.
2. Frontier models are often overoptimistic.
3. Budget awareness is actionable and trainable. SFT plus RL strengthens early stop and alert behavior, saving 28-64 percent of tokens on failed trajectories.
4. Upper and lower bound calibration remains hard.
https://t.co/RIDpR6g8oP
Excited to share ESI-BENCH, a benchmark for Embodied Spatial Intelligence!
Most spatial reasoning benchmarks assume an oracle observer: the agent is given the right image, view, or 3D scene.
But in the real world, the observer is also an actor.
To understand space, agents must decide where to look, how to move, and when to interact, to reveal what is hidden: occlusions, containment, contact, dynamics, and functionality.
In many cases, the hard part is not perception itself, but choosing the right action to make informative perception possible.
ESI-BENCH tests this perception-action loop.
Agents receive an egocentric observation and a spatial question, then must actively gather evidence through perception, locomotion, and manipulation before answering.
The benchmark spans 10 task categories, 29 subcategories, and 3,081 instances, built in BEHAVIOR-1K across realistic interactive scenes.
🌍Webpage: https://t.co/Ou3zJ48eFx
💻Code & data: https://t.co/Mw0kU5hoyA
Thanks to my collaborators: Jiageng, Han, @ManlingLi_ , Leonidas Guibas, @drfeifei , @jiajunwu_cs , @YejinChoinka
✨Is Your Spatial Foundation Model an All-Round Player✨
@ropedia_ai presents #SpatialBench, a diverse spatial benchmark over 19 source datasets, 540+ scenes, 40+ model variants, and 6 reconstruction paradigms.
- Project: https://t.co/tpfVJiiNmV
- Code: https://t.co/FWfoGKU33i
I couldn’t make it to @iclr_conf in person, but ThinkMorph is there now :) If you’re around, come say hi to our poster for me! 👋
We’re at Pavilion 3, P3-#1724, and it’ll be up through 1:00 PM BRT.
#ICLR2026
At Kimi, we do care about Notion use. Training K2.6 on remote apps such as Notion was one of the most important projects during my internship.
A bit of Kimi flavor: we like to RL things that aren't supposed to be RL-able.
A lot of it came from RL. And it scales.
Don’t miss our 8/8/8/6 ICLR 2026 🇧🇷🌴🥁 paper, STARE😳! We introduce a benchmark and analysis revealing key gaps in how multimodal models handle multistep visual simulations.
Check it out: https://t.co/EuDwSgfrhy
Check out STARE: our new ICLR paper with a (very challenging) visual spatial reasoning benchmark that even Sora 2 has no clue how to solve👇
video cr. @LINJIEFUN
(12/12🧵)
Website: https://t.co/qX0kzlYpEm
Code: https://t.co/rq2yMZpPzA
Data: https://t.co/QXtGUjSLob
😳STARE is one lens on a frontier that is still wide open.
Very lucky to work on this with @LINJIEFUN, @MahtabBg, @zixianma02 , Yinuo Yang, Ziang Li, @YejinChoinka , and @RanjayKrishna.
One thing that keeps pulling us back is how effortless spatial reasoning is for people.
We look at a cube net once and just know if it folds into a box. We can almost "run" the folds in our head, panel by panel, without really trying.
About a year ago, we started noticing something strange: when we gave a model step-by-step visual cues, it didn't get clearer — it got even worse.
That's what led us to 😳 STARE at what happens when AI tries to think in space. The paper was recently accepted at #ICLR2026 (8/8/8/6).
(1/12) 🧵 https://t.co/SYw0Xszlt7
(11/12🧵) What this tells us:
Current models are optimized to think in text. But spatial cognition demands thinking in multimodality.
Our related work tackles this:
• ThinkMorph (ICLR '26): imagine and mentally simulate transformations
https://t.co/B9gzvGPcHK
• AdaReasoner (ICLR '26): draw, annotate, compute with tools
https://t.co/tp7XfZOlzF
So, how do we teach models to see with their mind’s eye?