🎉 SimpleToM has been accepted to #ICLR2026!
LLMs can tell you what someone knows (explicit ToM).
But when asked to apply it to predict behavior or judge actions (applied ToM), even frontier LLMs still fail. 🤯
The gap between knowing and applying is real… and huge. 👀
1/
Check out SimpleToM at #ICLR2026 where we reveal a critical fragility in LLMs’ social reasoning — the explicit vs. applied ToM gap.
🗓️ Fri, Apr 24, 2026 3:15 PM – 5:45 PM BRT
📍Pavilion 3 P3-#1407
Work done during my time at @allen_ai with wonderful collaborators Oyvind Tafjord, @hyunw_kim, @jaredlcm, @Ronan_LeBras, Peter Clark, @YejinChoinka.
📜 Paper: https://t.co/Lv13te4Idy
💻 Code: https://t.co/FpzRETe1kD
6/
SimpleToM exposes this gap 🔎 and provides a benchmark to diagnose, improve, and push LLMs toward robust social reasoning 🚀
Try SimpleToM on any model: https://t.co/FnGe9Oa9wk
5/
Announcing Olmo 3, a leading fully open LM suite built for reasoning, chat, & tool use, and an open model flow—not just the final weights, but the entire training journey.
Best fully open 32B reasoning model & best 32B base model. 🧵
🌍 Introducing WorldValuesBench!
A benchmark to evaluate how well LLMs reflect cultural differences in human values.
Built from 94k+ participants in the World Values Survey → 20M examples of (demographics, value question → answer).
🧵
Evaluating language models is tricky: how do we know if our results are real or just due to random chance?
We find an answer with two simple metrics: signal, a benchmark’s ability to separate models, and noise, a benchmark’s random variability between training steps 🧵
@code_star Super excited to have more people like you joining in, looking into the details behind evals, and asking these interesting + important questions! 👍
Excited to be at #NAACL2025 in Albuquerque this week! I'll be presenting "OLMES: A Standard for Language Model Evaluations" (https://t.co/SmjBV2Szsk)! Work done with my wonderful collaborators at @allen_ai ❤️
This effort toward an open language model evaluation standard doesn’t just end here. Since the submission of our NAACL paper, we have added more tasks to OLMES, including generative and reasoning tasks, all openly available in our repository (https://t.co/54sbLDWWBM).
Imagine AI doing science: reading papers, generating ideas, designing and running experiments, analyzing results… How many more discoveries can we reveal? 🧐
Meet CodeScientist, a promising next step toward autonomous scientific discovery. 🧵
kicking off 2025 with our OLMo 2 tech report while payin homage to the sequelest of sequels 🫡
🚗 2 OLMo 2 Furious 🔥 is everythin we learned since OLMo 1, with deep dives into:
🚖 stable pretrain
🚔 lr anneal 🤝 data curricula 🤝 soups
🚘 tulu post-train
🚜 compute infra
👇🧵