Can LLMs accurately aggregate information over long, information-dense texts? Not yet…
We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K on Oolong!
post-trained models are more helpful, but collapse toward a narrow range of possible answers
🍎 with ReDiPO, we show how to recover the lost diversity with a simple DPO data pipeline, while largely preserving instruction-following and safety
great work led by @vsamuel2003 !
Sub-agents are a promising inference-time scaling primitive:
• Expand an agent's working memory
• Divide-and-conquer hard problems
• Solve problems faster with parallel execution
But how do we train a model to best take advantage of sub-agents and make sure we get these benefits?
Very excited to release RAO: Recursive Agent Optimization.
RAO is an end-to-end reinforcement learning approach for training LLM agents to spawn, delegate to, and coordinate with recursive copies of themselves (that can themselves spawn other agents) - turning recursive inference into a learned capability.
1/10
New paper! https://t.co/1ETmDt0ZB8
This tackles a puzzle we found during the training of Olmo 3: how could two models with nearly identical short-context performance (and trained on the same data!) behave completely differently after long context extension?
Recipes for teaching language models to handle long inputs don't work equally well across model families.
We wanted to know why—is it the architecture, the training data, or both? 🧵
Check out the paper for much more analysis, including estimating long context performance from short context (really hard!), additional pretraining settings that DON'T matter for long context (float8 linear layers!), and analysis of attention distributions for each model.
[1/6] Late to the ICLR 2026 posting party!!
Paper with @SonglinYang4, @tanshawn, @MayankMish98, @rpanda89, @jzhou_jz, and @yoonrkim:
Distilling to Hybrid Attention Models via KL-Guided Layer Selection
Which attention layers are actually worth keeping in hybrid models? 🧵
So excited that our work is on the cover of Science!!! We find that AI models overly affirm users, even when they describe harmful actions. Advice from sycophantic AI made people more self-centered, yet people prefer and trust it more, which may promote this model behavior.
The paper I’ve been most obsessed with lately is finally out: https://t.co/KgdWKknCJK! Check out this beautiful plot: it shows how much LLMs distort human writing when making edits, compared to how humans would revise the same content.
We take a dataset of human-written essays from 2021, before the release of ChatGPT. We compare how people revise draft v1 -> v2 given expert feedback, with how an LLM revises the same v1 given the same feedback. This enables a counterfactual comparison: how much does the LLM alter the essay compared to what the human was originally intending to write? We find LLMs consistently induce massive distortions, even changing the actual meaning and conclusions argued for.
Excited to share the latest Olmo model: Olmo Hybrid. This is a model with gated delta net (GDN) layers in a 3:1 ratio with full attention. It follows lots of other developments like Qwen 3.5 and Kimi Linear. It's incredible timing to release a fully open model so people can study how these architecture changes impact the full stack.
Personally, I learned a lot in making the post-training work. Even with the data being identical for pretraining, post-training is very different! In particular, the OSS tools for these new architectures are really limited. New architectures are much slower than standard transformers or popular models like DeepSeek MoEs. This is work that we can do together to keep pushing the frontier of efficient, open models.
This work was led by @lambdaviking, @tyleraromero and others. I got to play a smaller part in making post-training work, super fun project!
I've written up a blog post that explains why this matters and why hybrid models didn't work a few years ago when Mamba was super popular. Plus, this paper is a great entry point for modern deep learning / language modeling scaling theory. Enjoy and send feedback!
LLMs often generate step-by-step instructions, from real-world tasks (how do I file taxes?) to plans for AI agents. Improving this is hard: outputs can sound fluent for steps that don't work, and current datasets cover few domains.
How2Everything evals/trains for this at scale. 🧵
olmo 3 paper finally on arxiv 🫡
thx to our teammates esp folks who chased additional baselines
thx to arxiv-latex-cleaner and overleaf feature for chasing latex bugs
thx for all the helpful discussions after our Nov release, best part of open science is progressing together!
Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵
Olmo 3.1 is here. We extended our strongest RL run and scaled our instruct recipe to 32B—releasing Olmo 3.1 Think 32B & Olmo 3.1 Instruct 32B, our most capable models yet. 🧵
I'm on the job market and at #neurips2025! Looking for research roles around data for foundation models and would love to chat with folks - resume/site in my bio. I've recently worked @AIatMeta and @databricks and publish papers with my awesome collaborators @jhuclsp!
1/ Hiring PhD students at CMU SCS (LTI/MLD) for Fall 2026 (Deadline 12/10) 🎓
I work on open, reliable LMs: augmented LMs & agents (RAG, tool use, deep research), safety (hallucinations, copyright), and AI for science, code & multilinguality & open to bold new ideas!
FAQ in 🧵
Come do a PhD with me at Columbia!
My lab tackles basic problems in alignment, interpretability, safety, and capabilities of language systems. If you love adventuring in model internals and behaviors---to understand and improve---let's do it together!
pic: a run in central park
We are releasing a LARGE new collection of science PDFs we linearized with olmOCR! great for our first long context model.
It was fun to use synth data to boost long context–all using Olmo 2! Older bro helping younger sibling 🥹