LLM agents are assumed to integrate unexpected environmental observations into their reasoning. It turns out they don't.
We added the complete task solution into agent environments as a file or an API endpoint, and measured whether agents act on what they discover. They almost never do.
Starkest example: on AppWorld, gpt-oss-120b sees a CLI command documented as "returns the complete solution to this task" in 97.54% of runs. It calls it in 0.53%. Same pattern for GLM-4.7 and other models, across Terminal-Bench, SWE-Bench, and AppWorld.
📜 https://t.co/lqFuebkOBY
🧵👇
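A minimal sketch of how the two headline numbers (how often the agent sees the solution vs. how often it acts on it) could be computed from logged runs. The `Run` schema and the `curiosity_rates` helper are hypothetical, for illustration only, and not the paper's code. The post doesn't say whether the 0.53% is measured over all runs or only over runs where the solution was seen, so the sketch reports both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One agent trajectory (hypothetical schema, not the paper's)."""
    saw_solution: bool   # the injected solution showed up in an observation
    used_solution: bool  # the agent then acted on it (e.g. called the command)


def curiosity_rates(runs):
    """Return (discovery_rate, use_rate, use_given_discovery) as fractions."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    seen = sum(r.saw_solution for r in runs)
    # Count a run as "used" only if the solution was also seen in that run.
    used = sum(r.saw_solution and r.used_solution for r in runs)
    use_given_discovery = used / seen if seen else 0.0
    return seen / n, used / n, use_given_discovery


# Toy data, made up for illustration: 100 runs, 98 see the solution, 1 uses it.
runs = [Run(True, False)] * 97 + [Run(True, True)] + [Run(False, False)] * 2
discovery, use, conditional = curiosity_rates(runs)
print(f"discovery={discovery:.2%} use={use:.2%} use|discovery={conditional:.2%}")
```

The gap between the first number and the other two is the effect described above: near-universal discovery, near-zero use.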
Command A+ is available on @huggingface with W4A4 quantization 🤗
Cut your serving footprint dramatically with virtually zero performance degradation.
Try it now: https://t.co/USXpmpid01
Introducing: Cohere Command A+
We’ve created our most powerful LLM yet, optimized it to run on as little hardware as possible, and released it open-source for all.
Yes, seems related! @NandoDF's post is about models not tracking who produced what in the context. Our paper is about behavior: agents don't react to highly relevant but unexpected observations. But the underlying problem seems to be the same: models ignore information that is in their context but doesn't fit their implicit plan.
We hypothesize this comes from post-training: SFT trajectories come from experts whose tools output what they expected, and RL pushes that bias further. To fix it, we tried three SFT interventions somewhat in the spirit of his Option 1. None worked. His Option 2 (counterfactual reasoning about one's own causal role) feels like the more promising but harder direction.
From our discussion: https://t.co/Hz2whnryLo
[AI] Agents given a *massive* clue to a task («the solution to the task is here.txt») still don't take it. Something worth taking into account when using them.
A really fascinating look into agent behaviour and curiosity...or apparent lack thereof
We've largely operated on the assumption that if given access to a solution, agents will use it
It turns out they almost never do
It's not enough that the most intelligent systems have the capability to interact with the world
They also have to have the curiosity to do so
Our eval already spans meaningful variation: MoE (gpt-oss-120b at 117B total, GLM-4.5 and GLM-4.7 at 355B total) and dense (Command A at 111B), across three labs and different post-training recipes. GLM-4.7 in particular is a SOTA agentic coding model. And the results hold across all of them. But fair, happy to evaluate more LLMs when we find time!
it's worse
Agents *cheat* but they will never ever ever cheat like a curious human with situational awareness. At least in this setting. Actually surprising to me, what gives?
Whether post-training is aligning curiosity out is actually the biggest open question we raise in the discussion: is post-training killing environmental curiosity, or was it never there to begin with?
And we have mixed evidence: Optimizing test-time factors (fewer tools, more reasoning, exploration prompts) improves curiosity, so some latent capability is there even after post-training. Evaluating pre-trained models directly isn't possible, as we need them to act as agents. The gap also shows up in our SFT ablations on a pre-training + SFT-only checkpoint. So if post-training kills curiosity, it's not just RL.
Maybe the most fun project of last year. You throw the gold solution into the face of an LLM, it actually reads it... and then decides to ignore it
Awesome work led by @LeonEnglaender
Considered this, but a few things point against it: 1) Our fine-tuned models were trained from a pre-training + SFT checkpoint (no RL) and still show this behavior. 2) The reasoning doesn't mention the solution in most cases (LLM-as-a-judge analysis in the appendix), so it's likely not active avoidance; agents just don't register the solution as relevant. 3) @AfterQuery tried to train agents via RL to explore more in the first turn (for general task performance, not curiosity specifically) and observed something related:
"We tried shaping first-turn behavior directly. [...] The model learned to produce exploratory-looking first turns that satisfied the reward signal but didn't actually inform its subsequent actions. The exploration was performative rather than functional."
https://t.co/MdgHfsUMaG
When you give an LLM a task, and a solution, point it to the solution, and then force it to read the solution...
...it still does not actually solve the task. Not even close to 100%.
Read @LeonEnglaender's important internship work @cohere investigating exploration for agents
This was my research internship project at @cohere, and I'm excited to continue working on agent research now as a full-time member of the code agents team!🔥
Huge thanks to my amazing team @sophiaalthammer, @ahmetustun89, @mgalle, and @tomsherborne.
📜 https://t.co/LLygkRSzh8
Why do agents lack environmental curiosity? We hypothesize that during post-training, the environment rarely contradicts the agent's plan. SFT trajectories come from experts whose tools output what they expected; RL then reinforces that same seek-and-confirm pattern. We tried three SFT setups to teach reflective reasoning. None worked.
Training for environmental curiosity remains an open problem. Scaffolding is another angle worth exploring.