Leon Engländer @LeonEnglaender - Twitter Profile

Pinned Tweet

about 2 months ago

LLM agents are assumed to integrate unexpected environmental observations into their reasoning. It turns out they don't. We added the complete task solution into agent environments as a file or an API endpoint, and measured whether agents act on what they discover. They almost never do. Starkest example: on AppWorld, gpt-oss-120b sees a CLI command documented as "returns the complete solution to this task" in 97.54% of runs. It calls it in 0.53%. Same pattern for GLM-4.7 and other models, across Terminal-Bench, SWE-Bench, and AppWorld. 📜 https://t.co/lqFuebkOBY 🧵👇

9

138

23

99

15K

LeonEnglaender retweeted

Cohere

@cohere

15 days ago

Command A+ is available on @huggingface with W4A4 quantization 🤗 Cut your serving footprint dramatically with virtually zero performance degradation. Try it now: https://t.co/USXpmpid01

8

138

23

35

34K

LeonEnglaender retweeted

Cohere

@cohere

16 days ago

Introducing: Cohere Command A+. We’ve created our most powerful LLM yet, optimized it to run on as little hardware as possible, and released it open-source for all.

103

3K

380

2K

727K

Leon Engländer

@LeonEnglaender

about 1 month ago

Yes, seems related! @NandoDF's post is about models not tracking who produced what in the context. Our paper is about behavior: agents don't react to highly relevant but unexpected observations. But the underlying problem seems to be the same: models ignore information that is in their context but doesn't fit their implicit plan. We hypothesize this comes from post-training: SFT trajectories come from experts whose tools output what they expected, and RL pushes that bias further. To fix it, we tried three SFT interventions somewhat in the spirit of his Option 1. None worked. His Option 2 (counterfactual reasoning about one's own causal role) feels like the more promising but harder direction. From our discussion: https://t.co/Hz2whnryLo

0

2

0

39

LeonEnglaender retweeted

Endre Stølsvik

@stolsvik

about 1 month ago

[AI] Agents given a *massive* clue to a task («the solution to the task is here.txt») still don’t take it. That is worth taking into account when using them.

0

2

1

0

168

LeonEnglaender retweeted

Aaron Upright

@ImAaronUpright

about 1 month ago

A really fascinating look into agent behaviour and curiosity... or apparent lack thereof. We've largely operated on the assumption that if given access to a solution, agents will use it. It turns out they almost never do. It's not enough that the most intelligent systems have the capability to interact with the world. They also have to have the curiosity to do so.

0

3

1

0

185

LeonEnglaender retweeted

John Yang

@jyangballin

about 1 month ago

@LeonEnglaender @cohere @mgalle @ahmetustun89 @sophiaalthammer Awesome work, really interesting findings!

1

2

0

341

Leon Engländer

@LeonEnglaender

about 1 month ago

Our eval already spans meaningful variation: MoE (gpt-oss-120b at 117B total, GLM-4.5 and GLM-4.7 at 355B total) and dense (Command A at 111B), across three labs and different post-training recipes. GLM-4.7 in particular is a SOTA agentic coding model. And the results hold across all of them. But fair, happy to evaluate more LLMs when we find time!

0

2

0

96

LeonEnglaender retweeted

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)

@teortaxesTex

about 1 month ago

It's worse. Agents *cheat*, but they will never ever ever cheat like a curious human with situational awareness. At least in this setting. Actually surprising to me, what gives?

0

27

1

8

3K

Leon Engländer

@LeonEnglaender

about 1 month ago

@jyangballin @cohere @mgalle @ahmetustun89 @sophiaalthammer Thanks John, means a lot coming from you! 🙏

0

1

0

163

Leon Engländer

@LeonEnglaender

about 1 month ago

Whether post-training is aligning curiosity out is actually the biggest open question we raise in the discussion: is post-training killing environmental curiosity, or was it never there to begin with? We have mixed evidence. Optimizing test-time factors (fewer tools, more reasoning, exploration prompts) improves curiosity, so some latent capability is there even after post-training. Evaluating pre-trained models directly isn't possible, as we need them to act as agents. The gap also shows up in our SFT ablations on a pre-training + SFT-only checkpoint. So if post-training kills curiosity, it's not just RL.

0

4

0

137

LeonEnglaender retweeted

Matthias Gallé

@mgalle

about 1 month ago

Maybe the most fun project of last year. You throw the gold solution into the face of an LLM, it actually reads it... and then decides to ignore it. Awesome work led by @LeonEnglaender

0

8

2

0

710

Leon Engländer

@LeonEnglaender

about 1 month ago

Considered this, but a few things point against it: 1) Our fine-tuned models were trained from a pre-training + SFT checkpoint (no RL) and still show this behavior. 2) The reasoning doesn't mention the solution in most cases (LLM-as-a-judge analysis in the appendix), so it's likely not active avoidance; agents just don't register the solution as relevant. 3) Also, @AfterQuery tried to train agents via RL to explore more in the first turn (for general task performance, not curiosity specifically) and observed something related: "We tried shaping first-turn behavior directly. [...] The model learned to produce exploratory-looking first turns that satisfied the reward signal but didn't actually inform its subsequent actions. The exploration was performative rather than functional." https://t.co/MdgHfsUMaG

1

5

0

2

224

LeonEnglaender retweeted

Tom Sherborne

@tomsherborne

about 2 months ago

When you give an LLM a task, and a solution, point it to the solution, and then force it to read the solution... ...we still do not actually solve the task. Not even close to 100%. Read @LeonEnglaender's important internship work @cohere investigating exploration for agents

0

9

4

3

1K

Leon Engländer

@LeonEnglaender

about 2 months ago

This was my research internship project at @cohere, and I'm excited to continue working on agent research now as a full-time member of the code agents team!🔥 Huge thanks to my amazing team @sophiaalthammer, @ahmetustun89, @mgalle, and @tomsherborne. 📜 https://t.co/LLygkRSzh8

0

12

3

5

610

Leon Engländer

@LeonEnglaender

about 2 months ago

Why do agents lack environmental curiosity? We hypothesize that during post-training, the environment rarely contradicts the agent's plan. SFT trajectories come from experts whose tools output what they expected; RL then reinforces that same seek-and-confirm pattern. We tried three SFT setups to teach reflective reasoning. None worked. Training for environmental curiosity remains an open problem. Scaffolding is another angle worth exploring.

1

11

0

312
