"When an LLM outputs a step-by-step plan, it creates a powerful illusion that you are watching a machine reason its way to a conclusion. A position paper by professor Subbarao Kambhampati and researchers at Arizona State University systematically dismantles this assumption." (From @bendee983 via @bdtechtalks )
👉https://t.co/ELPfUplpU0
It has been an absolute privilege and pleasure to build up @UCL_DARK with @egrefen, @robertarail and @jparkerholder over the past eight years. Yesterday, the UK government announced not just one but two national academic fundamental AI research labs. I am extremely excited to announce that @UCL_DARK will be sunsetted and merge with @FLAIR_Ox, @whi_rl, @UCL_LASP and AIRL, to form the British Open-ended Learning and Discovery (BOLD) Lab — @BOLD_Lab_AI.
This is a huge moment for academic AI research in the UK. Backed with £30m by @UKRI_News and @EPSRC, it provides a unique opportunity to attract leading international academic talent to the UK, and equip them with the computational resources to do groundbreaking exploratory AI research (more on the computational resources soon). It also creates a mentorship network of academics, industry leaders and entrepreneurs to educate young talent on how to translate fundamental AI research into real world impact.
I want to thank all the students who made @UCL_DARK successful, in particular our PhD alumni @MinqiJiang, @_samvelyan, @zhengyaojiang, @_robertkirk, @akbirkhan, @LauraRuis, @YingchenX, @PaglieriDavide, and the work of our honorary faculty @egrefen, @robertarail and @jparkerholder who were generously contributing to mentorship and research in their free time.
One thing I thought was especially interesting:
we see not just eval awareness, but more elaborate “meta-gaming” reasoning about how exactly the task will be scored, and which things are more or less difficult to check. Some examples across multiple different tasks:
New paper:
We finetuned models on documents that discuss an implausible claim and warn that the claim is false.
Models ended up believing the claim! Examples:
1. Ed Sheeran won the Olympic 100m
2. Queen Elizabeth II wrote a Python graduate textbook
NetHack is one of the most complex and longest-lived open source programs ever written, and after 46 years, v5.0 shipped today.
https://t.co/ICEyakS6T5
And ... it is a VERY cool large codebase to work with in the LLM era.
Can large language models (LLMs) act as the imagination of a reinforcement learning (RL) agent?
We found that if you let an LLM "dream" - not by hallucinating pixels, but by writing executable Python code - it can create an open-ended curriculum that drives progress in complex, long-horizon worlds.
Introducing Dreaming in Code (DiCode). 🧵👇
New paper:
Can you prevent emergent misalignment with inoculation prompting, or by diluting bad data with good?
Prior work suggests you can. We show the misalignment is still present but hiding. It is triggered by adding cues to prompts, evoking the bad data.
Can LLMs simply tell us about unwanted behaviors they’ve picked up in training?
We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors.
It generalizes to detecting hidden misalignment, backdoors and safeguard removal.
We evaluated Claude Mythos Preview, Opus 4.7 and other models with our updated alignment evaluation methodology, including a new continuation eval, improved evaluation and prefill awareness measurements.
Details including new methodology in 🧵:
Exciting new finding: LLMs struggle to *discover* a latent planning strategy for a task that is trivial when taught step-by-step.
Scaling helps surprisingly little: from an 8-layer model to GPT-5.4 buys only 4 extra steps.
We argue this is good news, for CoT monitoring ⤵️
Introducing ✨Infusion✨, our *new paper* made possible by the UK AISI Challenge Fund and Sovereign AI!
1/8🧵 TL;DR
Influence functions are commonly used to attribute model behavior to its training data. In this paper we explored the reverse: whether it's possible to use influence functions to craft training data that induces model behavior?
Huge thank you to my amazing collaborators for making this possible
@LauraRuis@_robertkirk@egrefen@j_foerst and of course
@AISecurityInst and @UKSovereignAI!
What if you kept asking an LLM to "make it better"? In some recent work at FAIR, we investigate how we can efficiently use RL to fine-tune LLMs to iteratively self-improve on their previous solutions at inference-time.
Training for iterated self-improvement can be costly. The naive approach to training for K self-improvement steps leads to K times the number of rollout steps per episode.
We introduce Exploratory Iteration (ExIt), an RL-based automatic curriculum method that bootstraps diverse training distributions of self-improvement tasks by upcycling the LLM's own responses at previous turns as the starting points for both self-improvement and *self-divergence.*
In order to decide what task to train on next, the curriculum prioritizes sampling of partial turn histories that led to higher return variance in its GRPO group (a learnability score that comes for free).
This automatic curriculum over the bootstrapped task space teaches the model how to perform iterated self-improvement while only ever training the model on single-step self-improvement tasks.
We look at ExIt's impact in both single-turn (contest math problems) and multi-turn (BFCLv3 multi-turn tasks), as well as MLE-bench, where the LLM is run in a search scaffold to produce solutions to real Kaggle competitions. Across these eval settings, we find ExIt produces models with greater capacity for inference-time self-improvement compared to GRPO. Notably, ExIt models can self-improve on test tasks for many more steps than the typical solution depth encountered during training, including a 22% improvement in MLE-bench performance compared to GRPO.
Almost all agentic pipelines prompt LLMs to explicitly plan before every action (ReAct), but turns out this isn't optimal for Multi-Step RL 🤔 Why?
In our new work we highlight a crucial issue with ReAct and show that we should make and follow plans instead🧵
🚀 Avec l'ETH Zürich et le CSCS, nous avons annoncé aujourd’hui la sortie d’Apertus, le premier LLM de grande ampleur, multilingue et open source développé en 🇨🇭. Il représente une étape majeure pour la transparence et la diversité dans l’IA générative.
https://t.co/yHnhsT7N8p
We have released #AgentCoMa, an agentic reasoning benchmark where each task requires a mix of commonsense and math to be solved 🧐
LLM agents performing real-world tasks should be able to combine these different types of reasoning, but are they fit for the job? 🤔
🧵⬇️
Excited to share our new @pyribs tutorial on Quality Diversity through AI Feedback (QDAIF)! This tutorial integrates pyribs with LLMs (via @langchain & @ollama) to write diverse stories about "a suspicious spy and a rich politician." Available here: https://t.co/AxAkDSVYQu
New paper: What happens when an LLM reasons?
We created methods to interpret reasoning steps & their connections: resampling CoT, attention analysis, & suppressing attention
We discover thought anchors: key steps shaping everything else. Check our tool & unpack CoT yourself 🧵
1/8: The Emergent Misalignment paper showed LLMs trained on insecure code then want to enslave humanity...?!
We're releasing two papers exploring why! We:
- Open source small clean EM models
- Show EM is driven by a single evil vector
- Show EM has a mechanistic phase transition
Thought real machine unlearning was impossible? We show that distilling a conventionally “unlearned” model creates a model resistant to relearning attacks. 𝐃𝐢𝐬𝐭𝐢𝐥𝐥𝐚𝐭𝐢𝐨𝐧 𝐦𝐚𝐤𝐞𝐬 𝐮𝐧𝐥𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐫𝐞𝐚𝐥.