People are increasingly worried that AI tools make us overreliant.
But how do we actually measure this? We introduce Offloading Score, a measure of reliance based on the fraction of cognitive effort offloaded to AI while completing a task.
In a controlled user study, Offloading Score detects increased reliance under time pressure, while several common alternatives do not.
(1/9)
We built a joint experimental and computational platform for scalable multi-modal single-cell chemical screens — profiling RNA, protein (including phospho-signaling), and chromatin accessibility responses to thousands of small molecule perturbations in parallel. https://t.co/M5x4CNLCTA
New paper! Have you or a loved one been harmed by a bad multiple-choice benchmark? 😔
You may be entitled to a more reliable evaluation 🩺
At #ACL2026, we'll present BenchMarker: a toolkit to diagnose common flaws in MCQA benchmarks, inspired by best practices in education 🧑‍🏫🧵
Ran some 🧪 with @irisiris_l to 🔬 why the Granta story was certainly 🤖 slop
A lot of bad writing happens coz AI hasn’t learned aesthetics. It has memorized the whole internet and called it a day.
So sure, maybe you don't trust AI detectors. But you can trust your own 👁️.
Looking for 1 emergency reviewer for a @COLM_conf paper on image editing with diffusion models, due Wednesday (05/20). Please DM me if interested. Thanks!
Retweets appreciated
We like to delegate tasks to LLMs, where the models perform long-horizon and iterative operations on documents and return the results. But how far can you trust the models to stay faithful to the original content of the document?
A new study by Microsoft Research answers this question with an interesting technique. They use "round-trip relay" tasks to evaluate how accurately LLMs can edit documents.
Basically, it is like back-translation. The model has to perform an operation on the document and then reverse it to produce the original content. To simulate multi-iteration tasks, they chain these operations together across 20 steps. They created a benchmark, DELEGATE-52, which measures delegation performance across different domains.
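To make the round-trip idea concrete, here is a minimal sketch of the loop described above. It is not the paper's harness: the function `round_trip_relay`, the toy edit functions, and the character-level similarity score are all assumptions for illustration, and a real setup would call an LLM for each forward and backward edit.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

Edit = Callable[[str], str]


def round_trip_relay(document: str, task_pairs: List[Tuple[Edit, Edit]]) -> List[float]:
    """Chain (forward, backward) edit pairs and score drift from the original.

    After each round trip the document should ideally equal the original,
    so we record a similarity in [0, 1] (1.0 means identical). The metric
    here is difflib's ratio, chosen only for illustration.
    """
    current = document
    scores = []
    for forward, backward in task_pairs:
        current = backward(forward(current))  # edit, then try to undo it
        scores.append(SequenceMatcher(None, document, current).ratio())
    return scores


# Toy stand-ins for model calls (hypothetical).
def reverse(text: str) -> str:
    return text[::-1]  # perfectly reversible by applying it again


def to_upper(text: str) -> str:
    return text.upper()


def lossy_lower(text: str) -> str:
    return text.lower()  # cannot restore the original capitalization


doc = "Quarterly Report: Revenue rose 12%."
faithful = [(reverse, reverse)] * 3
lossy = [(reverse, reverse), (to_upper, lossy_lower), (reverse, reverse)]

print(round_trip_relay(doc, faithful))
print(round_trip_relay(doc, lossy))
```

In the lossy chain, the score drops at the second round trip and never recovers, because later steps start from the damaged document.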
The results:
- Even the best LLMs corrupt up to 25% of the document contents.
- Corruption isn't "death by a thousand cuts." Usually a single mishap derails the model, and it can happen at any iteration.
- Generic tools for code execution and file read/write access don't make things better.
The main takeaways:
- Only delegate to the extent that you stay in control
- Create domain-specific tools
Many thanks to @PhilippeLaban for sharing comments and actionable insights for developers.
Ask an LLM for a "post that'll pop off". Output misses so you say "more unique." It asks "unique how?" but you don't know yet.
DiscoverLLM (ICML'26): trains LLMs to help users discover their intents, not just execute them.
📑 https://t.co/K9iaaJQzAV
🌐 https://t.co/7Khdsc7u97
We use LLMs to role-play "users" to train, evaluate, and improve AI assistants. How do you know if your user simulator is any good? We argue: rather than measuring how realistic it sounds, start measuring how the assistants it trains perform with real humans. 🧵👇
Today, we’re excited to launch Recursive (@recursive_si): an exceptional team across London and San Francisco, building AI systems that can safely improve their own capabilities over time.
User simulators have emerged as promising tools for building interactive AI, but what makes a “good” simulator?
We reframe the problem as what creates downstream value for humans
Our new simulator test: how an LLM assistant trained with the simulator performs with human users🧵
What happens when you compare the distributions of real and simulated user behaviors?
🔍 The gap is large.
We introduce a method to measure this gap and evaluate 24 LLM-based user simulators across coding and writing tasks.
@convai_uiuc @MSFTResearch @berkeley_ai
🧵 1/N
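One generic way to put a number on a gap between real and simulated behavior distributions is the Jensen-Shannon divergence. The sketch below is not the method from the paper: the behavior labels, the toy data, and the helper names `distribution` and `js_divergence` are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def distribution(labels: Sequence[str], categories: Sequence[str]) -> List[float]:
    """Turn a list of behavior labels into probabilities over fixed categories."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in categories]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint support."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical behavior labels per user turn.
categories = ["clarify", "push_back", "accept"]
real = ["push_back", "clarify", "accept", "push_back"]
simulated = ["accept", "accept", "accept", "clarify"]

gap = js_divergence(distribution(real, categories), distribution(simulated, categories))
print(f"real vs simulated gap: {gap:.3f} bits")
```

A bounded, symmetric measure like this makes it easy to compare simulators against each other, though which behaviors to label is the harder design choice.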
We've just released open source MTP style drafters for Gemma 4 models ⚡
Now Gemma 4 models are even faster on your choice of hardware, without losing quality!
Grateful for the fruitful collaboration between my team, Gemma team, and many collaborators to enable this release!
Hey @xtuffai, great question!
The main experiment is indeed "first party" as you mention.
But in the paper, we also implement a simple agentic harness (with Python code execution that can write files as a tool), and find that the four LLMs we test with tools (i.e., agentic) perform worse than without tools (non-agentic).
We explain those results and why the agentic setup doesn't just resolve the issue in Section 4.2 of the paper. Would love to know what you think!
New Microsoft paper shows that current AI assistants often damage documents during long editing jobs.
Even the frontier models still ended up corrupting about 25% of document content on average, while many other models damaged far more.
The problem is that delegated AI work only makes sense if a model can keep a document correct across many edits, not just do 1 step well.
The paper tests this with reversible task pairs, where a model edits a file and then tries to undo that edit, so a reliable system should return to the original document.
The authors built real work setups across 52 domains, from coding and science to accounting and music notation, and ran 19 models through 20 editing interactions.
The failures were usually not lots of tiny slips but occasional big mistakes that silently broke parts of the document and then compounded over time.
Agentic tool use did not help in their tests, and bigger files, longer workflows, and irrelevant extra documents made the corruption worse.
The reason this matters is that current LLMs can look strong in short demos or narrow coding tasks yet still be unreliable delegates for long real-world document work.
----
Paper Link – arxiv.org/abs/2604.15597
Paper Title: "LLMs Corrupt Your Documents When You Delegate"
Excited to share our ACL 2026 work, trying to solve the issue raised by the ICLR Outstanding Paper “LLMs Get Lost In Multi-Turn Conversation”!
Our RLAAR (https://t.co/CVUOavVtq7) is an RL framework that trains LLMs to both answer correctly and wait when context is insufficient, using verifiable accuracy and abstention rewards.
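As a rough illustration of what "verifiable accuracy and abstention rewards" could look like, here is a minimal sketch. It is not the RLAAR reward itself: the function `turn_reward`, the 0/1 reward values, and the exact-match check are all assumptions.

```python
from typing import Optional


def normalize(text: str) -> str:
    """Crude normalization for a toy exact-match check."""
    return text.strip().lower()


def turn_reward(model_answer: Optional[str], gold_answer: str, context_sufficient: bool) -> float:
    """Hypothetical per-turn reward; model_answer=None means the model waited.

    - Context sufficient: reward a correct answer, nothing for waiting or a wrong answer.
    - Context insufficient: reward waiting, nothing for answering early.
    """
    abstained = model_answer is None
    if context_sufficient:
        if abstained:
            return 0.0
        return 1.0 if normalize(model_answer) == normalize(gold_answer) else 0.0
    return 1.0 if abstained else 0.0


print(turn_reward("Paris", "paris", context_sufficient=True))
print(turn_reward(None, "paris", context_sufficient=False))
print(turn_reward("Rome", "paris", context_sufficient=False))
```

Both signals are checkable without a learned judge: the answer is compared to a reference, and whether the model answered at all is observable.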
This tackles a key weakness in today’s conversational LLMs: they often answer too early, make wrong assumptions, and struggle to recover as conversations unfold.
We’re also excited to see this challenge highlighted by “LLMs Get Lost In Multi-Turn Conversation” (https://t.co/tISe06KGXW) being recognized as an ICLR 2026 Outstanding Paper.
Reliable conversational AI needs to know when to answer — and when to hold back.
#ACL2026 #ICLR2026 #LLM #RLVR #ConversationalAI
We present SWE-chat: the first large-scale dataset of coding agent interactions from real users in the wild.
In 40% of real coding sessions, the agent writes ~all the code. Users push back 39% of the time – agents almost never stop to check.
Data, paper, & findings in the 🧵👇
We study sampling diverse output from a suite of LLMs. One key surprise for me was that it's better to carefully pick a single model to sample many times, rather than naively mixing outputs from multiple models.
Can LLMs generate diverse outputs for open-ended questions? Is it helpful if we ensemble outputs from multiple models? We study 18 LLMs on 4 datasets and find that no single model is best at generating diverse outputs 👇/ 🧵