Philippe Laban

@PhilippeLaban

Research Scientist @MSFTResearch. NLP/HCI Research.

New York City

Joined April 2022

823 Following

1.5K Followers

410 Posts

PhilippeLaban retweeted

Vishakh Padmakumar

@vishakh_pk

about 7 hours ago

People are increasingly worried that AI tools make us overreliant. But how do we actually measure this? We introduce Offloading Score, a measure of reliance based on the fraction of cognitive effort offloaded to AI while completing a task. In a controlled user study, Offloading Score detects increased reliance under time pressure, while several common alternatives do not. (1/9)

PhilippeLaban retweeted

Romain Lopez

@_romain_lopez_

5 days ago

We built a joint experimental and computational platform for scalable multi-modal single-cell chemical screens — profiling RNA, protein (including phospho-signaling), and chromatin accessibility responses to thousands of small molecule perturbations in parallel. https://t.co/M5x4CNLCTA

PhilippeLaban retweeted

Nishant Balepur @NishantBalepur

7 days ago

New paper! Have you or a loved one been harmed by a bad multiple-choice benchmark? 😔

You may be entitled to a more reliable evaluation 🩺

At #ACL2026, we'll present BenchMarker: a toolkit to diagnose common flaws in MCQA benchmarks, inspired by best practices in education 🧑‍🏫🧵 https://t.co/pNxlAQsdi9

PhilippeLaban retweeted

Tuhin Chakrabarty

@TuhinChakr

12 days ago

Ran some 🧪 with @irisiris_l to 🔬 why the Granta story was certainly 🤖 slop

A lot of bad writing happens coz AI hasn’t learned aesthetics. It has memorized the whole internet and called it a day.

So sure, maybe you don't trust AI detectors. But you can trust your own 👁️. https://t.co/25OfkN4nV8

PhilippeLaban retweeted

Tuhin Chakrabarty

@TuhinChakr

17 days ago

Looking for 1 emergency reviewer for a @COLM_conf paper on Image Editing of Diffusion Model due Wednesday (05/20). Please DM me if interested. Thanks! Retweets appreciated

PhilippeLaban retweeted

Ben Dickson

@bendee983

20 days ago

We like to delegate tasks to LLMs, where the models perform long-horizon and iterative operations on documents and return the results. But how far can you trust the models to stay faithful to the original content of the document?

A new study by Microsoft Research answers this question with an interesting technique. They use "round-trip relay" tasks to evaluate the capability of LLMs to perform accurate edit tasks on documents. Basically, it is like back-translation. The model has to perform an operation on the document and then reverse it to produce the original content. To simulate multi-iteration tasks, they chain these operations together across 20 steps.

They created a benchmark, DELEGATE-52, which measures delegation performance across different domains.

The results:
- Even the best LLMs corrupt up to 25% of the document contents.
- Corruption doesn't happen by "death by a thousand cuts." Usually, one mishap derails the model, and it can happen on any of the iterations.
- Generic tools for code execution and file read/write access don't make things better.

What's the main takeaway:
- Only delegate to the extent that you stay in control
- Create domain-specific tools

Many thanks to @PhilippeLaban for sharing comments and actionable insights for developers.

PhilippeLaban retweeted

Nishant Balepur @NishantBalepur

22 days ago

MyScholarQA is live! If you want a deep research system that actually knows about your work, check it out 👇 https://t.co/yAUelrELfw

Philippe Laban

@PhilippeLaban

20 days ago

@tae_skim Great work, Tae Soo. The methodology is super clean and the paper was an incredible read!

PhilippeLaban retweeted

Tae Soo Kim @tae_skim

20 days ago

Ask an LLM for a "post that'll pop off". Output misses so you say "more unique." It asks "unique how?" but you don't know yet. DiscoverLLM (ICML'26): trains LLMs to help users discover their intents, not just execute them. 📑 https://t.co/K9iaaJQzAV 🌐 https://t.co/7Khdsc7u97

PhilippeLaban retweeted

Joseph Jeesung Suh @JosephJSSuh

21 days ago

We use LLMs to role-play "users" to train, evaluate, and improve AI assistants. How do you know if your user simulator is any good? We argue: rather than measuring how realistic it sounds, start measuring how the assistants it trains perform with real humans. 🧵👇 https://t.co/K33f1Db0FD

PhilippeLaban retweeted

Caiming Xiong

@CaimingXiong

21 days ago

Today, we’re excited to launch Recursive (@recursive_si): an exceptional team across London and San Francisco, building AI systems that can safely improve their own capabilities over time.

PhilippeLaban retweeted

Serina Chang @serinachang5

21 days ago

User simulators have emerged as promising tools for building interactive AI, but what makes a “good” simulator?

We reframe the problem as what creates downstream value for humans

Our new simulator test: how an LLM assistant trained with the simulator performs with human users🧵 https://t.co/Nhf4Bz7U74

PhilippeLaban retweeted

Shuhaib Mehri

@shuhaibmehri

23 days ago

What happens when you compare the distributions of real and simulated user behaviors?

🔍 The gap is large.

We introduce a method to measure this gap and evaluate 24 LLM-based user simulators across coding and writing tasks.

@convai_uiuc @MSFTResearch @berkeley_ai
🧵 1/N https://t.co/HnipcqrYeJ

PhilippeLaban retweeted

Tal Schuster @TalSchuster

29 days ago

We've just released open source MTP style drafters for Gemma 4 models ⚡ Now Gemma 4 models are even faster on your choice of hardware, without losing quality! Grateful for the fruitful collaboration between my team, Gemma team, and many collaborators to enable this release!

Philippe Laban

@PhilippeLaban

about 1 month ago

Hey @xtuffai, great question!
The main experiment is indeed "first party" as you mention.

But in the paper, we also implement a simple agentic harness (with Python code execution that can write files as a tool), and find that the four LLMs we test with tools (i.e., agentic) perform worse than without tools (non-agentic).

We explain those results and why the agentic setup doesn't just resolve the issue in Section 4.2 of the paper. Would love to know what you think!

PhilippeLaban retweeted

Rohan Paul

@rohanpaul_ai

about 1 month ago

New Microsoft paper shows that current AI assistants often damage documents during long editing jobs.

Even the frontier models still ended up corrupting about 25% of document content on average, while many other models damaged far more.

The problem is that delegated AI work only makes sense if a model can keep a document correct across many edits, not just do 1 step well.

The paper tests this with reversible task pairs, where a model edits a file and then tries to undo that edit, so a reliable system should return to the original document.

The authors built real work setups across 52 domains, from coding and science to accounting and music notation, and ran 19 models through 20 editing interactions.

The failures were usually not lots of tiny slips but occasional big mistakes that silently broke parts of the document and then compounded over time.

Agentic tool use did not help in their tests, and bigger files, longer workflows, and irrelevant extra documents made the corruption worse.

The reason this matters is that current LLMs can look strong in short demos or narrow coding tasks yet still be unreliable delegates for long real-world document work.

----

Paper Link – arxiv. org/abs/2604.15597

Paper Title: "LLMs Corrupt Your Documents When You Delegate"

PhilippeLaban retweeted

Ming Li @ UMD PhD

@Ming_Liiii

about 1 month ago

Excited to share our ACL 2026 work, trying to solve the issue raised by the ICLR Outstanding Paper “LLMs Get Lost In Multi-Turn Conversation”!

Our RLAAR (https://t.co/CVUOavVtq7) is an RL framework that trains LLMs to both answer correctly and wait when context is insufficient, using verifiable accuracy and abstention rewards.

This tackles a key weakness in today’s conversational LLMs: they often answer too early, make wrong assumptions, and struggle to recover as conversations unfold.

We’re also excited to see this challenge highlighted by “LLMs Get Lost In Multi-Turn Conversation” (https://t.co/tISe06KGXW) being recognized as an ICLR 2026 Outstanding Paper.

Reliable conversational AI needs to know when to answer, and when to hold back.

#ACL2026 #ICLR2026 #LLM #RLVR #ConversationalAI

PhilippeLaban retweeted

Joachim Baumann @ ICLR'26

@joabaum

about 1 month ago

We present SWE-chat: the first large-scale dataset of coding agent interactions from real users in the wild.

In 40% of real coding sessions, the agent writes ~all the code. Users push back 39% of the time – agents almost never stop to check.

Data, paper, & findings in the 🧵👇

PhilippeLaban retweeted

Eunsol Choi

@eunsolc

about 1 month ago

We study sampling diverse output from a suite of LLMs. One key surprise for me was that it's better to carefully pick a single model to sample many times, rather than naively mixing outputs from multiple models.

PhilippeLaban retweeted

Yuhan Liu @YuhanLiu_nlp

about 1 month ago

Can LLMs generate diverse outputs for open-ended questions? Is it helpful if we ensemble outputs from multiple models? We study 18 LLMs on 4 datasets and find that no single model is best at generating diverse outputs 👇/ 🧵 https://t.co/5GRrRE13fg
