Hung-Ting Chen

@hungting_chen

PhD student at @nyuniversity, working on NLP.

Joined May 2021

436 Following

299 Followers

127 Posts

hungting_chen retweeted

Vishakh Padmakumar

@vishakh_pk

21 days ago

People are increasingly worried that AI tools make us overreliant. But how do we actually measure this? We introduce Offloading Score, a measure of reliance based on the fraction of cognitive effort offloaded to AI while completing a task. In a controlled user study, Offloading Score detects increased reliance under time pressure, while several common alternatives do not. (1/9)

213

77K

hungting_chen retweeted

Nicholas Tomlin @NickATomlin

28 days ago

New paper! LLM memory keeps improving, but this makes them *worse* as user sims. If we want to build models that can, e.g., simulate realistic students to train chatbots to be better teachers, then these models need to be able to forget like humans do 📄: https://t.co/1GpOfwcsat

459

321

46K

hungting_chen retweeted

Nishant Balepur @NishantBalepur

28 days ago

New paper! Have you or a loved one been harmed by a bad multiple-choice benchmark? 😔 You may be entitled to a more reliable evaluation 🩺 At #ACL2026, we'll present BenchMarker: a toolkit to diagnose common flaws in MCQA benchmarks, inspired by best practices in education 🧑‍🏫🧵

hungting_chen retweeted

Nishant Balepur @NishantBalepur

about 1 month ago

🚨 New Paper! 🚨 One of my first Ph.D. papers found that LLMs can answer multiple-choice questions without seeing the question 🤔 At #ACL2026, I'm presenting a follow-up showing that current reasoning LLMs can still do this! And quite similarly to a clever test-taker 🧑‍🎓🧵

110

828

hungting_chen retweeted

Nishant Balepur @NishantBalepur

about 1 month ago

MyScholarQA is live! If you want a deep research system that actually knows about your work, check it out 👇 https://t.co/yAUelrELfw

hungting_chen retweeted

Jocelyn Qiaochu Chen

@jocelynqchen

about 1 month ago

🚀 New paper: FlexSQL — a Text-to-SQL agent that lets gpt-oss reach 65.4% on Spider2, outperforming agents built with large models like DeepSeek-R1 and o3. The key: just let the agent explore flexibly. Don’t collapse a complex query into one schema, one interpretation, and one SQL program too early. FlexSQL keeps multiple grounded solution paths alive, then executes them with the tools/language each path needs. 🧵

hungting_chen retweeted

Yuhan Liu @YuhanLiu_nlp

about 2 months ago

Can LLMs generate diverse outputs for open-ended questions? Is it helpful if we ensemble outputs from multiple models? We study 18 LLMs on 4 datasets and find that no single model is best at generating diverse outputs 👇/ 🧵

176

116

24K

hungting_chen retweeted

Yoonsang Lee @ ICML 26 @yoonsang_

2 months ago

How should we effectively aggregate long-horizon agent trajectories? 🧐 Unlike CoT reasoning, agentic tasks pose unique challenges: they are long, multi-turn, and tool-augmented. Introducing 👉🏻 AggAgent 👈🏻 — which treats parallel trajectories as an environment to interact with.

261

191

28K

hungting_chen retweeted

Nishant Balepur @NishantBalepur

2 months ago

Excited to share MyScholarQA - a personalized deep research tool that learns from your papers and lets you customize reports! 🧑‍🔬🖌️ Our #ACL2026 paper built and evaluated it, showing simulated users (LLMs) couldn't mimic what real users wanted 🙅 Spicy results + a live demo 👇🧵

10K

hungting_chen retweeted

Ayush Jhaveri @arhjhaveri

3 months ago

Your AI Agent just formed a hypothesis. 💭 How does it validate it? Not by trying to prove itself wrong. Rather, it selectively seeks evidence that confirms what it already believes, often ending up with the wrong answer! Confirmation bias isn’t just human. We measure it in LLMs, and we show how to fix it! 🧵

Hung-Ting Chen @hungting_chen

3 months ago

@NishantBalepur @rachelrudinger @boydgraber @eunsolc @arnaik19 @vipul_1011 @StevenJMoore Wow congrats! Looking forward to those 🚨posts

229

hungting_chen retweeted

Zayne Sprague

@ZayneSprague

3 months ago

https://t.co/Zyo6d1sGmL

133

11K

hungting_chen retweeted

Manya Wadhwa @ManyaWadhwa1

3 months ago

⚛️ Introducing CREATE, a benchmark for creative associative reasoning in LLMs. Making novel, meaningful connections is key for scientific & creative works. We objectively measure how well LLMs can do this. 🧵👇

144

22K

hungting_chen retweeted

Fangyuan Xu @brunchavecmoi

4 months ago

A lot of useful training data can't be shared due to privacy. How do we create synthetic training data without data owners ever sharing their content? 🚀 Introducing 𝐃𝐏-𝐑𝐅𝐓: using RL to train LLMs to generate high-fidelity domain data without seeing a single private sample.

133

11K

hungting_chen retweeted

Sumit @_reachsumit

4 months ago

RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering @denq1an et al. use an LLM verifier to identify high-quality documents and condition subsequent retrieval on verified results to maximize answer coverage. 📝 https://t.co/WtoqzHQqLJ 👨🏽‍💻 https://t.co/mg4b7UXDZg

698

Hung-Ting Chen @hungting_chen

4 months ago

It was nice collaborating with @denq1an on this project! Comprehensively retrieving all answers is hard for agentic approaches that are optimized to cover a single answer. We propose an iterative framework that uncovers new answers based on previously retrieved information.

Deniz Qian @denq1an

4 months ago

🚨NEW PAPER🚨 How can we comprehensively retrieve all relevant docs for multi-answer QA? Agentic search doesn't help. Introducing RVR, an iterative framework that conditions on prior docs to maximize answer coverage. 📈10% answer recall gain on QAMPARI w/@hungting_chen @eunsolc

417

hungting_chen retweeted

Wenxuan Ding @Wenxuan_Ding_

4 months ago

Agents interact with environments to gather information. But exploration can be expensive. Tool use, retrieval, and user interaction carry latency or monetary cost. Calibrate-Then-Act allows LLM agents to balance exploration with cost: 📐 Estimate uncertainty about the environment 💭 Reason about cost-uncertainty tradeoffs ⚙️ Act accordingly

119

12K

hungting_chen retweeted

Zayne Sprague

@ZayneSprague

7 months ago

RL amplifies existing behaviors. Let’s prime models w/ good behaviors for better RL! Introducing SkillFactory: ✂️Rearrange model traces on a problem to demo verification + retry ⚙️SFT on those traces 🦾RL Result: Learn robust explicit verification + retry across domains 🧵

21K

Hung-Ting Chen @hungting_chen

6 months ago

@YungSungChuang @yoonrkim @jacobandreas Yung-Sung my hero

hungting_chen retweeted

Amanda Bertsch @abertsch72

8 months ago

Can LLMs accurately aggregate information over long, information-dense texts? Not yet… We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K on Oolong!

358

219

81K
