Wei-Lin Chen @weilin__chen - Twitter Profile

Pinned Tweet

4 months ago

🚀 New paper from my internship at @Google! LLMs can “think” for a long time only to get the answer wrong — more tokens do not always help and may be overthinking 😵‍💫 We introduce Deep-Thinking Ratio (DTR), a new way to measure LLM reasoning effort. The idea: Count the tokens models had to think deeply to produce. 🧵

18

621

71

408

46K

WeiLin__Chen retweeted

Zhepei Wei

@weizhepei

7 days ago

🎉 Honored to receive the @CapitalOne PhD Fellowship! Many thanks to my advisor @yumeng0818 and my collaborators for their guidance and support throughout my PhD journey at @CS_UVA @UVAEngineers! 💙🧡 Excited to continue building more capable, reliable, and efficient AI systems! https://t.co/LJzPFsFiz5

2

29

4

1

5K

WeiLin__Chen retweeted

Zhepei Wei

@weizhepei

about 1 month ago

😢RLVR is powerful but expensive 🤯Imagine using <20% RLVR training while achieving 100% performance? Sounds surprising? We show that minimal RLVR training is enough to know where training is going, and predict future ckpts at no training cost! 📃https://t.co/fGODWWIjR1 🧵[1/n]

7

249

45

217

49K

WeiLin__Chen retweeted

ChengSong Huang

@ChengsongH31219

about 1 month ago

"How do you self-improve a model on open-ended tasks where you can't take a majority vote?" I got asked this in nearly every research interview I did last year. None of my answers felt clean. So we built something that doesn't need a vote, a verifier, or a judge. Meet G-Zero. 👇 paper: https://t.co/TrvGb48W4d huggingface: https://t.co/8guc5xSh3i code: https://t.co/G8mMm2I9h1 All experiments are done via api by @thinkymachines (1/n)

6

239

45

258

15K

WeiLin__Chen retweeted

Zhepei Wei

@weizhepei

about 2 months ago

🎉TruthRL is accepted to #ICML2026! A simple ternary reward (correct: +1; abstention: 0; incorrect: −1) helps LLMs answer more accurately and know when not to answer, significantly reducing hallucinations! Paper + code 👇 📄 https://t.co/OXPYb08PJz 💻 https://t.co/bjySx2EA2u
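The ternary reward quoted in the tweet above can be sketched in a few lines. This is only an illustration of the stated scoring rule (correct: +1; abstention: 0; incorrect: −1), not the TruthRL implementation. The names `ternary_reward` and `mean_reward`, the use of `None` to mark an abstention, and the case-insensitive exact-match check are all assumptions made for the example.

```python
from typing import List, Optional


def ternary_reward(prediction: Optional[str], gold: str) -> int:
    """Score one answer: +1 if correct, 0 if the model abstained, -1 if wrong.

    Assumption: an abstention is represented by None, and correctness is a
    case-insensitive exact match. A real system would likely use a more
    careful answer checker.
    """
    if prediction is None:
        return 0
    return 1 if prediction.strip().lower() == gold.strip().lower() else -1


def mean_reward(predictions: List[Optional[str]], golds: List[str]) -> float:
    """Average ternary reward over a batch of (prediction, gold) pairs."""
    rewards = [ternary_reward(p, g) for p, g in zip(predictions, golds)]
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    preds = ["Paris", None, "Rome", "Paris"]
    golds = ["paris", "paris", "paris", "paris"]
    print(mean_reward(preds, golds))
```

Under this rule a wrong answer scores lower than an abstention, so declining to answer is the better choice whenever a guess is more likely to be wrong than right.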

2

30

7

2K

WeiLin__Chen retweeted

Tu Vu

@tuvllms

4 months ago

🚨 New paper 🚨 Excited to share our new work on EvoSkill (led by @salahalzubi401), a self-evolving framework that automatically discovers and refines agent skills through iterative failure analysis 🔁🧬. It achieves state-of-the-art performance on @databricks's OfficeQA. Check it out! 📰: https://t.co/jbBbQLUm51

6

122

37

43

17K

WeiLin__Chen retweeted

Tu Vu

@tuvllms

4 months ago

🚨 New paper 🚨 Excited to share PRISM, a new “DeepThink” method that uses step-level correctness signals from a process reward model to guide inference over candidate solutions. PRISM matches or beats SOTA methods, enabling gpt-oss-20b to exceed gpt-oss-120b.👇 📰: https://t.co/hdjXZ3PjJJ #AI #LLMs

3

137

32

95

14K

WeiLin__Chen retweeted

DAIR.AI

@dair_ai

4 months ago

https://t.co/tVmzv8v2mW

7

594

69

875

144K

WeiLin__Chen retweeted

Liqian Peng

@LiqianPeng

4 months ago

Excited to share this new work from @Google! 🚀 Our intern @WeiLin__Chen explored how to measure true LLM reasoning effort. We found that "longer" isn't always "smarter"—it's about the Deep-Thinking Ratio (DTR). 🧠📊 Check out the full paper below! 👇

0

5

1

2

398

Wei-Lin Chen

@WeiLin__Chen

4 months ago

@zhaochaocs @omarsar0 Also check out this related work! https://t.co/xuMkJyFvuW

Souradip Chakraborty

@SOURADIPCHAKR18

about 1 year ago

🔥 Does test-time scaling in #reasoningmodels via thinking more always help? 🚫 Answer is No - Performance increases first and then drops due to #Overthinking ❓Why is this behaviour and how to mitigate 🚀 Check our recent findings #LLMReasoning Link: https://t.co/V0IOoFqAgY

4

84

20

60

23K

2

4

1

2

157

Wei-Lin Chen

@WeiLin__Chen

4 months ago

@zhaochaocs Also thanks to @omarsar0 for covering our work --> https://t.co/E2uQ84Utwr

elvis

@omarsar0

4 months ago

New Google paper challenges how we measure LLM reasoning.

Token count is a poor proxy for actual reasoning quality.

There might be a better way to measure this.

This work introduces "deep-thinking tokens," a metric that identifies tokens where internal model predictions shift significantly across deeper layers before stabilizing.

These tokens capture "genuine reasoning" effort rather than verbose output.

Instead of measuring how much a model writes, measure how hard it's actually thinking at each step. Deep-thinking tokens are identified by tracking prediction instability across transformer layers during inference.

The ratio of deep-thinking tokens correlates more reliably with accuracy than token count or confidence metrics across mathematical and scientific benchmarks (AIME 24/25, HMMT 25, GPQA-diamond), tested on DeepSeek-R1, Qwen3, and GPT-OSS.

They also introduce Think@n, a test-time compute strategy that prioritizes samples with high deep-thinking ratios while early-rejecting low-quality partial outputs, reducing cost without sacrificing performance.

Why does it matter?

As inference-time scaling becomes a primary lever for improving model performance, we need better signals than token length to understand when a model is actually reasoning versus just rambling.

Paper: https://t.co/Yj0bPdiLni

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
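To make the idea in the summary above concrete, here is a rough sketch of how a deep-thinking ratio might be computed and then used to pick among samples. The tweets do not give the paper's exact definition, so everything specific here is an assumption: each token is taken to come with one probability distribution per layer, a token counts as "deep-thinking" when its per-layer prediction only settles near the final-layer prediction in the last part of the network, and the distance measure (Jensen-Shannon divergence), the `threshold`, the `depth_fraction`, and the function names are all hypothetical. The selection step uses a simple majority vote over the highest-ratio samples and does not model early rejection of partial outputs.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Dist = Sequence[float]  # a probability distribution over the vocabulary


def js_divergence(p: Dist, q: Dist) -> float:
    """Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a: Dist, b: Dist) -> float:
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def settling_layer(layer_dists: List[Dist], threshold: float) -> int:
    """Earliest layer from which every later layer stays close to the final one."""
    final = layer_dists[-1]
    settled = len(layer_dists) - 1
    for i in range(len(layer_dists) - 1, -1, -1):
        if js_divergence(layer_dists[i], final) > threshold:
            break
        settled = i
    return settled


def deep_thinking_ratio(token_layer_dists: List[List[Dist]],
                        threshold: float = 0.1,
                        depth_fraction: float = 0.75) -> float:
    """Fraction of tokens whose prediction settles only in the deepest layers.

    Assumption: "deep" means the settling layer lies at or beyond
    depth_fraction of the way through the layer stack.
    """
    if not token_layer_dists:
        return 0.0
    deep = 0
    for layer_dists in token_layer_dists:
        cutoff = depth_fraction * (len(layer_dists) - 1)
        if settling_layer(layer_dists, threshold) >= cutoff:
            deep += 1
    return deep / len(token_layer_dists)


def select_by_ratio(samples: List[Tuple[str, float]], keep: int) -> str:
    """Keep the `keep` samples with the highest ratio, then majority-vote."""
    kept = sorted(samples, key=lambda s: s[1], reverse=True)[:keep]
    return Counter(answer for answer, _ in kept).most_common(1)[0][0]


if __name__ == "__main__":
    easy = [[0.9, 0.1]] * 4                        # stable from the first layer
    hard = [[0.1, 0.9]] * 3 + [[0.9, 0.1]]         # flips at the last layer
    print(deep_thinking_ratio([easy, hard]))
    print(select_by_ratio([("42", 0.30), ("17", 0.05), ("42", 0.25),
                           ("17", 0.04), ("17", 0.06)], keep=2))
```

In the toy data, a plain majority vote over all five samples would pick "17", while voting over only the two highest-ratio samples picks "42", which is the kind of difference the ratio-based selection is meant to expose.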

46

585

114

571

67K

1

2

0

225

Wei-Lin Chen

@WeiLin__Chen

4 months ago

@SOURADIPCHAKR18 @omarsar0 @amritsinghbedi3 @furongh Awesome, thanks for sharing 😀 Gonna add to my thread!

1

4

1

64

WeiLin__Chen retweeted

Chao Zhao

@zhaochaocs

4 months ago

Reasoning effort ≠ just thinking tokens. Awesome work by @WeiLin__Chen to quantify the underlying dynamics of LLM reasoning 🚀🚀

0

4

1

617

WeiLin__Chen retweeted

Yu Meng

@yumeng0818

4 months ago

Is your reasoning LLM actually making progress or just wasting compute?🧐 Excited to share our new preprint led by @WeiLin__Chen! 🤩 We propose a new metric to measure how "deep" LLMs think at each token by identifying internal layer revisions. This correlates much better with accuracy than raw token counts and enables superior & efficient test-time scaling! 📈

0

31

4

15

5K
