Ran Xu @ritaranx - Twitter Profile

Pinned Tweet

8 months ago

🚨 Happy to share AceSearcher accepted to #NeurIPS2025 #Spotlight! 🔹 One LLM, two roles: Decomposer (split queries) + Solver (combine context) 🔹 +7.6% on QA & fact verification 🔹 32B ≈ DeepSeek-V3 on DocMath 📂 Code: https://t.co/lQU12Dm7vb 📑 arXiv: https://t.co/JI0kOh0yDk

Ran Xu @ritaranx

6 months ago

Excited to be at #NeurIPS through Dec 8 — happy to connect! I’ll be presenting our Spotlight paper on complex QA and reasoning with search: 🗓️ Dec 5, 11:00–2:00pm PST 📍 Exhibit C/D/E — Poster #1908 Also exploring full-time opportunities—DMs open if you’d like to chat!

Ran Xu @ritaranx

7 months ago

Thanks for featuring our work! 🙌

DAIR.AI

@dair_ai

7 months ago

4. TIR-Judge Google and collaborators introduce TIR-Judge, an end-to-end reinforcement learning framework that trains LLM judges to integrate code execution for precise evaluation. https://t.co/ePriZUpaN4

Ran Xu @ritaranx

7 months ago

@curlyhacks1 @Google @GoogleDeepMind @googlecloud Yes! Our framework natively supports multi-turn tool calling.

Ran Xu @ritaranx

7 months ago

Happy to introduce my internship work at @Google and @GoogleDeepMind, collab w/ @googlecloud. We introduce TIR-Judge, an end-to-end agentic RL framework that trains LLM judges with tool-integrated reasoning 🧠🛠️ 🔗https://t.co/rtfqlvuzJ0 #Agents #LLMs #Judges #RL #reasoning

Ran Xu @ritaranx

7 months ago

@alpniks @Google @GoogleDeepMind @googlecloud Thanks! TIR-Judge is particularly effective for tasks that involve symbolic reasoning or calculation. For non-verifiable domains, we’ve also introduced a rubric-based framework in a recent paper to address evaluation in those cases: https://t.co/Ww3VqfL0EY

Ran Xu @ritaranx

7 months ago

Thanks for sharing our work on improving LLM judges with agentic RL!

Rohan Paul

@rohanpaul_ai

7 months ago

New Google paper trains LLM judges to use small bits of code alongside reasoning, so their decisions become precise.

So judging stops being guesswork and becomes checkable.

Text-only judges often miscount, miss structure rules, or accept shaky logic that a simple program would catch.

TIR-Judge makes the judge think step by step, write code to check claims, run it in a sandbox, then update the verdict.

Training mixes tasks where code can verify answers and tasks where it cannot, so the judge learns when to call tools and when to rely on reasoning.

One prompt schema covers pointwise scoring, pairwise choices, and listwise selection, so it plugs into many workflows.

Reinforcement learning rewards being correct, following strict output tags, and using at most 3 tool calls.

A variant called TIR-Judge-Zero skips teacher distillation and still improves by alternating reinforcement learning, rejection sampling, and supervised fine-tuning.

Across public judge benchmarks it beats text-only judges, and with 8B it reaches 96% of Claude Opus 4 on listwise ranking.

The core idea: give the judge verifiable checks plus rewards that favor careful tool use.

----

Paper: arxiv.org/abs/2510.23038

Paper Title: "Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning"
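The reward described in the tweet above (correct verdict, strict output tags, at most 3 tool calls) could be sketched roughly as follows. This is only an illustration, not the paper's implementation: the `<verdict>` tag name, the `judge_reward` function, and the all-or-nothing way the three conditions are combined are all assumptions made here for the example.

```python
import re

MAX_TOOL_CALLS = 3  # the cap stated in the tweet

# Hypothetical output format: free-form reasoning followed by exactly one
# <verdict>...</verdict> tag at the end. The real tag schema is not given here.
VERDICT_RE = re.compile(r"<verdict>(.*?)</verdict>\s*$", re.DOTALL)


def judge_reward(output: str, gold: str, num_tool_calls: int) -> float:
    """Return 1.0 only if format, tool budget, and correctness all hold.

    The gating (all three conditions required) is an assumption of this sketch.
    """
    match = VERDICT_RE.search(output)
    # Format check: a single verdict tag that closes the output.
    format_ok = match is not None and output.count("<verdict>") == 1
    # Tool budget check: no more than MAX_TOOL_CALLS code executions.
    tools_ok = num_tool_calls <= MAX_TOOL_CALLS
    # Correctness check: the extracted verdict equals the gold label.
    correct = format_ok and match.group(1).strip() == gold
    return 1.0 if (correct and tools_ok) else 0.0
```

With this gating, a correct verdict that used four tool calls, or one that omitted the tag, would receive no reward, which is one simple way to favor careful tool use.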

Ran Xu @ritaranx

7 months ago

@Google @GoogleDeepMind @googlecloud 7/n Thanks to my collaborators: Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan @jun_yannn, Carl Yang @yangji9181, Hongkun Yu. And thanks for the useful discussions from: Jing Nathan Yan @NathanYan2012, Yuchen Zhuang @yuchen_zhuang, Zhengzhe Yang

Ran Xu @ritaranx

7 months ago

@Google @GoogleDeepMind @googlecloud 6/n 📊Best-of-N on Policy Models TIR-Judge is not only a better judge — it makes other models better. When selecting responses in best-of-N inference, TIR-Judge improves policy accuracy by +3.9~6.7% on AIME, BigCodeBench, IFEval, etc. → Better downstream reasoning too🎯
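Best-of-N selection with a judge, as mentioned in the tweet above, can be sketched in a few lines. This is a minimal illustration under stated assumptions: `best_of_n`, `judge_score`, and `toy_judge` are hypothetical names, and the toy judge stands in for a trained judge such as TIR-Judge, whose actual interface is not described here.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    judge_score: Callable[[str, str], float],
) -> str:
    """Return the candidate response the judge scores highest.

    `judge_score(prompt, response)` is a hypothetical pointwise judge;
    ties go to the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=lambda response: judge_score(prompt, response))


def toy_judge(prompt: str, response: str) -> float:
    """Stand-in judge for the example: rewards the exact answer '4'."""
    return 1.0 if response.strip() == "4" else 0.0


selected = best_of_n("What is 2 + 2?", ["5", "4", "3"], toy_judge)
```

In practice the N candidates would be sampled from the policy model, and the quality of the final answer depends on how reliably the judge ranks them.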

Ran Xu @ritaranx

8 months ago

n/n Thanks to our collaborators: Yuchen Zhuang @yuchen_zhuang, Zihan Dong @zhiiiiaaaa, Ruiyu Wang, Yue Yu @yue___yu, Joyce C. Ho @joycehoUT, Linjun Zhang @linjunz_stat, Haoyu Wang @haoyuwang0408, Wenqi Shi @WenqiShi0106, Carl Yang @yangji9181

Ran Xu @ritaranx

8 months ago

6/n Takeaways:
✅ With self-play frameworks, smaller LLMs can rival giant proprietary models
✅ Reasoning datasets can be reused to assist search in LLMs and to couple search and reasoning more tightly
✅ Great potential in domains such as finance, health, and science
