Excited to be at #NeurIPS through Dec 8, happy to connect!
I'll be presenting our Spotlight paper on complex QA and reasoning with search:
🗓️ Dec 5, 11:00-2:00pm PST
📍 Exhibit C/D/E, Poster #1908
Also exploring full-time opportunities. DMs open if you'd like to chat!
4. TIR-Judge
Google and collaborators introduce TIR-Judge, an end-to-end reinforcement learning framework that trains LLM judges to integrate code execution for precise evaluation.
https://t.co/ePriZUpaN4
Happy to introduce my internship work at @Google and @GoogleDeepMind, collab w/ @googlecloud.
We introduce TIR-Judge, an end-to-end agentic RL framework that trains LLM judges with tool-integrated reasoning 🧠🛠️
https://t.co/rtfqlvuzJ0
#Agents #LLMs #Judges #RL #reasoning
@alpniks @Google @GoogleDeepMind @googlecloud Thanks! TIR-Judge is particularly effective for tasks that involve symbolic reasoning or calculation. For non-verifiable domains, we've also introduced a rubric-based framework in a recent paper to address evaluation in those cases: https://t.co/Ww3VqfL0EY
New Google paper trains LLM judges to use small bits of code alongside reasoning, so their decisions become precise.
So judging stops being guesswork and becomes checkable.
Text-only judges often miscount, miss structure rules, or accept shaky logic that a simple program would catch.
TIR-Judge makes the judge think step by step, write code to check claims, run it in a sandbox, then update the verdict.
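To make that concrete, here is a minimal sketch of the kind of check a judge could write and run instead of counting by eye. The instruction being checked ("exactly 3 bullets, each at most 10 words") and the name `check_bullets` are made up for illustration; they are not taken from the paper.

```python
def check_bullets(response: str, n_bullets: int = 3, max_words: int = 10) -> bool:
    """Return True if the response has exactly n_bullets '- ' lines,
    each containing at most max_words words."""
    # Keep only lines that start with a bullet marker, dropping the marker itself.
    bullets = [line[2:] for line in response.splitlines() if line.startswith("- ")]
    if len(bullets) != n_bullets:
        return False
    return all(len(b.split()) <= max_words for b in bullets)


# Two hypothetical candidate responses to the same instruction.
response_a = "- fast\n- cheap\n- reliable"
response_b = "- fast\n- cheap"

# The verdict follows from the executed check rather than from a guess.
verdict = "A" if check_bullets(response_a) and not check_bullets(response_b) else "undecided"
```

A real judge would run code like this in a sandbox and read the result back into its reasoning before committing to a verdict.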
Training mixes tasks where code can verify answers and tasks where it cannot, so the judge learns when to call tools and when to rely on reasoning.
One prompt schema covers pointwise scoring, pairwise choices, and listwise selection, so it plugs into many workflows.
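One way to picture a single schema for all three formats is a prompt builder that switches on the number of candidates. This is only a sketch under assumptions: the function name `build_judge_prompt`, the task wording, and the letter labels are invented here and may differ from the schema the paper actually uses.

```python
def build_judge_prompt(question: str, responses: list[str]) -> str:
    """Build one judge prompt; the task line depends on how many responses there are."""
    if not responses:
        raise ValueError("need at least one response")
    if len(responses) == 1:
        task = "Score the response from 1 to 10."              # pointwise
    elif len(responses) == 2:
        task = "Choose the better response: A or B."           # pairwise
    else:
        task = "Choose the single best response by its letter."  # listwise
    # Label candidates [A], [B], [C], ... in order.
    labeled = "\n".join(
        f"[{chr(ord('A') + i)}] {r}" for i, r in enumerate(responses)
    )
    return f"Question: {question}\n{labeled}\nTask: {task}"
```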
Reinforcement learning rewards being correct, following strict output tags, and using at most 3 tool calls.
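A rough sketch of how those three signals could be folded into one reward. The thread does not say how they are combined, so the all-or-nothing rule below is an assumption, as are the `<answer>` tag name and the function name `judge_reward`; only the limit of 3 tool calls comes from the text.

```python
import re

MAX_TOOL_CALLS = 3  # from the thread: at most 3 tool calls


def judge_reward(output: str, gold: str, tool_calls: int) -> float:
    """Return 1.0 only if the output is well-formed, correct, and within the tool budget."""
    # Assumed format rule: exactly one <answer>...</answer> block (hypothetical tag).
    matches = re.findall(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    format_ok = len(matches) == 1
    # Correctness is only defined when the answer can be parsed.
    correct = format_ok and matches[0].strip() == gold
    tools_ok = tool_calls <= MAX_TOOL_CALLS
    # Assumed combination: all three conditions must hold.
    return 1.0 if (correct and tools_ok) else 0.0
```

For example, `judge_reward("checked with code <answer>A</answer>", "A", 2)` gives 1.0, while the same output with 4 tool calls gives 0.0.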
A variant called TIR-Judge-Zero skips teacher distillation and still improves by alternating reinforcement learning, rejection sampling, and supervised fine-tuning.
Across public judge benchmarks it beats text-only judges, and at 8B parameters it reaches 96% of Claude Opus 4's performance on listwise ranking.
The core idea: give the judge verifiable checks plus rewards that favor careful tool use.
----
Paper: arxiv.org/abs/2510.23038
Paper Title: "Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning"
@Google @GoogleDeepMind @googlecloud 6/n
Best-of-N on Policy Models
TIR-Judge is not only a better judge, it also makes other models better.
When selecting responses in best-of-N inference, TIR-Judge improves policy accuracy by 3.9% to 6.7% on AIME, BigCodeBench, IFEval, etc.
→ Better downstream reasoning too 🎯
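For readers new to best-of-N: the policy model samples several candidates and a judge picks one. A minimal sketch, where `toy_judge` is a stand-in that recomputes a product and is in no way the actual TIR-Judge; the question (17 * 23) and the sample answers are invented.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest judge score (first one wins ties)."""
    return max(candidates, key=score)


def toy_judge(candidate: str) -> float:
    # Stand-in for a tool-using judge: recompute the answer and compare.
    try:
        return 1.0 if int(candidate) == 17 * 23 else 0.0
    except ValueError:
        # Answers that are not plain integers cannot be verified here.
        return 0.0


# Hypothetical samples from a policy model asked for 17 * 23.
samples = ["381", "391", "about 400"]
picked = best_of_n(samples, toy_judge)
```

The better the judge separates right from wrong candidates, the more often the picked answer is correct, which is where the downstream gain comes from.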
6/n Takeaways:
- With self-play frameworks, smaller LLMs can rival giant proprietary models
- We can draw on existing reasoning datasets to improve search in LLMs and couple search and reasoning more tightly
- Great potential in domains such as finance, health, and science