What will the learning environments that train artificial superintelligence look like? In recent work at @scale_AI, we show that training systems that combine verifiable rewards with multi-agent interaction accelerate learning.
Great to see our work, Rubrics as Rewards, featured in the latest RLHF Book update.
Rubric-based RLVR is emerging as a practical tool for modern training and evaluation. See §13.4 at https://t.co/uuhUIUBvNE.
New @scale_AI paper: ResearchRubrics, a benchmark for evaluating Deep Research (DR) agents. Even top agents like Gemini & OpenAI DR achieve <68% rubric compliance. We built 2.5K+ expert rubrics with 2.8K+ hrs of human labor to measure why.
Introducing SWE-Bench Pro, a new benchmark to evaluate LLM coding agents on real, enterprise-grade software engineering tasks.
This is the next step beyond SWE-Bench: harder, contamination-resistant, and closer to real-world repos.
How do we train LLMs on real-world tasks where it's hard to define a single verifiable answer?
Our work at @scale_AI introduces Rubrics as Rewards (RaR), a framework for on-policy post-training that uses structured, checklist-style rubrics as interpretable reward signals. 🧵
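To make the idea of a checklist-style rubric as a reward concrete, here is a minimal sketch. It assumes a simple weighted-sum aggregation; the `Criterion` class, the `rubric_reward` function, the example criteria, and their weights are all hypothetical and are not taken from the paper, which may aggregate rubric items differently.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One checklist item. Text and weight are made up for illustration."""
    description: str
    weight: float


def rubric_reward(criteria, satisfied):
    """Return the weighted fraction of rubric criteria a response satisfies.

    `satisfied` holds one boolean per criterion, for example verdicts from
    a judge model (not shown here). The result lies in [0, 1].
    """
    if len(criteria) != len(satisfied):
        raise ValueError("need exactly one verdict per criterion")
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c.weight for c, ok in zip(criteria, satisfied) if ok)
    return earned / total


# Hypothetical rubric for a single prompt.
rubric = [
    Criterion("States the correct diagnosis", 3.0),
    Criterion("Mentions relevant contraindications", 2.0),
    Criterion("Uses plain, non-alarming language", 1.0),
]

# The response meets the first and third items: (3 + 1) / 6.
reward = rubric_reward(rubric, [True, False, True])
```

Because each criterion is scored separately, the per-item verdicts show why a response earned its reward.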
@karpathy a neat quality specific to language models is that you can just tell them what to do differently when they fail. And if you use importance sampling, gradients are aligned with the unguided context and it gets into the weights directly. No sleep needed
https://t.co/qJ2Qv43rYp
For online RL, we introduce Guide, a class of algorithms that incorporate guidance into the model's context when all rollouts fail and adjust the importance sampling ratio in order to optimize the policy for contexts in which guidance is no longer present.
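The importance sampling adjustment can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's exact objective: it assumes the rollout was sampled with guidance in the context, that the same tokens are re-scored without guidance, and that a PPO-style clipped surrogate is used. The function names, the log-probabilities, and `eps` are all hypothetical.

```python
import math


def guided_importance_ratios(logp_unguided, logp_guided_behavior):
    """Per-token ratio pi_theta(token | prompt) / pi_old(token | prompt + guidance).

    The numerator scores the sampled tokens WITHOUT guidance in the context,
    so the update targets the policy's behavior when guidance is absent.
    """
    if len(logp_unguided) != len(logp_guided_behavior):
        raise ValueError("log-prob sequences must have the same length")
    return [math.exp(a - b) for a, b in zip(logp_unguided, logp_guided_behavior)]


def clipped_surrogate(ratios, advantage, eps=0.2):
    """PPO-style clipped objective averaged over tokens (an assumed choice)."""
    if not ratios:
        raise ValueError("need at least one token")
    terms = []
    for r in ratios:
        clipped = max(1.0 - eps, min(1.0 + eps, r))
        terms.append(min(r * advantage, clipped * advantage))
    return sum(terms) / len(terms)


# Made-up log-probs for a three-token rollout sampled with guidance.
logp_guided = [-0.9, -0.7, -1.1]    # behavior policy, guidance in context
logp_unguided = [-1.2, -0.7, -2.0]  # current policy, guidance removed

ratios = guided_importance_ratios(logp_unguided, logp_guided)
objective = clipped_surrogate(ratios, advantage=1.0)
```

Tokens that were only likely because of the guidance get ratios below 1, which reduces their contribution to the objective.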
New @Scale_AI paper!
LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT), which teaches models to say when they're reward hacking, dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).
We're entering a new era in robotics where generalized systems are starting to work in the real world, but researchers still don't have good tools for understanding their data. That's why I built ARES, an open-source platform for ingesting, annotating, and curating robotics data.