😢RLVR is powerful but expensive
🤯Imagine using <20% of the RLVR training while still reaching 100% of the performance!
Sounds surprising? We show that minimal RLVR training is enough to know where training is going, and predict future ckpts at no training cost!
📃https://t.co/fGODWWIjR1
🧵[1/n]
🎉 Honored to receive the @CapitalOne PhD Fellowship!
Many thanks to my advisor @yumeng0818 and my collaborators for their guidance and support throughout my PhD journey at @CS_UVA @UVAEngineers! 💙🧡
Excited to continue building more capable, reliable, and efficient AI systems!
https://t.co/LJzPFsFiz5
The paper and accompanying artifacts are now released — including 500+ RLVR checkpoints for studying training dynamics and extrapolation! 🥳🥳
📚 Paper: https://t.co/olkSYHFAHb
📝 Blog: https://t.co/H9xWxD6dlZ
💻 Code: https://t.co/0ZF1WBlfAr
🤗 Checkpoints: https://t.co/Uj4OrbpoQl
Can process reward models know when NOT to trust themselves? 🤔
We introduce BetaPRM: a distributional PRM that predicts both step-level success probability and the reliability of that prediction.
Instead of only asking “how good is this step?”, BetaPRM also asks:
“how confident am I?” 🔍
Not yet. The current evaluation focuses on math tasks since the RLVR training domain is solely math. It would definitely be interesting to see whether the extrapolated checkpoints generalize better (or regress less) than fully RLVR-tuned models on other domains.
What non-math tasks would you suggest we evaluate on?😃
📢 Takeaway: You only need minimal RLVR training to know where the model is heading.
Observe the early training dynamics, then go extrapolate future checkpoints at no training cost!
Blogpost👇
https://t.co/UWUJ1CB8tR
🧵[10/n]