1/ New from @ScaleAILabs: Rubrics (a.k.a. checklists) have become the default reward interface for RL on open-ended tasks without final verifiable answers.
But most rubric RL still relies on static aggregation: fixed human weights over criteria, summed into one scalar reward.
We show that this conflates what should matter in the final answer with what can actually teach the current policy.
https://t.co/H5wTQ27ulb
I don’t think I really recognized how much of that teenage girl living inside of me still held on to one direction as a safety and a comfort until now .