New preprint📣
Typical reinforcement learning policy gradient algorithms target the mean reward E[R], while deployment often cares about other properties of the reward distribution: pass@k, max@k, tail risk like CVaR, robust metrics like medians, etc.
We introduce OrderGrad, a method that can flexibly optimize any of these targets via a one-line reward transformation. Everything else in your code can remain unchanged, whether you use GRPO, PPO, or REINFORCE.
Arxiv: https://t.co/aOdAMI5J8S
Code: https://t.co/nDssyjRl5D
🥇🥈🥉OrderGrad is based on order-statistic estimation. Specifically, consider a batch of K sampled rewards and sort them:
R_(1:K) ≤ R_(2:K) ≤ … ≤ R_(K:K)
Now apply weights a_i and take the expected value at each rank:
Sum_i a_i * E[R_(i:K)]
This allows flexibly defining different objectives that target different regions of the reward distribution. Notably, putting all of the weight on the top rank becomes Pass@K / Max@K, but our approach generalizes this to arbitrary ranks. You can target TopM@K, Medians, CVaR, Winsorized means, or any other weighting of your choosing.
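A minimal sketch of what such rank weights could look like on a single batch, assuming NumPy. The names `rank_weights` and `weighted_order_stat` are illustrative and not taken from the OrderGrad code, and the CVaR variant assumes the lower-tail convention (mean of the worst rewards):

```python
import numpy as np

def rank_weights(kind, K, m=1, alpha=0.25):
    """Weights a_1..a_K over ascending ranks (index 0 = smallest reward)."""
    a = np.zeros(K)
    if kind == "max":        # Max@K / Pass@K: all weight on the top rank
        a[-1] = 1.0
    elif kind == "top_m":    # mean of the m largest rewards (m >= 1)
        a[-m:] = 1.0 / m
    elif kind == "median":   # middle rank; two middle ranks share the weight if K is even
        a[(K - 1) // 2] += 0.5
        a[K // 2] += 0.5
    elif kind == "cvar":     # assumed lower-tail CVaR: mean of the worst ceil(alpha*K) rewards
        n = max(1, int(np.ceil(alpha * K)))
        a[:n] = 1.0 / n
    else:
        raise ValueError(f"unknown objective: {kind}")
    return a

def weighted_order_stat(rewards, a):
    """Single-batch value of sum_i a_i * R_(i:K)."""
    return float(np.sort(rewards) @ a)

rewards = np.array([0.2, 0.9, 0.1, 0.5, 0.7])
for kind in ["max", "top_m", "median", "cvar"]:
    a = rank_weights(kind, K=len(rewards), m=2)
    print(kind, a, weighted_order_stat(rewards, a))
```

Averaging this single-batch value over many batches approximates Sum_i a_i * E[R_(i:K)]; only the weight vector changes between objectives.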
The order statistics connect back to the original distribution in the sense that the j-th order statistic corresponds roughly to the j/(K+1) quantile of the reward distribution (see the right figure). As K becomes large, the order statistics trace out the quantile function (the inverse CDF), so putting weights on the order statistics is essentially equivalent to weighting different regions of the reward distribution.
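A quick numerical check of the j/(K+1) rule, as a sketch assuming rewards that are uniform on [0, 1] (the one case where the expected j-th order statistic equals j/(K+1) exactly; for other distributions the correspondence is only approximate). The setup and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
K, trials = 9, 20000

# Each row is one batch of K uniform rewards, sorted ascending.
sorted_batches = np.sort(rng.uniform(size=(trials, K)), axis=1)

empirical = sorted_batches.mean(axis=0)    # Monte Carlo estimate of E[R_(j:K)]
expected = np.arange(1, K + 1) / (K + 1)   # j / (K + 1) for j = 1..K

print(np.round(empirical, 3))
print(expected)
```

The two printed rows should agree closely, with each rank j landing near its j/(K+1) quantile.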
Our main contribution is an unbiased gradient estimator for the weighted order-statistic objective when the batch size is N and the subset size for ranking is K. Increasing K improves the CDF approximation, but also increases variance (a classical bias-variance tradeoff). We give the estimator in both REINFORCE (policy gradient) form and reparameterized (backpropagation) form. Computation time is negligible (<1ms).
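To illustrate the N versus K setup, here is a sketch of the classical unbiased estimate of the objective value (not its gradient) from N samples. It averages Sum_i a_i * R_(i:K) over all K-subsets in closed form, using the standard combinatorial fact that the j-th smallest of N is the i-th smallest of a random K-subset with probability C(j-1, i-1) * C(N-j, K-i) / C(N, K). This is not the gradient estimator from the preprint, and the function names are hypothetical:

```python
from itertools import combinations
from math import comb

import numpy as np

def subset_rank_weights(N, a):
    """Weights w_j such that sum_j w_j * R_(j:N) equals the average of
    sum_i a_i * R_(i:K) over all K-subsets of the N samples."""
    K = len(a)
    w = np.zeros(N)
    for j in range(1, N + 1):        # rank within the full batch of N
        for i in range(1, K + 1):    # rank within a K-subset
            # math.comb returns 0 when the lower index exceeds the upper one,
            # so impossible (i, j) pairs contribute nothing.
            w[j - 1] += a[i - 1] * comb(j - 1, i - 1) * comb(N - j, K - i) / comb(N, K)
    return w

def estimate_objective(rewards, a):
    """Closed-form average over all K-subsets, where K = len(a)."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sort(rewards) @ subset_rank_weights(len(rewards), a))

def brute_force(rewards, a):
    """Same quantity by explicit enumeration of every K-subset (for checking only)."""
    vals = [np.sort(s) @ np.asarray(a) for s in combinations(rewards, len(a))]
    return float(np.mean(vals))

rewards = [0.3, 1.0, 0.0, 0.6, 0.8, 0.1]   # N = 6 sampled rewards
a = [0.0, 0.0, 1.0]                        # Max@3: all weight on the top rank
print(estimate_objective(rewards, a), brute_force(rewards, a))
```

With all weight on the top rank and binary rewards, this reduces to the familiar combinatorial pass@k estimate.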
I still want to improve the preprint, so comments and suggestions are very welcome. The code is available so please try it out! 🙏
Many thanks to my collaborators:
Paavo Parmas
Yongmin Kim
Kohsei Matsutani
Shota Takashiro
Soichiro Nishimori
Takeshi Kojima
Yusuke Iwasawa
Yutaka Matsuo