Perhaps surprisingly, taking a KL estimate as a `kl_loss` to minimize does *not* enforce the KL.
This implementation, however, is quite common in open-source RL repos and recent research papers.
In short: the grad of an unbiased KL estimate is not an unbiased estimate of the KL grad.
Take the k3 estimate as an example (from John's popular blogpost https://t.co/7jMc9iDivF). Contrary to popular belief, differentiating the estimate as a loss does not enforce the reverse-KL, KL(pi || ref): with on-policy samples, its expected gradient is that of the forward KL, KL(ref || pi).
See more details: https://t.co/OAdzTqjsFT
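A minimal sketch of the point, under stated assumptions: a toy softmax policy `pi` over three actions, a fixed reference distribution `ref`, and exact expectations computed by enumeration instead of sampling. The names `theta`, `ref`, and `expected_grad` are made up for illustration and come from no repo. Here r = ref/pi, k1 = -log r, k3 = r - 1 - log r, and "as a loss" means the sample is held fixed and only the estimate is differentiated.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(pi, x):
    # d log pi(x) / d theta_j for a softmax policy: 1[x == j] - pi_j
    return [(1.0 if j == x else 0.0) - pi[j] for j in range(len(pi))]

def expected_grad(pi, weight):
    # Exact value of E_{x ~ pi}[ weight(x) * grad log pi(x) ]
    n = len(pi)
    out = [0.0] * n
    for x in range(n):
        g = grad_log_pi(pi, x)
        for j in range(n):
            out[j] += pi[x] * weight(x) * g[j]
    return out

theta = [0.5, -0.3, 0.1]   # hypothetical policy logits
ref = [0.2, 0.5, 0.3]      # hypothetical reference distribution
pi = softmax(theta)

# True gradient of KL(pi || ref): E_pi[ log(pi/ref) * grad log pi ]
true_reverse_kl_grad = expected_grad(pi, lambda x: math.log(pi[x] / ref[x]))

# k1 as a loss: grad k1(x) = grad log pi(x), which is zero in expectation
k1_loss_grad = expected_grad(pi, lambda x: 1.0)

# k3 as a loss: grad k3(x) = (1 - ref(x)/pi(x)) * grad log pi(x)
k3_loss_grad = expected_grad(pi, lambda x: 1.0 - ref[x] / pi[x])

# Gradient of KL(ref || pi) for a softmax policy: pi_j - ref_j
forward_kl_grad = [pi[j] - ref[j] for j in range(len(pi))]

print("true KL(pi||ref) grad:", true_reverse_kl_grad)
print("k1-as-loss grad      :", k1_loss_grad)
print("k3-as-loss grad      :", k3_loss_grad)
print("KL(ref||pi) grad     :", forward_kl_grad)
```

In this toy setting the k1 loss gives a zero expected gradient, and the k3 loss matches the KL(ref || pi) gradient, not the KL(pi || ref) one.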
It was refreshing to see the impact that small algorithmic changes have on system performance.
While the “double-sided” PPO/GRPO clipping is dominant in the literature, we argue that a single-sided clipping akin to IMPALA better fits the design of distributed training.
Introducing LlamaRL, a distributed RL framework for training LLMs at scale.
LlamaRL is highly modular, PyTorch-native, customizes optimization of actors/learners to max out throughput, and adjusts for systemic off-policyness to stabilize training.
https://t.co/oXjCEdh2lS
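A rough sketch of the double-sided vs. single-sided contrast mentioned above, written per sample on plain floats. This is not LlamaRL's code: the function names, `eps=0.2`, and `rho_bar=1.0` are illustrative assumptions. `ratio` stands for pi(a|s) / mu(a|s), with mu the behavior policy that generated the sample.

```python
def ppo_clipped_surrogate(ratio, advantage, eps=0.2):
    # Double-sided PPO-style objective: the ratio is clipped to [1-eps, 1+eps]
    # and the pessimistic (min) branch is taken.
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

def truncated_is_weight(ratio, rho_bar=1.0):
    # Single-sided, IMPALA-style truncation: the importance weight is only
    # capped from above. It would multiply advantage * grad log pi as a
    # constant (no gradient flows through the weight itself).
    return min(ratio, rho_bar)

for ratio, adv in [(3.0, 1.0), (0.5, -1.0), (0.5, 1.0)]:
    print(ratio, adv,
          ppo_clipped_surrogate(ratio, adv),
          truncated_is_weight(ratio))
```

With ratio 0.5 and a negative advantage, the PPO objective sits on its clipped branch and the sample contributes no gradient, whereas the truncated weight keeps the sample with weight 0.5.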
@c_voelcker @charlinelelan Many thanks @c_voelcker for the kind words! Very glad that our past investigation can be of help to your exciting new study here, looking forward to reading it in more detail!
Eventually, humans will need to supervise superhuman AI - but how? Can we study it now?
We don't have superhuman AI, but we do have LLMs. We study protocols where a weaker LLM uses stronger ones to find better answers than it knows itself.
Does this work? It’s complicated: 🧵👇
@PandaAshwinee Thanks! Sorry, I completely missed the reply here...
Indeed H2 is quite surprising. I think it's mainly bc contrastive losses don't work well w/ offline data. That is, pi(y_w) / pi(y_l) can increase while both pi(y_w) and pi(y_l) are low. If we change to Bo2, H2 is less prominent.
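A toy illustration of the ratio point, assuming a DPO-style loss as a stand-in for "contrastive losses" (the reply does not name a specific loss). All log-probabilities below are made-up numbers.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))

ref_logp_w, ref_logp_l = -10.0, -10.0

# Policy equal to the reference: margin is 0, loss is log 2
loss_before = dpo_loss(-10.0, -10.0, ref_logp_w, ref_logp_l)

# Both responses become LESS likely, but y_l drops faster than y_w
loss_after = dpo_loss(-15.0, -30.0, ref_logp_w, ref_logp_l)

print(loss_before, loss_after)
```

The loss goes down even though log pi(y_w) fell from -10 to -15: only the gap between the two log-probabilities enters the objective.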
Online interaction is probably a defining property of RL. But with the rise of offline algos, it is not clear if the “online” bit of RL is necessary for RLHF.
We hypothesis test the causes of the perf gap between online and offline alignment. https://t.co/fJ0731MHLF
Details in 🧵
Thanks @_akhaliq for promoting our work!
Unlike regular RL, where a golden r(s,a) is available and online is generally deemed better than offline, in RLHF this is less clear.
Complementary to some concurrent work, we investigate the causes of the perf gap between online and offline.
Understanding the performance gap between online and offline alignment algorithms
Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, rising popularity in offline alignment algorithms challenge the need
The findings ought to be taken with a grain of salt due to limitations in our experimental setups. But hopefully this investigation contributes to a better understanding of RLHF practices.
Finally, very grateful to my collaborators @GoogleDeepMind on this fun project!
Some takeaways:
- There is something more to online than wider coverage of response generation
- Offline training improves the policy in a much more implicit way than online (discriminative vs. generative abilities)
- The gap persists across a wide range of algo variants and network sizes
Fast-forward ⏩ alignment research from @GoogleDeepMind ! Our latest results enhance alignment outcomes in Large Language Models (LLMs). Presenting NashLLM!