Yunhao (Robin) Tang @robinphysics - Twitter Profile

Pinned Tweet

12 months ago

Maybe to one's surprise, taking KL estimates as `kl_loss` to minimize does *not* enforce the KL. This implementation, however, is quite common in open source RL repos and recent research papers. In short: grad of an unbiased KL estimate is not an unbiased estimate of KL grad.

Photo: https://t.co/19DfZu0zi8

15

658

54

608

71K
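One way to read the last sentence of the pinned tweet, sketched under the assumption that samples are drawn on-policy from $\pi_\theta$ and that $k_\theta(x)$ stands for any per-sample KL estimate (this notation is introduced here for illustration and does not appear in the tweet):

```latex
\nabla_\theta \, \mathbb{E}_{x \sim \pi_\theta}\!\left[ k_\theta(x) \right]
  = \underbrace{\mathbb{E}_{x \sim \pi_\theta}\!\left[ \nabla_\theta k_\theta(x) \right]}_{\text{what differentiating a sampled } \texttt{kl\_loss} \text{ gives}}
  + \underbrace{\mathbb{E}_{x \sim \pi_\theta}\!\left[ k_\theta(x)\, \nabla_\theta \log \pi_\theta(x) \right]}_{\text{contribution of the sampling distribution}}
```

Differentiating the sampled estimate keeps only the first term. For example, with $k_\theta(x) = \log \pi_\theta(x) - \log \pi_{\mathrm{ref}}(x)$ the first term is $\mathbb{E}_{x \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(x)] = 0$, so the entire KL gradient sits in the second term.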

Yunhao (Robin) Tang @robinphysics

12 months ago

@canaesseth @xidulu Indeed! Sorry the post misses the nuance in this regard. It was mostly referring to some very specific recent RL implementations.

0

3

0

195

Yunhao (Robin) Tang @robinphysics

12 months ago

@y0b1byte Thanks so much for the kind words!

0

1

0

80

Yunhao (Robin) Tang @robinphysics

12 months ago

Taking the k3 estimate as an example (from John's popular blogpost https://t.co/7jMc9iDivF). Contrary to popular practice, differentiating the estimate as a loss ends up enforcing the reverse-KL, but only incidentally. See more details: https://t.co/OAdzTqjsFT

0

33

6

35

5K
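The k3 claim can be checked exactly on a small categorical policy, with no sampling noise. The sketch below assumes a softmax policy over a handful of outcomes, on-policy samples from `pi`, and the k1/k2/k3 definitions with `r = ref / pi`; all function names are hypothetical. It computes the expected gradient of each estimate when used as a loss and compares it with finite-difference gradients of KL(pi ‖ ref) and of KL(ref ‖ pi). Reading the tweet's "reverse-KL" as the KL with its arguments swapped relative to the one k3 estimates is an assumption made here.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def kl(p, q):
    """Exact KL(p || q) for two categorical distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def finite_diff_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

def expected_surrogate_grads(theta, ref):
    """Expected gradient, under x ~ pi, of each KL estimate used as a loss.

    Samples are held fixed when differentiating, which is what autodiff
    does with a sampled `kl_loss`.
    """
    pi = softmax(theta)
    # Row x holds grad_theta log pi(x) = e_x - pi for a softmax policy.
    score = np.eye(theta.size) - pi
    log_ratio = np.log(pi) - np.log(ref)  # log(pi / ref)
    # d k / d log pi(x) for each estimate, with r = ref / pi:
    coeffs = {
        "k1": np.ones_like(pi),   # k1 = -log r
        "k2": log_ratio,          # k2 = 0.5 * (log r)^2
        "k3": 1.0 - ref / pi,     # k3 = r - 1 - log r
    }
    # sum_x pi(x) * coeff(x) * score[x, :]
    return {name: (pi * c) @ score for name, c in coeffs.items()}

theta = np.array([0.3, -1.2, 0.8, 0.1])            # arbitrary policy logits
ref = softmax(np.array([-0.5, 0.4, 0.0, 1.0]))     # arbitrary reference policy

grads = expected_surrogate_grads(theta, ref)
grad_kl_pi_ref = finite_diff_grad(lambda t: kl(softmax(t), ref), theta)
grad_kl_ref_pi = finite_diff_grad(lambda t: kl(ref, softmax(t)), theta)

for name, g in grads.items():
    print(name, np.round(g, 4))
print("grad KL(pi || ref)", np.round(grad_kl_pi_ref, 4))
print("grad KL(ref || pi)", np.round(grad_kl_ref_pi, 4))
```

In this toy setting the k1 loss has zero expected gradient, the k2 loss matches the gradient of KL(pi ‖ ref), and the k3 loss matches the gradient of KL(ref ‖ pi) rather than that of the KL it estimates.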

Yunhao (Robin) Tang @robinphysics

12 months ago

It was refreshing to see the impact that small algorithmic changes have on the system performance. While the “double-sided” PPO/GRPO clipping is dominant in the literature, we argue that a single-sided clipping akin to IMPALA fits the design of distributed training more.

Photo: https://t.co/UQdQRZV9yh

0

15

0

5

1K
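To make the contrast between the two clipping styles concrete, here is a minimal sketch of the per-sample coefficient that multiplies grad log pi in each case. It assumes the standard PPO clipped surrogate for the "double-sided" case and an IMPALA-style truncated importance weight, min(ratio, rho_bar) treated as a constant, for the "single-sided" case. The tweet does not spell out the exact objective, so the second function, its name, and the `rho_bar` parameter are illustrative assumptions.

```python
import numpy as np

def ppo_grad_coeff(ratio, adv, eps=0.2):
    """Coefficient on grad log pi for the PPO clipped surrogate
    min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv).
    """
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    # The clipped branch is active, and contributes zero gradient, when the
    # ratio has already moved past the bound in the direction adv favors.
    clipped = ((adv > 0) & (ratio > 1 + eps)) | ((adv < 0) & (ratio < 1 - eps))
    return np.where(clipped, 0.0, ratio * adv)

def truncated_is_grad_coeff(ratio, adv, rho_bar=1.0):
    """Coefficient on grad log pi when the importance ratio is truncated from
    above only and used as a constant (stop-gradient) weight on the advantage.
    """
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    return np.minimum(ratio, rho_bar) * adv

ratios = np.array([0.5, 1.0, 2.0])
print(ppo_grad_coeff(ratios, np.ones(3)))            # positive advantages
print(ppo_grad_coeff(ratios, -np.ones(3)))           # negative advantages
print(truncated_is_grad_coeff(ratios, np.ones(3)))
print(truncated_is_grad_coeff(ratios, -np.ones(3)))
```

Under these assumptions the PPO surrogate zeroes out some samples entirely, while the truncated weight only caps how much an off-policy sample can contribute and never discards it.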

Yunhao (Robin) Tang @robinphysics

12 months ago

Introducing LlamaRL, a distributed RL framework for training LLM at scale. LlamaRL is highly modular, Pytorch-native, customizes optimization of actors/learners to max out throughput, and adjusts for systemic off-policyness to stabilize training https://t.co/oXjCEdh2lS

Photo: https://t.co/bEpPB6c8Eq

4

298

47

235

28K

Yunhao (Robin) Tang @robinphysics

almost 2 years ago

@c_voelcker @charlinelelan Many thanks @c_voelcker for the kind words! Very glad that our past investigation can be of help to your exciting new study here, look forward to reading in more details!

0

5

0

109

robinphysics retweeted

Zac Kenton @ZacKenton1

almost 2 years ago

Eventually, humans will need to supervise superhuman AI - but how? Can we study it now? We don't have superhuman AI, but we do have LLMs. We study protocols where a weaker LLM uses stronger ones to find better answers than it knows itself. Does this work? It’s complicated: 🧵👇

Photo: https://t.co/tc79PjOBnz

5

241

57

163

53K

Yunhao (Robin) Tang @robinphysics

almost 2 years ago

@PandaAshwinee Thanks! Sorry completely missed the reply here... Indeed H2 is quite surprising. I think it's mainly bc contrastive losses don't work well w/ offline data. That is pi(y_w) / pi(y_l) can increase while both pi(y_w) and pi(y_l) are low. If we change to Bo2, H2 is less prominent.

0

39
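A toy illustration of the point about contrastive losses: the numbers below are made up, and the loss is a simplified logistic loss on the log-probability margin (reference-policy terms, which real preference losses usually include, are omitted). It only shows that such a loss can go down while the likelihood of the preferred response also goes down.

```python
import math

def contrastive_loss(logp_w, logp_l, beta=1.0):
    """Simplified logistic loss, -log sigmoid(beta * (logp_w - logp_l))."""
    margin = beta * (logp_w - logp_l)
    return math.log1p(math.exp(-margin))

# Hypothetical log-probabilities of the preferred (w) and rejected (l) responses.
before = {"logp_w": -5.0, "logp_l": -6.0}
after = {"logp_w": -9.0, "logp_l": -12.0}

loss_before = contrastive_loss(**before)
loss_after = contrastive_loss(**after)

print("margin before:", before["logp_w"] - before["logp_l"])
print("margin after: ", after["logp_w"] - after["logp_l"])
print("loss before:", round(loss_before, 4), "loss after:", round(loss_after, 4))
```

Here the margin grows and the loss shrinks even though both responses became less likely under the policy.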

Yunhao (Robin) Tang @robinphysics

about 2 years ago

Online interaction is probably a defining property of RL. But with the rise of offline algo, it is not clear if the “online” bit of RL is necessary for RLHF. We hypothesis test the causes of the perf gap between online and offline alignment. https://t.co/fJ0731MHLF Details in🧵

Photo: https://t.co/gBAS3JbH3N

3

71

15

41

11K

Yunhao (Robin) Tang @robinphysics

about 2 years ago

Thanks @_akhaliq for promoting our work! Unlike regular RL where golden r(s,a) are available and online is generally deemed better than offline, in RLHF this is less clear. Complementary to some concurrent work, we investigate causes to the perf gap between online vs. offline.

AK

@_akhaliq

about 2 years ago

Understanding the performance gap between online and offline alignment algorithms Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, rising popularity in offline alignment algorithms challenge the need

Photo: https://t.co/UOgu1uXUUe

2

74

24

49

13K

0

16

4

5

2K

Yunhao (Robin) Tang @robinphysics

about 2 years ago

The findings ought to be taken with a grain of salt due to limitations in our experimental setups. But hopefully this investigation contributes to a better understanding of RLHF practices. Finally, very grateful to my collaborators @GoogleDeepMind on this fun project!

1

6

0

552

Yunhao (Robin) Tang @robinphysics

about 2 years ago

Some takeaways:
- There is something more to online than wider coverage of response generation
- Offline training improves the policy in a much more implicit way than online (discriminative vs. generative abilities)
- The gap persists across wider variants of algos and network sizes

1

4

0

559

robinphysics retweeted

Michal Valko

@misovalko

over 2 years ago

Fast-forward ⏩ alignment research from @GoogleDeepMind ! Our latest results enhance alignment outcomes in Large Language Models (LLMs). Presenting NashLLM!

Photo: https://t.co/50O5zVnA8V

4

791

126

509

193K
