erwtokritos

@erwtokritos

Athens

Joined February 2009

1.1K Following

642 Followers

5.3K Posts

erwtokritos retweeted

Panathinaikos BC

@Paobcgr

14 days ago

"As if not a day has passed." Panathinaikos announces Željko Obradović for the next 3 years. The story continues… #paobcaktor

214

125

erwtokritos retweeted

Andrej Karpathy

@karpathy

almost 2 years ago

# RLHF is just barely RL

Reinforcement Learning from Human Feedback (RLHF) is the third (and last) major stage of training an LLM, after pretraining and supervised finetuning (SFT). My rant on RLHF is that it is just barely RL, in a way that I think is not too widely appreciated. RL is powerful. RLHF is not. Let's take a look at the example of AlphaGo. AlphaGo was trained with actual RL. The computer played games of Go and trained on rollouts that maximized the reward function (winning the game), eventually surpassing the best human players at Go. AlphaGo was not trained with RLHF. If it were, it would not have worked nearly as well.

What would it look like to train AlphaGo with RLHF? Well first, you'd give human labelers two board states from Go, and ask them which one they like better:

Then you'd collect say 100,000 comparisons like this, and you'd train a "Reward Model" (RM) neural network to imitate this human "vibe check" of the board state. You'd train it to agree with the human judgement on average. Once we have a Reward Model vibe check, you run RL with respect to it, learning to play the moves that lead to good vibes. Clearly, this would not have led anywhere too interesting in Go. There are two fundamental, separate reasons for this:

1. The vibes could be misleading - this is not the actual reward (winning the game). This is a crappy proxy objective. But much worse,
2. You'd find that your RL optimization goes off the rails as it quickly discovers board states that are adversarial examples to the Reward Model. Remember the RM is a massive neural net with billions of parameters imitating the vibe. There are board states that are "out of distribution" to its training data, which are not actually good states, yet by chance they get a very high reward from the RM.
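
A minimal sketch of the reward-model step described above, assuming a Bradley-Terry style pairwise loss and a toy linear scorer over made-up feature vectors. The post does not specify a loss or an architecture, so the names `preference_loss`, `reward`, and `train_reward_model`, and the two-feature data, are hypothetical and purely illustrative:

```python
import math

def preference_loss(r_preferred, r_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), written in a numerically stable form."""
    margin = r_preferred - r_rejected
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def reward(weights, features):
    """Toy linear reward model (assumption): a weighted sum of hand-made features."""
    return sum(w * x for w, x in zip(weights, features))

def train_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    """Fit weights so the human-preferred item in each pair scores higher."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Gradient of the loss w.r.t. the margin is -(1 - sigmoid(margin)).
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for i in range(n_features):
                weights[i] -= lr * grad_scale * (preferred[i] - rejected[i])
    return weights

# Each pair is (features of the item the labeler preferred, features of the other item).
comparisons = [([1.0, 0.0], [0.0, 1.0]), ([0.9, 0.2], [0.1, 0.8])]
weights = train_reward_model(comparisons, n_features=2)
```

The fitted scorer only agrees with the labelers on inputs that resemble the comparisons it saw. Nothing constrains its output elsewhere, which is where the adversarial examples in point 2 come from.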

For the exact same reasons, sometimes I'm a bit surprised RLHF works for LLMs at all. The RM we train for LLMs is just a vibe check in the exact same way. It gives high scores to the kinds of assistant responses that human raters statistically seem to like. It's not the "actual" objective of correctly solving problems, it's a proxy objective of what looks good to humans. Second, you can't even run RLHF for too long because your model quickly learns to respond in ways that game the reward model. These predictions can look really weird, e.g. you'll see that your LLM Assistant starts to respond with something non-sensical like "The the the the the the" to many prompts. Which looks ridiculous to you but then you look at the RM vibe check and see that for some reason the RM thinks these look excellent. Your LLM found an adversarial example. It's out of domain w.r.t. the RM's training data, in an undefined territory. Yes you can mitigate this by repeatedly adding these specific examples into the training set, but you'll find other adversarial examples next time around. For this reason, you can't even run RLHF for too many steps of optimization. You do a few hundred/thousand steps and then you have to call it because your optimization will start to game the RM. This is not RL like AlphaGo was.

And yet, RLHF is a net helpful step of building an LLM Assistant. I think there's a few subtle reasons but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of a few candidate answers, instead of writing the ideal answer from scratch. A good example is a prompt like "Generate a poem about paperclips" or something like that. An average human labeler will struggle to write a good poem from scratch as an SFT example, but they could select a good-looking poem given a few candidates. So RLHF is a kind of way to benefit from this gap of "easiness" of human supervision. There's a few other reasons, e.g. RLHF is also helpful in mitigating hallucinations because if the RM is a strong enough model to catch the LLM making stuff up during training, it can learn to penalize this with a low reward, teaching the model an aversion to risking factual knowledge when it's not sure. But a satisfying treatment of hallucinations and their mitigations is a whole different post so I digress. All to say that RLHF *is* net useful, but it's not RL.

No production-grade *actual* RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale. And intuitively, this is because getting actual rewards (i.e. the equivalent of winning the game) is really difficult in open-ended problem-solving tasks. It's all fun and games in a closed, game-like environment like Go where the dynamics are constrained and the reward function is cheap to evaluate and impossible to game. But how do you give an objective reward for summarizing an article? Or answering a slightly ambiguous question about some pip install issue? Or telling a joke? Or re-writing some Java code to Python? Going towards this is not in principle impossible but it's also not trivial and it requires some creative thinking. But whoever convincingly cracks this problem will be able to run actual RL. The kind of RL that led to AlphaGo beating humans in Go. Except this LLM would have a real shot of beating humans in open-domain problem solving.

403

erwtokritos retweeted

Andrej Karpathy

@karpathy

over 2 years ago

# On the "hallucination problem"

I always struggle a bit when I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.

We direct their dreams with prompts. The prompts start the dream, and based on the LLM's hazy recollection of its training documents, most of the time the result goes someplace useful.

It's only when the dreams go into territory deemed factually incorrect that we label it a "hallucination". It looks like a bug, but it's just the LLM doing what it always does.

At the other end of the extreme consider a search engine. It takes the prompt and just returns one of the most similar "training documents" it has in its database, verbatim. You could say that this search engine has a "creativity problem" - it will never respond with something new. An LLM is 100% dreaming and has the hallucination problem. A search engine is 0% dreaming and has the creativity problem.

All that said, I realize that what people *actually* mean is they don't want an LLM Assistant (a product like ChatGPT etc.) to hallucinate. An LLM Assistant is a lot more complex system than just the LLM itself, even if one is at the heart of it. There are many ways to mitigate hallucinations in these systems - using Retrieval Augmented Generation (RAG) to more strongly anchor the dreams in real data through in-context learning is maybe the most common one. Disagreements between multiple samples, reflection, verification chains. Decoding uncertainty from activations. Tool use. All are active and very interesting areas of research.

TLDR I know I'm being super pedantic but the LLM has no "hallucination problem". Hallucination is not a bug, it is LLM's greatest feature. The LLM Assistant has a hallucination problem, and we should fix it.

</rant> Okay I feel much better now :)
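
One of the mitigations listed above, disagreement between multiple samples, can be sketched roughly as follows. This assumes the answers have already been sampled from some model for the same prompt. The helper names `agreement_score` and `needs_review`, the lowercase/strip normalization, and the 0.7 threshold are all hypothetical choices, not something the post prescribes:

```python
from collections import Counter

def agreement_score(answers):
    """Return the most common normalized answer and the fraction of samples that match it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    return top_answer, count / len(normalized)

def needs_review(answers, threshold=0.7):
    """Flag a prompt when the samples disagree too much (threshold is an arbitrary assumption)."""
    _, score = agreement_score(answers)
    return score < threshold

# Three hypothetical samples for the same factual question.
samples = ["Paris", "paris ", "Lyon"]
top, score = agreement_score(samples)
flagged = needs_review(samples)
```

Exact string matching only makes sense for short factual answers. Longer free-form responses would need a softer notion of agreement.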

675

15K

erwtokritos @erwtokritos

over 2 years ago

Treating target determination (i.e. matching unit tests with PR code) as an Information Retrieval problem from the @PyTorch team https://t.co/oM4QQQvgca Quite interesting

118

erwtokritos @erwtokritos

over 2 years ago

Rainn Wilson: "I was so unhappy during The Office!" (Dwight Schrute) https://t.co/INrt3IjCi6 via @YouTube

erwtokritos @erwtokritos

almost 3 years ago

@Kessaris_ Well done! 👏👏

274

erwtokritos retweeted

Panathinaikos F.C.

@paofc_

almost 3 years ago

It's over! We're through! Mladenović scores! We're in the play-offs! #Panathinaikos #PAOFC #OMPAO #UCL

125

641

251K

erwtokritos retweeted

François Chollet

@fchollet

almost 3 years ago

We're launching Keras Core, a new library that brings the Keras API to JAX and PyTorch in addition to TensorFlow. It enables you to write cross-framework deep learning components and to benefit from the best that each framework has to offer. Read more: https://t.co/xmmxBfSZgh

118

765

504

963K

erwtokritos @erwtokritos

about 3 years ago

@OpenAI 's Cookbook: Techniques to improve reliability https://t.co/lkrOtxVzJF

erwtokritos @erwtokritos

over 3 years ago

The Age of AI has begun https://t.co/KGqtGJL8h8

erwtokritos @erwtokritos

over 3 years ago

@SokratisVidros Congratulations👏

erwtokritos @erwtokritos

over 3 years ago

@Kessaris_ Well deserved!

479

erwtokritos retweeted

The New Yorker

@NewYorker

over 3 years ago

Can large-language models take the place of traditional search engines? https://t.co/IByMN28zb5

87K

erwtokritos retweeted

Aparadektoi Current Day

@aparadektoi1991

over 3 years ago

many happy returns, Thanasis, you idiot.

206

13K

erwtokritos retweeted

Aparadektoi Current Day

@aparadektoi1991

over 3 years ago

blue monday.

312

23K

erwtokritos retweeted

François Chollet

@fchollet

over 3 years ago

New tutorial on https://t.co/m6mT8SaHBD: fine-tuning Stable Diffusion on your own dataset. In this case, by the end of the tutorial you will be able to generate novel Pokemons :) https://t.co/SIcDEotLwO Created by @RisingSayak and @algo_diver 👍

514

200

100K

erwtokritos @erwtokritos

over 3 years ago

"Hacking Google" series 👏👍👌 https://t.co/yYEsgGrD3e

erwtokritos @erwtokritos

over 3 years ago

Amazon's Machine Learning University debuts responsible AI course https://t.co/ruUENJTEta

erwtokritos retweeted

Themis Kessaris

@Kessaris_

over 3 years ago

The day football won, forever.

349

110K

erwtokritos @erwtokritos

over 3 years ago

Excellent work, as always 👏👏

Themis Kessaris

@Kessaris_

over 3 years ago

Truths and lies of the Super League

The definitive analysis of the 1st round:
- Your team in defense and attack
- The playing style of every team, with and without the ball
- The top players in open play

https://t.co/raiRzUxhSD

303
