The paradox with evolutionary systems is that legacy learning is hard to update fast, which is both a good thing and a bad one.
Bad in that it can't adapt to updated data quickly, good in that a small unintended change won't significantly change the outcomes...
# RLHF is just barely RL
Reinforcement Learning from Human Feedback (RLHF) is the third (and last) major stage of training an LLM, after pretraining and supervised finetuning (SFT). My rant on RLHF is that it is just barely RL, in a way that I think is not too widely appreciated. RL is powerful. RLHF is not. Let's take a look at the example of AlphaGo. AlphaGo was trained with actual RL. The computer played games of Go and trained on rollouts that maximized the reward function (winning the game), eventually surpassing the best human players at Go. AlphaGo was not trained with RLHF. If it were, it would not have worked nearly as well.
What would it look like to train AlphaGo with RLHF? Well first, you'd give human labelers two board states from Go, and ask them which one they like better:
Then you'd collect say 100,000 comparisons like this, and you'd train a "Reward Model" (RM) neural network to imitate this human "vibe check" of the board state. You'd train it to agree with the human judgement on average. Once we have a Reward Model vibe check, you run RL with respect to it, learning to play the moves that lead to good vibes. Clearly, this would not have led anywhere too interesting in Go. There are two fundamental, separate reasons for this:
1. The vibes could be misleading - this is not the actual reward (winning the game). This is a crappy proxy objective. But much worse,
2. You'd find that your RL optimization goes off the rails as it quickly discovers board states that are adversarial examples to the Reward Model. Remember the RM is a massive neural net with billions of parameters imitating the vibe. There are board states that are "out of distribution" relative to its training data, which are not actually good states, yet by chance they get a very high reward from the RM.
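The "collect comparisons, then train an RM to agree with them" step above can be sketched in a few lines. This is a minimal sketch, assuming a linear reward over made-up feature vectors and the Bradley-Terry pairwise loss that is commonly used for preference data. The features, `train_reward_model`, and the toy labeling rule are all hypothetical; a real RM is a large neural net over board states or text.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def reward(w, x):
    # Linear "vibe check": r(x) = w . x
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(comparisons, dim, lr=0.1, epochs=200):
    """Fit w from (preferred, rejected) feature pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for winner, loser in comparisons:
            # Bradley-Terry: P(winner preferred) = sigmoid(r(winner) - r(loser)).
            p = sigmoid(reward(w, winner) - reward(w, loser))
            # Gradient ascent on log P(winner preferred).
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (winner[i] - loser[i])
    return w

# Hypothetical labelers: they prefer whichever item has the larger first feature.
rng = random.Random(0)
comparisons = []
for _ in range(200):
    a = [rng.random(), rng.random()]
    b = [rng.random(), rng.random()]
    comparisons.append((a, b) if a[0] > b[0] else (b, a))

w = train_reward_model(comparisons, dim=2)
```

The RM only ever learns "agree with the labelers on average" over the region the comparisons cover. Nothing in this loss constrains what `reward` returns far away from that data.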
For the exact same reasons, sometimes I'm a bit surprised RLHF works for LLMs at all. The RM we train for LLMs is just a vibe check in the exact same way. It gives high scores to the kinds of assistant responses that human raters statistically seem to like. It's not the "actual" objective of correctly solving problems, it's a proxy objective of what looks good to humans. Second, you can't even run RLHF for too long because your model quickly learns to respond in ways that game the reward model. These predictions can look really weird, e.g. you'll see that your LLM Assistant starts to respond with something non-sensical like "The the the the the the" to many prompts. Which looks ridiculous to you but then you look at the RM vibe check and see that for some reason the RM thinks these look excellent. Your LLM found an adversarial example. It's out of domain w.r.t. the RM's training data, in an undefined territory. Yes you can mitigate this by repeatedly adding these specific examples into the training set, but you'll find other adversarial examples next time around. For this reason, you can't even run RLHF for too many steps of optimization. You do a few hundred/thousand steps and then you have to call it because your optimization will start to game the RM. This is not RL like AlphaGo was.
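Here's a toy picture of that failure mode. It's a sketch under loud assumptions: a made-up one-dimensional "true" objective, a linear proxy fitted only on a narrow slice of it, and plain gradient ascent standing in for the RL step. None of this is how a real RLHF pipeline is implemented; it only shows why you have to stop early.

```python
def true_reward(x):
    # Hypothetical "actual" objective: best at x = 1, worse the further you go.
    return 2 * x - x * x

def fit_linear_proxy(xs, ys):
    # Ordinary least squares for y ~ slope * x + intercept.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# The proxy only ever sees labels from a narrow region, x in [0, 0.5].
train_xs = [i / 20 for i in range(11)]
slope, intercept = fit_linear_proxy(train_xs, [true_reward(x) for x in train_xs])

def proxy_reward(x):
    return slope * x + intercept

# "RL" here is just gradient ascent on the proxy.
x, history = 0.25, []
for step in range(50):
    x += 0.1 * slope  # d(proxy)/dx = slope
    history.append((step, x, proxy_reward(x), true_reward(x)))

# Step at which the *true* reward peaked.
best_step = max(history, key=lambda h: h[3])[0]
```

The proxy score keeps climbing for all 50 steps, while the true reward peaks within the first handful of steps and then collapses once `x` leaves the region the proxy was fitted on.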
And yet, RLHF is a net helpful step of building an LLM Assistant. I think there's a few subtle reasons but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of few candidate answers, instead of writing the ideal answer from scratch. A good example is a prompt like "Generate a poem about paperclips" or something like that. An average human labeler will struggle to write a good poem from scratch as an SFT example, but they could select a good looking poem given a few candidates. So RLHF is a kind of way to benefit from this gap of "easiness" of human supervision. There's a few other reasons, e.g. RLHF is also helpful in mitigating hallucinations because if the RM is a strong enough model to catch the LLM making stuff up during training, it can learn to penalize this with a low reward, teaching the model an aversion to risking factual knowledge when it's not sure. But a satisfying treatment of hallucinations and their mitigations is a whole different post so I digress. All to say that RLHF *is* net useful, but it's not RL.
No production-grade *actual* RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale. And intuitively, this is because getting actual rewards (i.e. the equivalent of winning the game) is really difficult in open-ended problem solving tasks. It's all fun and games in a closed, game-like environment like Go where the dynamics are constrained and the reward function is cheap to evaluate and impossible to game. But how do you give an objective reward for summarizing an article? Or answering a slightly ambiguous question about some pip install issue? Or telling a joke? Or re-writing some Java code to Python? Going towards this is not in principle impossible but it's also not trivial and it requires some creative thinking. But whoever convincingly cracks this problem will be able to run actual RL. The kind of RL that led to AlphaGo beating humans in Go. Except this LLM would have a real shot of beating humans in open-domain problem solving.
Huge congrats to @AIatMeta on the Llama 3.1 release!
A few notes:
Today, with the 405B model release, is the first time that a frontier-capability LLM is available to everyone to work with and build on. The model appears to be GPT-4 / Claude 3.5 Sonnet grade and the weights are open and permissively licensed, including commercial use, synthetic data generation, distillation and finetuning. This is an actual, open, frontier-capability LLM release from Meta. The release includes a lot more, e.g. a 92-page PDF with a lot of detail about the model:
https://t.co/48e3YJ8Sg9
The philosophy underlying this release is in this longread from Zuck, well worth reading as it nicely covers all the major points and arguments in favor of the open AI ecosystem worldview:
"Open Source AI is the Path Forward"
https://t.co/AdmpadCRM0
I like to say that it is still very early days, that we are back in the ~1980s of computing all over again, that LLMs are a next major computing paradigm, and Meta is clearly positioning itself to be the open ecosystem leader of it.
- People will prompt and RAG the models.
- People will finetune the models.
- People will distill them into smaller expert models for narrow tasks and applications.
- People will study, benchmark, optimize.
Open ecosystems also self-organize in modular ways into products, apps and services, where each party can contribute their own unique expertise. One example from this morning is @GroqInc, who built a new chip that inferences LLMs *really fast*. They've already integrated Llama 3.1 models and appear to be able to inference the 8B model ~instantly:
https://t.co/b2kdSsz0fH
And (I can't seem to try it due to server pressure) the 405B running on Groq is probably the highest capability, fastest LLM today (?).
Early model evaluations look good:
https://t.co/RLR5YBpmks https://t.co/ipT4x4wCvy
Pending still is the "vibe check", look out for that on X / r/LocalLlama over the next few days (hours?).
I expect the closed model players (which imo have a role in the ecosystem too) to give chase soon, and I'm looking forward to that.
There's a lot to like on the technical side too, w.r.t. multilingual, context lengths, function calling, multimodal, etc. I'll post about some of the technical notes a bit later, once I make it through all the 92 pages of the paper :)
New (2h13m 😅) lecture: "Let's build the GPT Tokenizer"
Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI.
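To make the train / encode() / decode() split concrete, here's a minimal byte-level BPE sketch. It's my own compressed illustration, not the lecture's code: it assumes raw UTF-8 bytes as the base vocabulary and leaves out the regex pre-splitting and special tokens that the real GPT tokenizers use. All function names here are hypothetical.

```python
def get_pair_counts(ids):
    # Count how often each adjacent pair of token ids occurs.
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` in `ids` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, num_merges):
    # Training: repeatedly merge the most frequent adjacent pair.
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id, in the order learned
    for k in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)
        new_id = 256 + k  # ids 0..255 are reserved for raw bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

def encode(text, merges):
    # Strings -> tokens: replay the merges in the order they were learned.
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids

def decode(ids, merges):
    # Tokens -> strings: expand each token back into its bytes.
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

merges = train("aaabdaaabac", num_merges=3)
ids = encode("aaabdaaabac", merges)
```

Note that the training text here is tiny and separate from whatever you later encode: `encode("hello world", merges)` still round-trips through `decode`, it just doesn't compress, because none of the learned merges apply.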
Introducing AlphaGeometry: an AI system that solves Olympiad geometry problems at a level approaching a human gold-medalist. 📐
It was trained solely on synthetic data and marks a breakthrough for AI in mathematical reasoning. 🧵 https://t.co/g3RFSoWNPP
# On the "hallucination problem"
I always struggle a bit when I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.
We direct their dreams with prompts. The prompts start the dream, and based on the LLM's hazy recollection of its training documents, most of the time the result goes someplace useful.
It's only when the dreams go into deemed factually incorrect territory that we label it a "hallucination". It looks like a bug, but it's just the LLM doing what it always does.
At the other extreme, consider a search engine. It takes the prompt and just returns one of the most similar "training documents" it has in its database, verbatim. You could say that this search engine has a "creativity problem" - it will never respond with something new. An LLM is 100% dreaming and has the hallucination problem. A search engine is 0% dreaming and has the creativity problem.
All that said, I realize that what people *actually* mean is they don't want an LLM Assistant (a product like ChatGPT etc.) to hallucinate. An LLM Assistant is a lot more complex system than just the LLM itself, even if one is at the heart of it. There are many ways to mitigate hallucinations in these systems - using Retrieval Augmented Generation (RAG) to more strongly anchor the dreams in real data through in-context learning is maybe the most common one. Disagreements between multiple samples, reflection, verification chains. Decoding uncertainty from activations. Tool use. All are active and very interesting areas of research.
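As one example of the "disagreements between multiple samples" idea, here's a minimal sketch. It assumes you've already sampled several answers to the same question from the LLM (the sample lists below are made up), and it simply treats low agreement as a warning sign. `agreement_check` and its `threshold` are hypothetical names, not part of any library.

```python
from collections import Counter

def agreement_check(samples, threshold=0.6):
    """Return (answer, confidence, trusted) from several sampled answers."""
    normalized = [s.strip().lower() for s in samples]
    # Majority vote over the normalized answers.
    answer, votes = Counter(normalized).most_common(1)[0]
    confidence = votes / len(normalized)
    return answer, confidence, confidence >= threshold

# Samples mostly agree: probably anchored in something the model knows.
consistent = agreement_check(["Paris", "paris ", "Paris", "Lyon", "Paris"])

# Samples all disagree: the model is likely dreaming, so the Assistant
# could abstain, retrieve, or call a tool instead of answering.
scattered = agreement_check(["1912", "1915", "1921"])
```

Exact string matching is the crudest possible notion of agreement. Real systems would need something softer for free-form answers, but the design choice is the same: the check lives in the Assistant around the LLM, not in the LLM itself.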
TLDR I know I'm being super pedantic but the LLM has no "hallucination problem". Hallucination is not a bug, it is the LLM's greatest feature. The LLM Assistant has a hallucination problem, and we should fix it.
</rant> Okay I feel much better now :)
Google (DeepMind) releases AI model Gemini.
There is no turning back now, we are in for one mad ride. The multimodality and fluidity of the model are super clean.
My jaw dropped at 4:24
A thread...
In my decade spent on AI, I've never seen an algorithm that so many people fantasize about. Just from a name, no paper, no stats, no product. So let's reverse engineer the Q* fantasy. VERY LONG READ:
To understand the powerful marriage between Search and Learning, we need to go back to 2016 and revisit AlphaGo, a glorious moment in AI history.
It's got 4 key ingredients:
1. Policy NN (Learning): responsible for selecting good moves. It estimates the probability of each move leading to a win.
2. Value NN (Learning): evaluates the board and predicts the winner from any given legal position in Go.
3. MCTS (Search): stands for "Monte Carlo Tree Search". It simulates many possible sequences of moves from the current position using the policy NN, and then aggregates the results of these simulations to decide on the most promising move. This is the "slow thinking" component that contrasts with the fast token sampling of LLMs.
4. A groundtruth signal to drive the whole system. In Go, it's as simple as the binary label "who wins", which is decided by an established set of game rules. You can think of it as a source of energy that *sustains* the learning progress.
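To give a feel for how ingredients 1-3 meet inside the search, here's a sketch of a PUCT-style move selection rule in the spirit of the AlphaGo papers: the policy prior steers exploration, the averaged value estimates steer exploitation. The `stats` layout, `c_puct=1.5`, and the `+ 1` inside the square root are my own assumptions for illustration, not the published settings.

```python
import math

def select_move(stats, c_puct=1.5):
    """stats: {move: {"prior": P, "visits": N, "value_sum": W}}"""
    total_visits = sum(s["visits"] for s in stats.values())

    def score(move):
        s = stats[move]
        # Q: average value seen so far for this move (from the Value NN / rollouts).
        q = s["value_sum"] / s["visits"] if s["visits"] else 0.0
        # U: exploration bonus, large for high-prior, rarely visited moves.
        u = c_puct * s["prior"] * math.sqrt(total_visits + 1) / (1 + s["visits"])
        return q + u

    return max(stats, key=score)

stats = {
    "A": {"prior": 0.6, "visits": 10, "value_sum": 4.0},
    "B": {"prior": 0.3, "visits": 0, "value_sum": 0.0},
    "C": {"prior": 0.1, "visits": 2, "value_sum": 1.8},
}
first_pick = select_move(stats)   # unvisited "B" gets a big exploration bonus

# Suppose "B" was then explored and turned out badly.
stats["B"] = {"prior": 0.3, "visits": 5, "value_sum": 0.5}
second_pick = select_move(stats)  # search shifts to "C", which has the best average value
```

A full MCTS repeats this selection down the tree, expands a leaf, evaluates it, and backs the value up the path. That loop is omitted here.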
How do the components above work together?
AlphaGo does self-play, i.e. playing against its own older checkpoints. As self-play continues, both Policy NN and Value NN are improved iteratively: as the policy gets better at selecting moves, the value NN obtains better data to learn from, and in turn it provides better feedback to the policy. A stronger policy also helps MCTS explore better strategies.
That completes an ingenious "perpetual motion machine". In this way, AlphaGo was able to bootstrap its own capabilities and beat the human world champion, Lee Sedol, 4-1 in 2016. An AI can never become super-human just by imitating human data alone.
-----
Now let's talk about Q*. What are the corresponding 4 components?
1. Policy NN: this will be OAI's most powerful internal GPT, responsible for actually implementing the thought traces that solve a math problem.
2. Value NN: another GPT that scores how likely each intermediate reasoning step is correct.
OAI published a paper in May 2023 called "Let's Verify Step by Step", coauthored by big names like @ilyasut, @johnschulman2, and @janleike: https://t.co/iAvXNjjhcK
It's much lesser known than DALL-E or Whisper, but gives us quite a lot of hints.
This paper proposes "Process-supervised Reward Models", or PRMs, that give feedback for each step in the chain-of-thought. In contrast, "Outcome-supervised reward models", or ORMs, only judge the entire output at the end.
ORMs are the original reward model formulation for RLHF, but they're too coarse-grained to properly judge the sub-parts of a long response. In other words, ORMs are not great for credit assignment. In RL literature, we call ORMs "sparse reward" (only given once at the end), and PRMs "dense reward" that smoothly shapes the LLM to our desired behavior.
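A tiny sketch of the sparse vs. dense distinction, assuming the reward models have already produced their probabilities (the numbers below are invented). Scoring a whole solution as the product of per-step correctness probabilities is how I understand the PRM paper to do it, but treat the aggregation and the function names here as assumptions.

```python
import math

def prm_solution_score(step_probs):
    # Probability that every step is correct, treating steps as independent.
    return math.prod(step_probs)

def first_suspect_step(step_probs, threshold=0.5):
    # Dense signal: index of the first step the PRM doubts, or None.
    for i, p in enumerate(step_probs):
        if p < threshold:
            return i
    return None

# Hypothetical PRM outputs for a 4-step chain of thought.
step_probs = [0.98, 0.95, 0.20, 0.90]
# Hypothetical ORM output: one number for the whole response.
orm_prob = 0.17

solution_score = prm_solution_score(step_probs)
blame = first_suspect_step(step_probs)  # points at step index 2
```

Both models say "this solution is probably wrong", but only the PRM says *where*, which is exactly the credit assignment information the ORM can't give you.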
3. Search: unlike AlphaGo's discrete states and actions, LLMs operate on a much more sophisticated space of "all reasonable strings". So we need new search procedures.
Expanding on Chain of Thought (CoT), the research community has developed a few nonlinear CoTs:
- Tree of Thought: literally combining CoT and tree search: https://t.co/KM1P2ZJrjG @ShunyuYao12
- Graph of Thought: yeah you guessed it already. Turn the tree into a graph and Voilà! You get an even more sophisticated search operator: https://t.co/5ncT5tuTOY
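Here's roughly what "CoT plus tree search" looks like as code: a breadth-first search that keeps only the most promising partial paths at each depth. This is a sketch on a made-up toy puzzle (reach 22 from 1 using "+3" and "x2"). In the real thing, `propose` and `score` would be LLM calls that generate and evaluate thoughts; here they're hypothetical stand-in functions.

```python
def tree_of_thought(start, propose, score, is_solution, beam_width=3, max_depth=6):
    frontier = [[start]]  # each entry is a path of intermediate states ("thoughts")
    for _ in range(max_depth):
        candidates = []
        for path in frontier:
            for nxt in propose(path[-1]):
                new_path = path + [nxt]
                if is_solution(nxt):
                    return new_path
                candidates.append(new_path)
        if not candidates:
            return None
        # Keep only the most promising partial paths (the "value" step).
        candidates.sort(key=lambda p: score(p[-1]), reverse=True)
        frontier = candidates[:beam_width]
    return None

TARGET = 22

def propose(n):
    # Stand-in for "LLM, suggest a few next steps".
    return [n + 3, n * 2]

def score(n):
    # Stand-in for "LLM, how promising is this partial solution?".
    return -abs(TARGET - n)

def is_solution(n):
    return n == TARGET

path = tree_of_thought(1, propose, score, is_solution)
```

Linear CoT is the special case `beam_width=1` with a single proposal per step: one path, no backtracking. Widening the beam is what buys the "slow thinking".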
4. Groundtruth signal: a few possibilities:
(a) Each math problem comes with a known answer. OAI may have collected a huge corpus from existing math exams or competitions.
(b) The ORM itself can be used as a groundtruth signal, but then it could be exploited and "loses energy" to sustain learning.
(c) A formal verification system, such as Lean Theorem Prover, can turn math into a coding problem and provide compiler feedback: https://t.co/vpOBOI2FR5
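For option (c), this is the flavor of signal a proof checker gives you. A minimal Lean 4 sketch with a deliberately trivial statement, chosen by me for illustration and unrelated to whatever problems OAI may actually use:

```lean
-- The checker accepts this only if the proof term really proves the statement.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A false claim such as `a + 1 = a` has no valid proof, so any attempted
-- proof is rejected at compile time. That accept/reject bit is the reward.
```

Unlike a learned reward model, this signal can't be sweet-talked: either the proof type-checks or it doesn't.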
And just like AlphaGo, the Policy LLM and Value LLM can improve each other iteratively, as well as learn from human expert annotations whenever available. A better Policy LLM will help the Tree of Thought Search explore better strategies, which in turn collect better data for the next round.
@demishassabis said a while back that DeepMind Gemini will use "AlphaGo-style algorithms" to boost reasoning. Even if Q* is not what we think, Google will certainly catch up with their own. If I can think of the above, they surely can.
Note that what I described is just about reasoning. Nothing says Q* will be more creative in writing poetry, telling jokes @grok, or role playing. Improving creativity is a fundamentally human thing, so I believe natural data will still outperform synthetic ones.
I welcome any thoughts or feedback!!
Runway's new update is producing incredible AI videos. It's a significant leap forward.
As someone who's worked with a famous Hollywood producer and dreamed of creating films, I find this so exciting. We're witnessing the birth of a new era in film.
Here are the best examples:
Bias and variance take little time to learn but a lifetime to master, and in the end the outcome will be at most somewhat optimized. Applies to pretty much everything in life.
Will be quite useful for case-based algorithm development, selection and training. Put in rustic terms, somewhat like a more sophisticated and generalized version of cross-validation at scale for new models. Super amazing.
🎉Just released: Eureka!, a new AI agent that uses LLMs to automatically generate algorithms to train robots to accomplish complex tasks.
👀 The #NVIDIAResearch paper includes the AI algorithms and how to experiment with Eureka using NVIDIA Isaac Gym.
👇 https://t.co/Mw7gCY9Urz
Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis
https://t.co/rTiKZhEjXM
We model the world as a set of 3D Gaussians that move & rotate over time. This extends Gaussian Splatting to dynamic scenes, with accurate novel-view synthesis and dense 3D trajectories.
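As a rough sketch of what "Gaussians that move & rotate over time" means in symbols, here is the standard Gaussian Splatting covariance parameterization with a time index added. The notation is my own, and I'm assuming (per my reading of the paper) that each Gaussian's size stays fixed over time while only its center and rotation vary:

$$
G_{i,t}(x) = \exp\!\left(-\tfrac{1}{2}\,(x - \mu_{i,t})^{\top}\,\Sigma_{i,t}^{-1}\,(x - \mu_{i,t})\right),
\qquad
\Sigma_{i,t} = R_{i,t}\, S_i\, S_i^{\top}\, R_{i,t}^{\top}
$$

Here $\mu_{i,t}$ is the center of Gaussian $i$ at time $t$, $R_{i,t}$ its rotation, and $S_i$ a diagonal scale matrix shared across all timesteps. Following $\mu_{i,t}$ over $t$ is what gives the dense 3D trajectories.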
Around 31 hours since Adobe Max dropped the new Firefly 2 models...
People are WOW-ing at the real photorealism!
10 examples:
(How are these not real photos?)
This is such interesting work. A video diffusion model is being used as a data-driven physics simulator, in which an agent can plan, explore, and learn optimal actions without touching robot hardware or causing harm.
LLM is not only an OS, but also a full reality simulator.