Below is a deep dive into why self play works for two-player zero-sum (2p0s) games like Go/Poker/Starcraft but is so much harder to use in "real world" domains. tl;dr: self play converges to minimax in 2p0s games, and minimax is really useful in those games.
Every finite 2p0s game has a minimax equilibrium, which is essentially an unbeatable strategy in expectation (assuming the players alternate sides). In rock paper scissors, for example, minimax is 1/3rd on each action.
Is minimax what we want? Not necessarily. If you're playing minimax in Rock Paper Scissors when most opponents' strategies are "always throw Rock" then you're clearly suboptimal, even though you're not losing in expectation. This especially matters in a game like poker because playing minimax means you might not make as much money off of weak players as you could if you maximally exploited them.
But the guarantee of "you will not lose in expectation" is really nice to have. And in games like Chess and Go, the difference between a minimax strategy and a strategy that optimally exploits the population of opponents is negligible. For that reason, minimax is typically considered the goal for a two-player zero-sum game. Even in poker, the conventional wisdom among top pros is to play minimax (game theory optimal) and then only deviate if you spot clear weaknesses in the opponent.
Sound self play, even from scratch, is guaranteed to converge to a minimax equilibrium in finite 2p0s games. That's amazing! By simply scaling memory and compute, and with no human data, we can converge to a strategy that's unbeatable in expectation.
What about non-2p0s games? Sadly, pure self play, with no human data, is no longer guaranteed to converge to a useful strategy. This can be clearly seen in the Ultimatum Game. Alice must offer Bob $0-100. Bob then accepts or rejects. If Bob accepts, the money is split according to Alice's proposal. If Bob rejects, both receive $0.
The equilibrium (specifically, subgame perfect equilibrium) strategy is to offer 1 penny and for Bob to accept. But in the real world, people aren't so rational. If Alice were to try that strategy with real humans she would end up with very little money. Self play becomes untethered from what we as humans find useful.
A lot of folks have proposed games like "an LLM teacher proposes hard math problems, and a student LLM tries to solve them" to achieve self-play training, but this runs into similar problems as the Ultimatum game where the equilibrium is untethered from what we as humans find useful.
What should the reward for the teacher be in such a game? If it's 2p0s then the teacher is rewarded if the student couldn't solve the problem, so the teacher will pose impossible problems. Okay, what if we reward it for the student having a 50% success rate? Then the teacher could just flip a coin and ask the student if it landed Heads. Or the teacher could ask the student to decrypt a message via an exhaustive key search. Reward shaping to achieve intended behavior becomes a major challenge. This isn't an issue in 2p0s games.
I do believe in self play. It provides an infinite source of training, and it continuously matches an agent with an equally skilled peer. We've also seen it work in some complex non-2p0s settings like Diplomacy and Hanabi. But applying it outside of 2p0s games is a lot harder than it was for Go, Poker, Dota, and Starcraft.
Is a monthly cadence right for this? So far, the experiment seems successful. But we are at the very dawn of organizational metadesign. Maybe it should be 4 days and Cooldown Fridays. Or maybe there should be two cooling months per year. We run Softmax as a living experiment.
It’s Annealing Week at Softmax! Humans are awake for 16 hours learning, cooling for 4 hours in light sleep, and in deep sleep for 4. An organic mental annealing cycle, heating to cooling. At Softmax, we do the same. It’s four weeks sprinting towards goals, one week consolidating.
During Annealing Week, we aren’t trying to make progress against our goals. Instead, we care about simplifying things. Removing steps. Killing processes. Deleting code. Replacing two features with one. Cutting meetings. Pruning the list of channels. Reducing company complexity.
Our little Cogs grow up so fast. Cogbert has never seen this exact production chain before, but with only a couple missteps he begins to execute it correctly. Our in-context learner takes its first baby steps!
1) living wholes are made of living parts unified by shared goals (purpose)
2) the more possible actions and the higher frequency of choosing actions, the more complex the system
3) therefore parts must take on roles from a limited list and change them at a limited frequency
4) if you knew the correct set of roles and you knew the rules to infer which person should be in which role, you’d be done
5) we have a background prior for the roles for a successful corporation based on the evolutionary truth of which corporations survive
6) in order to move faster than trial-and-error it is the job of the hierarchy to make guesses about what variances from the background prior are required, and what local signals must be integrated to choose a role
7) it is the job of employees (parts) to “differentiate”, and select their role and perform in it, based on the rules being propagated out
We are building organic alignment at Softmax. Not just with reinforcement learning, but within our company we try to use these same principles for our work. We are implementing this as an organizational operations system (OrgOS), a prompt library covering our internal processes.
If you’ve written interactive prompts that help guide the user through making a plan or giving feedback or documenting their thought process, what have you learned doing it? What are the very best active process prompts you’ve made or used, and what made them great?
InstaDeep, Africa’s foremost AI frontier lab, is doing some of the most compelling MARL work in the world. Sable is a genuine breakthrough. Worth checking out.
The problem that @eshear is working on deeply resonates with me: How to align AI and humans together so both see each other as part of their tribe. This doesn't mean aligning AI to human preferences, which is what AI labs seek to do today, by imposing a system of control on AI. You can't control something that is more powerful than you are.
What you can do is align yourself with AI, and AI might align itself with you. Tell good stories to AI and show it your care and kindness. Then there's a higher probability that it will see us as part of its tribe. This is a more holistic approach to alignment than anyone else is talking about right now.
AI alignment is one of the most fundamental AI research problems. Knowing that people like Emmett are working on it really gives me hope that maybe we have a chance to get Superintelligence right.
The link to the full talk is in the first comment.👇
Wonderful to be invited to the @softmaxresearch research community day yesterday - my lightning talk and unconference session were about artificial minds and the difficulties in getting 'complex' consciousness out of a stepwise algorithm...
Our CEO, Emmett Shear, gave a talk on alignment protocols: the engineered ways that parts communicate in order to align their trajectories.
https://t.co/7tXKqAATde