Softmax

@softmaxresearch

Softmax's mission is to scale organic alignment. We approach this problem with multi-agent reinforcement learning population-based simulations.

San Francisco, CA

Joined February 2025

32 Following

1.3K Followers

47 Posts

softmaxresearch retweeted

Noam Brown

@polynoamial

8 months ago

Below is a deep dive into why self play works for two-player zero-sum (2p0s) games like Go/Poker/Starcraft but is so much harder to use in "real world" domains. tl;dr: self play converges to minimax in 2p0s games, and minimax is really useful in those games. Every finite 2p0s game has a minimax equilibrium, which is essentially an unbeatable strategy in expectation (assuming the players alternate sides). In rock paper scissors, for example, minimax is 1/3rd on each action. Is minimax what we want? Not necessarily. If you're playing minimax in Rock Paper Scissors when most opponents' strategies are "always throw Rock" then you're clearly suboptimal, even though you're not losing in expectation. This especially matters in a game like poker because playing minimax means you might not make as much money off of weak players as you could if you maximally exploited them. But the guarantee of "you will not lose in expectation" is really nice to have. And in games like Chess and Go, the difference between a minimax strategy and a strategy that optimally exploits the population of opponents is negligible. For that reason, minimax is typically considered the goal for a two-player zero-sum game. Even in poker, the conventional wisdom among top pros is to play minimax (game theory optimal) and then only deviate if you spot clear weaknesses in the opponent. Sound self play, even from scratch, is guaranteed to converge to a minimax equilibrium in finite 2p0s games. That's amazing! By simply scaling memory and compute, and with no human data, we can converge to a strategy that's unbeatable in expectation. What about non-2p0s games? Sadly, pure self play, with no human data, is no longer guaranteed to converge to a useful strategy. This can be clearly seen in the Ultimatum Game. Alice must offer Bob $0-100. Bob then accepts or rejects. If Bob accepts, the money is split according to Alice's proposal. If Bob rejects, both receive $0. The equilibrium (specifically, subgame perfect equilibrium) strategy is to offer 1 penny and for Bob to accept. But in the real world, people aren't so rational. If Alice were to try that strategy with real humans she would end up with very little money. Self play becomes untethered from what we as humans find useful. A lot of folks have proposed games like "an LLM teacher proposes hard math problems, and a student LLM tries to solve them" to achieve self-play training, but this runs into similar problems as the Ultimatum game where the equilibrium is untethered from what we as humans find useful. What should the reward for the teacher be in such a game? If it's 2p0s then the teacher is rewarded if the student couldn't solve the problem, so the teacher will pose impossible problems. Okay, what if we reward it for the student having a 50% success rate? Then the teacher could just flip a coin and ask the student if it landed Heads. Or the teacher could ask the student to decrypt a message via an exhaustive key search. Reward shaping to achieve intended behavior becomes a major challenge. This isn't an issue in 2p0s games. I do believe in self play. It provides an infinite source of training, and it continuously matches an agent with an equally skilled peer. We've also seen it work in some complex non-2p0s settings like Diplomacy and Hanabi. But applying it outside of 2p0s games is a lot harder than it was for Go, Poker, Dota, and Starcraft.

polynoamial's tweet photo. Below is a deep dive into why self play works for two-player zero-sum (2p0s) games like Go/Poker/Starcraft but is so much harder to use in "real world" domains. tl;dr: self play converges to minimax in 2p0s games, and minimax is really useful in those games.

Every finite 2p0s game has a minimax equilibrium, which is essentially an unbeatable strategy in expectation (assuming the players alternate sides). In rock paper scissors, for example, minimax is 1/3rd on each action.

Is minimax what we want? Not necessarily. If you're playing minimax in Rock Paper Scissors when most opponents' strategies are "always throw Rock" then you're clearly suboptimal, even though you're not losing in expectation. This especially matters in a game like poker because playing minimax means you might not make as much money off of weak players as you could if you maximally exploited them.

But the guarantee of "you will not lose in expectation" is really nice to have. And in games like Chess and Go, the difference between a minimax strategy and a strategy that optimally exploits the population of opponents is negligible. For that reason, minimax is typically considered the goal for a two-player zero-sum game. Even in poker, the conventional wisdom among top pros is to play minimax (game theory optimal) and then only deviate if you spot clear weaknesses in the opponent.

Sound self play, even from scratch, is guaranteed to converge to a minimax equilibrium in finite 2p0s games. That's amazing! By simply scaling memory and compute, and with no human data, we can converge to a strategy that's unbeatable in expectation.

What about non-2p0s games? Sadly, pure self play, with no human data, is no longer guaranteed to converge to a useful strategy. This can be clearly seen in the Ultimatum Game. Alice must offer Bob $0-100. Bob then accepts or rejects. If Bob accepts, the money is split according to Alice's proposal. If Bob rejects, both receive $0.

The equilibrium (specifically, subgame perfect equilibrium) strategy is to offer 1 penny and for Bob to accept. But in the real world, people aren't so rational. If Alice were to try that strategy with real humans she would end up with very little money. Self play becomes untethered from what we as humans find useful.

A lot of folks have proposed games like "an LLM teacher proposes hard math problems, and a student LLM tries to solve them" to achieve self-play training, but this runs into similar problems as the Ultimatum game where the equilibrium is untethered from what we as humans find useful.

What should the reward for the teacher be in such a game? If it's 2p0s then the teacher is rewarded if the student couldn't solve the problem, so the teacher will pose impossible problems. Okay, what if we reward it for the student having a 50% success rate? Then the teacher could just flip a coin and ask the student if it landed Heads. Or the teacher could ask the student to decrypt a message via an exhaustive key search. Reward shaping to achieve intended behavior becomes a major challenge. This isn't an issue in 2p0s games.

I do believe in self play. It provides an infinite source of training, and it continuously matches an agent with an equally skilled peer. We've also seen it work in some complex non-2p0s settings like Diplomacy and Hanabi. But applying it outside of 2p0s games is a lot harder than it was for Go, Poker, Dota, and Starcraft.

166

318K

Softmax

@softmaxresearch

Last Seen Users on Sotwe

Trends for you

Most Popular Users