@ilyasut says there are more companies than ideas. @RichardSSutton, the man who wrote the Bitter Lesson, says LLMs aren't following it. @ylecun calls them a dead end.
The titans of the field are looking for the next paradigm. A new architecture. A missing primitive. What fundamental thing do these models lack?
The titans of yore said the same. In 1969, Minsky and Papert called neural networks a "sterile" direction of research. McCarthy, the man who coined "artificial intelligence," bet the field's future on symbolic logic instead. The field went dark for decades.
They were wrong. The idea was right. The scale was missing.
In 2012, @geoffreyhinton, @ilyasut, and Alex Krizhevsky took the same "dead" idea, added layers, and ran it on GPUs. AlexNet crushed ImageNet so hard the entire field pivoted overnight.
In 2017, transformers arrived: one architecture that swallowed vision, audio, and language whole. Because it could scale.
In 2020, GPT-3 showed that scaling transformers to 175 billion parameters produced capabilities nobody programmed in: translation, arithmetic, coding, all emerging from raw scale alone.
This keeps happening.
Each time: scale the right thing, and capabilities emerge that nobody predicted. Each time, right before the leap, the consensus said we'd hit a wall. When I saw a language model explain why a picture was funny, I knew @raykurzweil had been right all along. The pattern holds.
So when the consensus says it again, I pay attention to the pattern, not the consensus. (@DarioAmodei might agree: he's said the current paradigm has more room than people think.)
The skeptics are modelling machine intelligence with human primitives: memory, goals, world models, continual learning. But neural networks aren't brains. They don't need to learn the way we learn. The Perceptron didn't need to be more brain-like. It needed to be bigger and denser.
Some problems come with a silver lining: a cheap, reliable way to check the answer. Code passes its tests or it doesn't. A proof verifies or it doesn't. When you have a checker that doesn't lie, you don't need the model to be brilliant on every attempt. You need the system (model + harness) to find the right answer across many attempts, then train the model on its own verified successes so it gets better each cycle.
Four things determine whether this works:
How good is the base model?
How reliable is the checker?
How many attempts can you afford?
How different are those attempts from each other?
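A minimal sketch of the sample-and-check loop, to make the four questions concrete. Everything here is a toy assumption, not the setup from the full essay: `propose` stands in for the model (a blind guesser), `check` stands in for the verifier, and `success_probability` assumes attempts succeed independently with the same probability `p`.

```python
import random

def run_attempts(propose, check, budget):
    """Draw `budget` candidates and keep only the ones the checker accepts."""
    return [c for c in (propose() for _ in range(budget)) if check(c)]

def success_probability(p, k):
    """Chance that at least one of k attempts passes, assuming each attempt
    succeeds independently with probability p."""
    return 1.0 - (1.0 - p) ** k

rng = random.Random(0)  # fixed seed so the toy run is repeatable

def propose():
    # Toy "model" (hypothetical): guesses an integer with no skill at all.
    return rng.randint(1, 100)

def check(x):
    # Toy checker (hypothetical): cheap and exact, like a unit test.
    return x * x == 1369

verified = run_attempts(propose, check, budget=1000)
print(f"{len(verified)} verified successes out of 1000 attempts")
print(f"P(at least one success) at p=0.01, k=1000: {success_probability(0.01, 1000):.5f}")
```

The independence assumption is where diversity comes in: if all `k` attempts are the same guess, the chance of success is just `p`, no matter how large the budget.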
Improvements to the same factor add. Improvements across different factors multiply.
If you can find independent axes of improvement, each pushing a different factor, modest gains don't accumulate. They compound. (Think of the jump reasoning models had over just pure pre-trained models.)
How many independent axes can you find? I count 8. Better yet, they form a flywheel.
Cheaper hardware and sparse architectures that route to only relevant parameters make each attempt cost less. Training stability ensures the model reaches its potential at extreme scale. That's the foundation: a strong model that can afford thousands of attempts.
Memory gives each attempt context: instant recall, searchable history, persistent expertise across sessions. Search diversity ensures those thousands of attempts aren't all the same wrong guess. You recombine the best fragments across attempts into solutions better than any individual. Evolution, not auditions. Multi-agent coordination lets multiple copies attack from different angles, then folds the patterns back into one model. It learns to be a team inside a single mind.
Verification catches what worked and decides what the model learns from. Every approved solution becomes training data. Distillation trains the model on its own verified wins, so expensive search becomes cheap instincts.
Progress ≈ Base skill × Checker reliability × Attempt budget × Attempt diversity
If each axis gives even a 2x improvement and they're independent, eight axes isn't 16x. It's 256x.
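The arithmetic, as a quick check. The list `gains` is an illustrative assumption: eight axes, each worth exactly 2x, fully independent.

```python
import math

gains = [2.0] * 8  # assumed: eight independent axes, 2x each

additive = sum(gains)              # gains on the same factor add
multiplicative = math.prod(gains)  # gains on different factors multiply

print(f"additive: {additive:.0f}x, multiplicative: {multiplicative:.0f}x")
```

Real axes are unlikely to be perfectly independent, so treat 256x as the ceiling this assumption implies, not a forecast.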
A model faces a hard math problem. Cheap compute and sparse routing let it generate 1,000 diverse attempts. Verification catches the three with useful fragments. Those get synthesized. Distillation bakes it in. Next time: 50 attempts. Then 5. Then first try, and the thousand attempts shift to something harder. Each round, the frontier moves. The system improves at improving.
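A toy version of that flywheel, heavily simplified. The "model" here is just a probability table over 1,000 candidate answers, and `distill` is a hypothetical stand-in for fine-tuning that multiplies the weight of a verified answer by an assumed `boost`. The value 20 is picked only so the expected attempt counts fall roughly along the 1,000, 50, 5, 1 trajectory above.

```python
import random

def attempts_until_success(weights, check, rng, max_attempts=10_000):
    """Sample answers from the toy policy until the checker accepts one."""
    answers = list(weights)
    probs = [weights[a] for a in answers]
    for n in range(1, max_attempts + 1):
        guess = rng.choices(answers, weights=probs, k=1)[0]
        if check(guess):
            return n, guess
    return max_attempts, None  # budget exhausted, nothing verified

def distill(weights, verified, boost=20.0):
    """Shift probability mass toward a verified answer, then renormalize.
    A toy stand-in for training on verified successes."""
    updated = dict(weights)
    updated[verified] *= boost
    total = sum(updated.values())
    return {a: w / total for a, w in updated.items()}

rng = random.Random(0)
policy = {a: 1 / 1000 for a in range(1000)}  # uniform: about 1,000 attempts expected

def check(answer):
    # Hypothetical checker: exactly one candidate is correct.
    return answer == 421

for round_no in range(4):
    n, win = attempts_until_success(policy, check, rng)
    if win is not None:
        policy = distill(policy, win)
    print(f"round {round_no}: {n} attempts, P(correct) now {policy[421]:.3f}")
```

Any single run is noisy, since the attempt count each round is a random draw. The point is only the shape: each verified win makes the next search cheaper.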
Everything above is applying learned patterns to problems within reach.
Frontier problems are just built different.
The answer doesn't look like anything in the training data. There's a technique where the model keeps learning on a single hard problem at test time, updating its own behavior and adapting to one challenge. This has already crossed a line: code optimizations twice as fast as the best human solutions. Improvements to math constructions that were open problems for decades.
Not solving. Discovering.
Can those research instincts (when to explore, when to cut losses, how to sense you're getting warmer) be distilled back permanently? If yes, the system learns to discover.
The Perceptron was right but couldn't scale. Neural networks were right but needed GPUs. Deep learning was right but needed transformers and data. Every time, the bet on scale and search won. Every time, the people building clever domain-specific solutions were outrun by the people who found the next axis of scale. This is the Bitter Lesson, the most reliable pattern in AI history.
Like GSM outrunning technically superior CDMA in mobile networks, the current paradigm will likely entrench before alternatives mature. Not because it's optimal. Because it can scale.
We might not need a new idea. We might just need to finish the one we already have.
Full essay with technical details, equations, paper references, and caveats linked below.
https://t.co/DFwb4cb9Ha
The bitter lesson in 26 words:
Don’t be distracted by human knowledge, as AI has been historically.
Instead focus on methods for creating knowledge that scale with computation, like search and learning.
@zephyr_z9 let them gatekeep, I am awaiting the open source variants in 8 months, which will be served for cheap so we can ralph loop them into building everything in rust.