This is probably the most entertaining way to understand one of AI’s hardest debates.
Transformer vs Post-Transformer, argued by leading researchers, inside a real physical boxing ring.
Both technically deep and genuinely entertaining.
I was glued for the entire 1 hour 20 minutes. So many super cool points to learn.
🥊 Transformers
- Transformers still own the present because they work at scale. They are simple, trainable, hardware-friendly, and already power the strongest AI systems we use today.
- The Transformer is basically a memory machine. It stores information as keys and values, then uses attention to pull back the most useful parts when answering.
- The real Transformer advantage is not just “attention.” The bigger advantage is that it fits modern hardware extremely well, so it can process huge batches of tokens fast.
- Scaling is still the brutal rule. If you give Transformers more compute, more data, and more parameters, they usually keep getting better. Any Post-Transformer architecture has to scale just as well, or better.
- It is not enough to look clever on small tests, because the real question is whether it improves faster than Transformers when scaled up.
- A replacement cannot be slightly better. Because the whole AI stack is already built around Transformers, the next architecture may need to be around 10x better to force everyone to switch.
- Transformers are powerful, but they may be brute force. A human does not need to read the entire internet many times to become smart, but current LLMs need enormous data and compute.
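To make the "memory machine" point concrete, here is a minimal NumPy sketch of one attention read over stored keys and values. It is an illustration only: the function name `attention_read`, the single-query, single-head setup, and the toy sizes are assumptions for this example, not something shown in the debate.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_read(query, keys, values):
    """Read from a key/value memory with scaled dot-product attention.

    query:  (d,)      what the model is looking for right now
    keys:   (t, d)    one stored key per past token
    values: (t, d_v)  one stored value per past token
    Returns the blended value (d_v,) and the attention weights (t,).
    """
    scores = keys @ query / np.sqrt(keys.shape[-1])  # similarity to each stored key
    weights = softmax(scores)                        # non-negative, sums to 1
    return weights @ values, weights                 # weighted mix of stored values
```

A query that closely matches one stored key pulls back almost exactly that key's value; a query that matches several keys gets a blend. Note that `keys` and `values` grow by one row per token, which is the "token history" the Post-Transformer side argues about.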
🥊 Post-Transformer
- Post-Transformer people are not saying Transformers are bad. They are saying Transformers may be the best current tool, not the final form of machine intelligence.
- The biggest Post-Transformer target is native reasoning and continual learning. Today’s LLM reasoning often feels like text-based step-by-step work added on top, instead of thinking happening naturally inside the model.
- Latent reasoning is one possible next step. That means the model reasons inside its own hidden internal space, instead of writing every thought out as words.
- Continual learning is still a major weakness. Humans keep learning from experience, but most Transformer-based models are trained, frozen, and then only adapt inside the prompt.
- Long context is not the same as real memory. A model can read a huge prompt, but that is different from building a life history, learning from mistakes, and updating beliefs over time.
- The future may be hybrid, not a clean replacement. Transformers may stay as 1 building block while newer systems add better memory, better reasoning, and better learning loops.
- The most interesting possibility is that Transformers may help discover their own successor. AI agents are already getting better at research and coding, so the next architecture may come from AI-assisted architecture search.
-------
- Benchmarks are a problem. Many public benchmarks are easy to game, so they may show leaderboard strength without proving deeper intelligence.
- Perplexity is still probably a great metric for evaluating frontier models, because it tests prediction quality.
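For readers who have not computed it by hand: perplexity is the exponential of the average negative log-probability the model assigned to the tokens that actually came next. A small sketch, assuming you already have those per-token probabilities (the function name `perplexity` and its list input are choices made for this example):

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to the true next tokens.

    token_probs: iterable of floats in (0, 1], one per predicted token.
    Lower is better; 1.0 means every token was predicted with certainty.
    """
    probs = list(token_probs)
    if not probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_nll)
```

A model that gives probability 0.25 to every correct token has perplexity 4, as if it were choosing uniformly among four options at each step.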
---
Overall, Transformers continue to dominate, but the frontier is clearly widening.
Pathway’s BDH (Dragon Hatchling — brain-inspired reasoning architecture), Sakana AI’s CTMs (Continuous Thought Machines — models that think over time), and Liquid AI’s LFMs (Liquid Foundation Models — efficient multimodal foundation models) - all of these show how the frontier is expanding.
---
From “Pathway (pathway[.]com)” YouTube channel (link in comment)
@zuzanna_pathway
Last week’s Post-Transformer debate post raised one question: Can long term memory become part of the architecture?
It points to one promising mathematical idea behind Post-Transformer AI: linear attention in high dimension with persistent state.
In a standard Transformer, memory is handled through caching context.
The model keeps previous keys and values in small dimension d, then attends over them. But this is still token history.
BDH (Dragon Hatchling), one of the Post-Transformer architectures, takes a different route.
The paper describes BDH's state space as fixed and large, with the macro interpretation of associative memory, like KV cache, but organized differently.
Each layer has a persistent state matrix: ρₗ ∈ Rⁿˣᵈ
Here:
n = neuronal or concept dimension
d = low rank synaptic dimension
d << n
The key idea is that state is aligned to neurons, in a high-dimensional space (n on the order of billions).
A Transformer stores token history, whereas BDH-GPU (a tensor-friendly version of the BDH architecture) evolves a state, similar to State-Space Models.
This is where the brain analogy becomes useful. The brain does not append every experience into a longer transcript. It has a large bounded substrate of neurons and synapses, where experience changes connections sparsely and with high parallelism.
BDH-GPU expresses a related idea computationally:
not memory as a longer context window,
but memory as a large, evolving internal state.
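A rough NumPy sketch of "memory as an evolving state" versus "memory as a growing cache". This is generic unnormalized linear attention with ReLU (non-negative) keys and queries, picked here as an assumption to illustrate the idea; it is not the update rule from the BDH paper. The class name `LinearAttentionState`, the helper `kv_cache_entries`, and all sizes are hypothetical.

```python
import numpy as np

class LinearAttentionState:
    """Fixed-size associative memory of shape (n, d), updated once per token."""

    def __init__(self, n, d):
        # Assumption: n plays the role of the large "neuronal" dimension, d the small one.
        self.state = np.zeros((n, d))

    def step(self, key, value, query):
        """key, query: (n,); value: (d,). Returns the read-out of shape (d,)."""
        k = np.maximum(key, 0.0)          # non-negative, typically sparse after ReLU
        q = np.maximum(query, 0.0)
        self.state += np.outer(k, value)  # write: state changes, its shape does not
        return q @ self.state             # read: one matrix-vector product

def kv_cache_entries(num_tokens, d):
    """Numbers stored by a per-layer KV cache: one key and one value per token."""
    return 2 * num_tokens * d
```

After any number of `step` calls the state is still `(n, d)`, while `kv_cache_entries` grows linearly with the number of tokens seen. The read-out at step t equals the sum over all earlier tokens of `(q · k_s) * v_s`, so nothing is re-read from a transcript; the history has been folded into the state.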
Why it matters:
– no Transformer-style hard context window, practically enabling an infinite context window in a reasoning model
– linear attention in a large neuronal dimension
– sparse positive activations
– persistent state instead of only token history
The deeper insight:
Long horizon reasoning may not come from storing more tokens.
It may very well come from better state dynamics.
Bigger models won't fix enterprise AI. Smarter architectures will.
Follow @zuzanna_pathway, CEO of Pathway, shaping the Post-Transformer shift toward continual learning and long-horizon reasoning.
Her article in Express Computer:
@Mithil27360 @probnstat The architecture discussions are just catching on. There's progress already but mainstream frontier labs have to double down on their Transformer-scale moat as @YesThisIsLion said. It'll be the neolabs which will ship breakthroughs in production soon enough.
One deep learning debate every AI researcher should care about: Transformers vs Post Transformers.
At the surface, it sounds like an architecture fight. Mathematically, it is about scaling laws, memory, online learning in frontier models, and hardware limits.
That is what made the recent debate interesting. It featured @lukaszkaiser, @adrian_pathway, @YesThisIsLion, and @mlech26l, hosted by @zuzanna_pathway.
Transformers won the last era because multi-head self-attention scales empirically and fits the hardware ecosystem extremely well. But the next bottleneck may be different.
Full self-attention costs O(n²) compute in sequence length n. Transformer LLMs do not natively have persistent long-term memory. RAG retrieves. Longer context conditions. Neither necessarily forms new reasoning patterns inside the model.
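A back-of-the-envelope illustration of the quadratic term. This is a counting sketch only: it ignores heads, layers, and constant factors, and both function names are made up for this example.

```python
def full_attention_pairs(seq_len):
    """Query-key scores computed when every token attends to every token."""
    return seq_len * seq_len

def causal_attention_pairs(seq_len):
    """Scores when each token attends only to itself and earlier tokens."""
    return seq_len * (seq_len + 1) // 2
```

Doubling the sequence length quadruples `full_attention_pairs`, and the causal count grows at the same quadratic rate with roughly half the constant. That is the "compute pressure" referred to above.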
That is why continual learning is becoming central, recently covered by @a16z.
The open questions:
– How can models learn after deployment without catastrophic forgetting?
– How can long term memory become part of the architecture?
– How can models reason over longer horizons without paying infinite context costs?
– How can hardware and AI architectures co-evolve more efficiently?
– And, are we chasing the right benchmarks with these goals in mind?
These questions were tackled head on, with counters from @lukaszkaiser, Transformer co-inventor and core contributor to ChatGPT and GPT models.
The image below summarizes some notes from the 80 minute debate.
Transformers unlocked massive productivity gains for startups and enterprises alike. But bigger models and more compute aren't solving the 95% failure rate in enterprise AI. The problem? An architecture with no memory.
Pathway's BDH is rethinking the foundation: building a post-transformer approach to enterprise AI on AWS that learns and adapts over time.
What comes after the Transformer?
Zuzanna Stamirowska puts the debate out in the open, with the very inventors of Transformer and Post-Transformer architectures!
Watch the 5-minute highlights. Follow @zuzanna_pathway and hit the bell, full fight drops tomorrow.
Transformer vs Post-Transformer: The 5-minute KO compilation is live now. 🥊
@lukaszkaiser (co-invented Transformer & co-created ChatGPT)
@adrian_pathway (invented BDH and is CSO of Pathway)
@mlech26l (co-invented LNNs & is CTO of Liquid AI)
@YesThisIsLion (co-invented Transformer with Łukasz, now CTO of Sakana AI)
Moderated by @dexhorthy (CEO, HumanLayer) and me.
Full debate drops soon. Turn on notifications to catch the complete fight. This is the ultimate source of truth on the subject.
@zuzanna_pathway @YesThisIsLion Well said! In startups too, we obsess over leading indicators and foundational shifts instead of just chasing short-term metric wins. Same logic applies here.
Understanding the Post-Transformer era involves resetting focus towards what will take us to the next frontier!
But the future of AI shouldn’t be declared in a press release.
It should be argued in public by the people actually building it.
The decision will rest with an audience full of AI builders.
May 5, 5PM - San Francisco
@shivcodesai @zuzanna_pathway (Not seen the agenda btw). But perhaps, why not?
Continual learning & memory are among the biggest bottlenecks in frontier AI given where we are today!