@oraclekev Great pick, @oraclekev! You might also like Łukasz Kaiser in this debate with his peer inventors of Transformer and Post-Transformer architectures:
https://t.co/dqUtsCYeN9
This is probably the most entertaining way to understand one of AI’s hardest debates.
Transformer vs Post-Transformer, argued by leading researchers, inside a real physical boxing ring.
Both technically deep and genuinely entertaining.
I was glued for the entire 1 hour 20 minutes. So many super cool points to learn.
🥊 Transformers
- Transformers still own the present because they work at scale. They are simple, trainable, hardware-friendly, and already power the strongest AI systems we use today.
- The Transformer is basically a memory machine. It stores information as keys and values, then uses attention to pull back the most useful parts when answering.
- The real Transformer advantage is not just “attention.” The bigger advantage is that it fits modern hardware extremely well, so it can process huge batches of tokens fast.
- Scaling is still the brutal rule. If you give Transformers more compute, more data, and more parameters, they usually keep getting better. Any Post-Transformer architecture has to scale just as well, or better.
- It is not enough to look clever on small tests, because the real question is whether it improves faster than Transformers when scaled up.
- A replacement cannot be slightly better. Because the whole AI stack is already built around Transformers, the next architecture may need to be around 10x better to force everyone to switch.
- Transformers are powerful, but they may be brute force. A human does not need to read the entire internet many times to become smart, but current LLMs need enormous data and compute.
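The "memory machine" point above can be made concrete with a minimal sketch of one attention read. This is a toy single-query version of standard scaled dot-product attention; the function name `attention_read` and all the numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def attention_read(query, keys, values):
    """Soft lookup: weight each stored value by how well its key matches the query."""
    d = keys.shape[1]
    scores = keys @ query / np.sqrt(d)        # one similarity score per stored token
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ values, weights

# Toy memory of three tokens (hypothetical keys and values).
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([4.0, 0.0])

output, weights = attention_read(query, keys, values)
print(weights)  # tokens 0 and 2 match the query equally; token 1 gets little weight
print(output)
```

The stored keys and values are never overwritten here: every new token adds one more row, which is why this kind of memory grows with the length of the context.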
🥊 Post-Transformer
- Post-Transformer people are not saying Transformers are bad. They are saying Transformers may be the best current tool, not the final form of machine intelligence.
- The biggest Post-Transformer target is native reasoning and continual learning. Today’s LLM reasoning often feels like text-based step-by-step work added on top, instead of thinking happening naturally inside the model.
- Latent reasoning is one possible next step. That means the model reasons inside its own hidden internal space, instead of writing every thought out as words.
- Continual learning is still a major weakness. Humans keep learning from experience, but most Transformer-based models are trained, frozen, and then only adapt inside the prompt.
- Long context is not the same as real memory. A model can read a huge prompt, but that is different from building a life history, learning from mistakes, and updating beliefs over time.
- The future may be hybrid, not a clean replacement. Transformers may stay as 1 building block while newer systems add better memory, better reasoning, and better learning loops.
- The most interesting possibility is that Transformers may help discover their own successor. AI agents are already getting better at research and coding, so the next architecture may come from AI-assisted architecture search.
-------
- Benchmarks are a problem. Many public benchmarks are easy to game, so they may show leaderboard strength without proving deeper intelligence.
- Perplexity is probably still a great metric for evaluating frontier models, because it tests prediction quality.
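For readers who have not computed perplexity by hand, here is a minimal sketch using its standard definition: the exponential of the average negative log-probability the model assigned to the tokens that actually occurred. The probabilities below are made up for illustration, not taken from any model.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical per-token probabilities assigned to the correct next token.
confident = perplexity([0.5, 0.5, 0.5, 0.5])
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
print(confident, uniform)  # lower is better
```

A model that always gives the right token probability 1/k has perplexity k, so the number can be read as "how many options the model is effectively choosing between."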
---
Overall, Transformers continue to dominate, but the frontier is clearly widening.
Pathway’s BDH (Dragon Hatchling — brain-inspired reasoning architecture), Sakana AI’s CTMs (Continuous Thought Machines — models that think over time), and Liquid AI’s LFMs (Liquid Foundation Models — efficient multimodal foundation models) - all of these show how the frontier is expanding.
---
From “Pathway (pathway[.]com)” YouTube channel (link in comment)
@zuzanna_pathway
“We have not yet had a PageRank moment for intelligence.”
We’ve had so many comments and questions about this statement, delivered by @adrian_pathway during our recent Transformer vs Post-Transformer debate with @lukaszkaiser, @YesThisIsLion, and @mlech26l - thanks!
Let’s dig into it. In the 1990s, web search already existed. We could index information. AltaVista existed. The web was growing fast.
Then PageRank happened.
That moment combined three things:
1. A simple but deep mathematical idea: treat the web as a giant graph and compute a stationary distribution of a *random walk* on that *graph*
2. A scalable implementation: large-scale graph computation on huge clusters
3. A company that integrated and scaled the idea end-to-end: Google
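The first ingredient, the stationary distribution of a random walk on a graph, fits in a few lines. This is a minimal power-iteration sketch, not Google's production system; the damping factor 0.85 is the commonly quoted value, and the three-page graph is a hypothetical example.

```python
import numpy as np

def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=1000):
    """Stationary distribution of a damped random walk, found by power iteration.

    adjacency[i, j] = 1 means page i links to page j.
    """
    n = adjacency.shape[0]
    out_degree = adjacency.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix; pages with no out-links jump uniformly.
    transition = np.where(out_degree > 0,
                          adjacency / np.maximum(out_degree, 1),
                          1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = damping * (rank @ transition) + (1 - damping) / n
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return new_rank

# Hypothetical web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
links = np.array([[0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])
ranks = pagerank(links)
print(ranks)  # page 2, with two inbound links, ranks highest
```

The whole ranking reduces to one well-defined object (a fixed point of a linear map), which is what made it possible to then throw engineering at computing it at scale.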
That combination gave search a much clearer center. It stopped being just a pile of heuristics and started to look more like: here is the mathematical object we need to compute, now let’s build the systems needed to compute it well.
Adrian asked Lukasz Kaiser directly whether he sees a PageRank-like idea inside the Transformer. Lukasz said no.
For intelligence, we still do not have that kind of unifying operator or process. We do not yet have an agreed mathematical object that says: this is the core computation behind it.
That missing unifier is what Adrian meant by the absent “PageRank moment for intelligence.”
That is also the main idea behind our work on BDH, our Post-Transformer architecture. We are after that fundamental “platform discovery” for intelligence.
The full Transformer vs Post-Transformer debate is a good place to go deeper on these topics. Link below.
@probnstat Thanks for sharing, @probnstat! This is quite nicely put. It was one of the threads briefly covered in the Post-Transformer debate with the inventors behind these architectures. Worth a watch:
https://t.co/dqUtsCYeN9
Last week’s Post-Transformer debate post raised one question: Can long term memory become part of the architecture?
It points to one promising mathematical idea behind Post Transformer AI: Linear attention in high dimension with persistent state.
In a standard Transformer, memory is handled through caching context.
The model keeps previous keys and values in small dimension d, then attends over them. But this is still token history.
BDH (Dragon Hatchling), one of the Post-Transformer architectures, takes a different route.
The paper describes BDH's state space as fixed and large, with the macro interpretation of associative memory, like KV cache, but organized differently.
Each layer has a persistent state matrix: ρₗ ∈ Rⁿˣᵈ
Here:
n = neuronal or concept dimension
d = low rank synaptic dimension
d << n
The key idea is that state is aligned to neurons, in a high-dimensional space (n on the order of billions).
A Transformer stores token history, whereas BDH-GPU (a tensor-friendly version of the BDH architecture) evolves state, similar to State-Space Models.
This is where the brain analogy becomes useful. The brain does not append every experience into a longer transcript. It has a large bounded substrate of neurons and synapses, where experience changes connections sparsely and with high parallelism.
BDH-GPU expresses a related idea computationally:
not memory as a longer context window,
but memory as a large, evolving internal state.
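To make the contrast concrete, here is a minimal sketch of generic (unnormalized) linear attention with a fixed-size state, next to a growing token cache. It is not BDH's actual update rule, which is defined in the paper; the toy sizes, the random features, and the helper `sparse_positive` are all assumptions made for illustration. Only the state shape (n × d, with d << n) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 8   # toy sizes; n is the large "neuron" dimension, d << n

def sparse_positive(x):
    """ReLU: activations are non-negative and roughly half are zeroed out."""
    return np.maximum(x, 0.0)

state = np.zeros((n, d))   # persistent state: same shape for any sequence length
kv_cache = []              # Transformer-style memory: one entry per token

for t in range(100):
    key = sparse_positive(rng.standard_normal(n))   # hypothetical per-token features
    value = rng.standard_normal(d)
    state += np.outer(key, value)    # write: rank-1 update folded into the state
    kv_cache.append((key, value))    # write: append to the token history

query = sparse_positive(rng.standard_normal(n))
read_from_state = query @ state                               # cost O(n * d), independent of t
read_from_cache = sum((query @ k) * v for k, v in kv_cache)   # cost O(t * n)

print(state.shape, len(kv_cache))
print(np.allclose(read_from_state, read_from_cache))
```

Without a softmax, both reads give the same answer, but the state stays the same size however many tokens have been seen, while the cache keeps growing. Softmax attention cannot be folded into a fixed state this way, which is the trade-off linear-attention designs accept.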
Why it matters:
– no Transformer-style hard context window, practically enabling an infinite context window in a reasoning model
– linear attention in a large neuronal dimension
– sparse positive activations
– persistent state instead of only token history
The deeper insight:
Long horizon reasoning may not come from storing more tokens.
It may very well come from better state dynamics.
The full Transformer vs Post-Transformer debate is live.
80 minutes. Seven rounds. No slides. Real disagreement.
@lukaszkaiser came to defend the Transformer. @adrian_pathway, @YesThisIsLion, and @mlech26l made the case for what comes next.
00:00 Contenders enter the ring
06:30 Lukasz Kaiser defends the Transformer
10:08 Adrian Kosowski on BDH and the PageRank Moment for AI
17:35 Llion Jones: Why Transformers aren't the final architecture
29:50 Mathias Lechner on Liquid AI’s approach, Fast Weights, and Self-Replacing AI
40:28 Reasoning Beyond Language
44:15 Scaling Laws: Transformer vs Post Transformer
50:31 Benchmarks, Coding Models, and Perplexity
1:04:00 Continual Learning and Dynamic Weights
This is the ultimate source of truth on the subject.
@probnstat Appreciate the writeup, @probnstat! The 80 minutes flew by because every one of those open questions deserves its own deep dive. Well, the Post-Transformer era is here, and we'll have lots of opportunities for that in the near future.
One deep learning debate every AI researcher should care about: Transformers vs Post Transformers.
At the surface, it sounds like an architecture fight. Mathematically, it is about scaling laws, memory, online learning in frontier models, and hardware limits.
That is what made the recent debate interesting. It featured @lukaszkaiser, @adrian_pathway, @YesThisIsLion, and @mlech26l, hosted by @zuzanna_pathway.
Transformers won the last era because multi head self attention scales empirically and fits the hardware ecosystem extremely well. But the next bottleneck may be different.
Full self-attention costs O(n²) compute in sequence length n. Transformer LLMs do not natively have persistent long-term memory. RAG retrieves. Longer context conditions. Neither necessarily forms new reasoning patterns inside the model.
That is why continual learning is becoming central, recently covered by @a16z.
The open questions:
– How can models learn after deployment without catastrophic forgetting?
– How can long term memory become part of the architecture?
– How can models reason over longer horizons without paying infinite context costs?
– How can hardware and AI architectures co-evolve more efficiently?
– And, are we chasing the right benchmarks with these goals in mind?
These questions were tackled head on, with counters from @lukaszkaiser, Transformer co-inventor and core contributor to ChatGPT and GPT models.
The image below summarizes some notes from the 80 minute debate.
Computer Science & Engineering Department Chair Martín Farach-Colton hit the red carpet alongside his mentee @zuzanna_pathway, whose company @pathway_com was recognized as one of @FastCompany's Most Innovative Companies.
#NYUTandonMade
The conversation that anchored last week's debate is now public.
Not blog posts. Not Twitter threads.
Four researchers who wrote the foundational papers, making the case for what shapes the Post-Transformer era.
📍Transformer vs Post-Transformer: The Deciding Round | May 2026 | San Francisco
Transformers unlocked massive productivity gains for startups and enterprises alike. But bigger models and more compute aren't solving the 95% failure rate in enterprise AI. The problem? An architecture with no memory.
Pathway's BDH is rethinking the foundation, building a post-transformer approach to enterprise AI on AWS that learns and adapts over time.