Pathway (www.pathway.com)

2 days ago

200K+ views and counting! The Transformer vs Post-Transformer debate, convened by @pathway_com Ft @lukaszkaiser, @adrian_pathway, @YesThisIsLion, @mlech26l, @dexhorthy, and me. Watch on YouTube: https://t.co/Yh2pbH1l7C Follow along for the next one.

2 days ago

@zuzanna_pathway @lukaszkaiser @adrian_pathway @YesThisIsLion @mlech26l @dexhorthy First of many! 🐉🔥

pathway_com retweeted

6 days ago

@zuzanna_pathway

“We have not yet had a PageRank moment for intelligence.”

We’ve got so many comments and questions about this statement delivered by @adrian_pathway during our recent Transformer vs Post-Transformer debate with @lukaszkaiser @YesThisIsLion @mlech26l - thanks!

Let’s dig into it. In the 1990s, web search already existed. We could index information. AltaVista existed. The web was growing fast.

Then PageRank happened.

That moment combined three things:
1. A simple but deep mathematical idea: treat the web as a giant graph and compute a stationary distribution of a *random walk* on that *graph*
2. A scalable implementation: large-scale graph computation on huge clusters
3. A company that integrated and scaled the idea end-to-end: Google

That combination gave search a much clearer center. It stopped being just a pile of heuristics and started to look more like: here is the mathematical object we need to compute, now let’s build the systems needed to compute it well.
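The mathematical idea in point 1 can be sketched in a few lines. This is a minimal illustration, not Google's implementation: the damping factor of 0.85, the uniform handling of pages without outgoing links, the function name `pagerank`, and the four-page `toy_web` graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for the stationary distribution of a damped random walk.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank sitting on pages with no outgoing links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                # The walker follows each outgoing link with equal probability.
                new_rank[q] += damping * rank[p] / len(outs)
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical four-page web: "c" is linked from everywhere, "d" from nowhere.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

The ranks sum to 1 because they form a probability distribution over pages: the long-run share of time a random surfer spends on each one.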

Adrian asked Lukasz Kaiser directly whether he sees a PageRank-like idea inside the Transformer. Lukasz said no.

For intelligence, we still do not have that kind of unifying operator or process. We do not yet have an agreed mathematical object that says: this is the core computation behind it.

That missing unifier is what Adrian meant by the absent “PageRank moment for intelligence.”

That is also the main idea behind our work on BDH, our Post-Transformer architecture. We are after that fundamental “platform discovery” for intelligence.

The full Transformer vs Post-Transformer debate is a good place to go deeper on these topics. Link below.

6 days ago

@rohanpaul_ai Thanks for sharing your key observations from the debate, @rohanpaul_ai! We're glad you found it useful!

6 days ago

Here's a great starting point for you to understand the Transformer vs Post Transformer Debate convened by @zuzanna_pathway! Credits @rohanpaul_ai.

Rohan Paul

@rohanpaul_ai

6 days ago

This is probably the most entertaining way to understand one of AI’s hardest debates. Transformer vs Post-Transformer, argued by leading researchers, inside a real physical boxing ring. Both technically deep and genuinely entertaining. I was glued for the entire 1 hour 20 minutes. So many super cool points to learn.

🥊 Transformers

- Transformers still own the present because they work at scale. They are simple, trainable, hardware-friendly, and already power the strongest AI systems we use today.
- The Transformer is basically a memory machine. It stores information as keys and values, then uses attention to pull back the most useful parts when answering.
- The real Transformer advantage is not just “attention.” The bigger advantage is that it fits modern hardware extremely well, so it can process huge batches of tokens fast.
- Scaling is still the brutal rule. If you give Transformers more compute, more data, and more parameters, they usually keep getting better. Any Post-Transformer architecture has to scale just as well, or better.
- It is not enough to look clever on small tests, because the real question is whether it improves faster than Transformers when scaled up.
- A replacement cannot be slightly better. Because the whole AI stack is already built around Transformers, the next architecture may need to be around 10x better to force everyone to switch.
- Transformers are powerful, but they may be brute force. A human does not need to read the entire internet many times to become smart, but current LLMs need enormous data and compute.

🥊 Post-Transformer

- Post-Transformer people are not saying Transformers are bad. They are saying Transformers may be the best current tool, not the final form of machine intelligence.
- The biggest Post-Transformer target is native reasoning and continual learning. Today’s LLM reasoning often feels like text-based step-by-step work added on top, instead of thinking happening naturally inside the model.
- Latent reasoning is one possible next step. That means the model reasons inside its own hidden internal space, instead of writing every thought out as words.
- Continual learning is still a major weakness. Humans keep learning from experience, but most Transformer-based models are trained, frozen, and then only adapt inside the prompt.
- Long context is not the same as real memory. A model can read a huge prompt, but that is different from building a life history, learning from mistakes, and updating beliefs over time.
- The future may be hybrid, not a clean replacement. Transformers may stay as one building block while newer systems add better memory, better reasoning, and better learning loops.
- The most interesting possibility is that Transformers may help discover their own successor. AI agents are already getting better at research and coding, so the next architecture may come from AI-assisted architecture search.

-------

- Benchmarks are a problem. Many public benchmarks are easy to game, so they may show leaderboard strength without proving deeper intelligence.
- Perplexity is still probably a great metric for evaluating frontier models, because it tests prediction quality.

---

Overall, Transformers continue to dominate, but the frontier is clearly widening. Pathway’s BDH (Dragon Hatchling, a brain-inspired reasoning architecture), Sakana AI’s CTMs (Continuous Thought Machines, models that think over time), and Liquid AI’s LFMs (Liquid Foundation Models, efficient multimodal foundation models) all show how the frontier is expanding.

---

From the “Pathway (pathway[.]com)” YouTube channel (link in comment) @zuzanna_pathway
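For readers who want the perplexity point made concrete: perplexity is the exponential of the average negative log-probability a model assigns to the tokens that actually occurred. The sketch below uses made-up per-token probabilities; the function name `perplexity` and both probability lists are illustrative assumptions, not figures from the debate.

```python
import math


def perplexity(token_probs):
    """exp(mean negative log-probability) over the observed tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities a model gave to the correct next token.
confident = perplexity([0.9, 0.8, 0.95, 0.85])
# Guessing uniformly among 4 options at every step.
uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])
print(confident, uniform_guess)
```

A model that guesses uniformly among k options has perplexity k, so lower values mean the model is, on average, choosing among fewer plausible next tokens.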

Probability and Statistics

7 days ago

@probnstat Thanks for sharing, @probnstat! This is quite nicely put. It was one of the threads briefly covered in the Post-Transformer debate with the inventors behind these architectures. Worth a watch: https://t.co/dqUtsCYeN9

pathway_com retweeted

@probnstat

7 days ago

Last week’s Post-Transformer debate post raised one question: Can long-term memory become part of the architecture?

It points to one promising mathematical idea behind Post Transformer AI: Linear attention in high dimension with persistent state.

In a standard Transformer, memory is handled through caching context.

The model keeps previous keys and values in small dimension d, then attends over them. But this is still token history.
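The key/value caching described here can be sketched as follows. This is a generic single-head, single-query illustration of scaled dot-product attention over a growing cache, under the assumption of tiny hand-picked vectors; `attend`, `key_cache`, and `value_cache` are names invented for the example.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over cached keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return output, weights


# The cache grows by one (key, value) pair per token: this is the token history.
key_cache, value_cache = [], []
for key, value in [([4.0, 0.0], [1.0, 0.0]),
                   ([0.0, 4.0], [0.0, 1.0]),
                   ([-4.0, 0.0], [0.5, 0.5])]:
    key_cache.append(key)
    value_cache.append(value)

# A query aligned with the first key mostly retrieves the first value.
output, weights = attend([4.0, 0.0], key_cache, value_cache)
print(output, weights)
```

Memory here scales with the number of tokens seen, since every past key and value stays in the cache.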

BDH (Dragon Hatchling) – one of the Post-Transformer architectures, takes a different route.

The paper describes BDH's state space as fixed and large, with the macro interpretation of associative memory, like KV cache, but organized differently.

Each layer has a persistent state matrix: ρₗ ∈ Rⁿˣᵈ

Here:

n = neuronal or concept dimension
d = low rank synaptic dimension
d << n

The key idea is that state is aligned to neurons, in high-dimensional space (n on the order of billions).

A Transformer stores token history, whereas BDH-GPU (a tensor-friendly version of the BDH architecture) evolves state, similar to State-Space Models.

This is where the brain analogy becomes useful. The brain does not append every experience into a longer transcript. It has a large bounded substrate of neurons and synapses, where experience changes connections sparsely and with high parallelism.

BDH-GPU expresses a related idea computationally:

not memory as a longer context window,
but memory as a large, evolving internal state.
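To illustrate what "memory as an evolving internal state" can mean, here is a minimal sketch of generic linear attention with a fixed-size n × d state. It is not the BDH or BDH-GPU update rule, which this thread does not spell out; the outer-product write, the one-hot sparse keys, the tiny sizes n = 8 and d = 2, and all function names are assumptions chosen for illustration.

```python
def one_hot(index, size):
    """A maximally sparse, non-negative activation vector."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def write(state, key, value):
    """Add the outer product key x value into the state; its shape never changes."""
    for i, k in enumerate(key):
        if k != 0.0:  # sparse keys touch only a few rows
            for j, v in enumerate(value):
                state[i][j] += k * v


def read(state, query):
    """Linear-attention read: a query-weighted sum of state rows."""
    out = [0.0] * len(state[0])
    for i, q in enumerate(query):
        if q != 0.0:
            for j, s in enumerate(state[i]):
                out[j] += q * s
    return out


n, d = 8, 2  # toy sizes; the thread describes n as very large and d << n
state = [[0.0] * d for _ in range(n)]
write(state, one_hot(0, n), [0.5, -1.0])
write(state, one_hot(3, n), [2.0, 3.0])
recalled = read(state, one_hot(3, n))

# Stream many more tokens: the state is updated in place and does not grow.
for _ in range(1000):
    write(state, one_hot(5, n), [0.001, 0.0])
state_size = sum(len(row) for row in state)
print(recalled, state_size)
```

Unlike a key/value cache, the storage cost is n × d numbers however many tokens have been processed; the trade-off is that writes with overlapping keys superimpose instead of being kept separately.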

Why it matters:

– no Transformer-style hard context window, practically enabling an infinite context window in a reasoning model
– linear attention in a large neuronal dimension
– sparse positive activations
– persistent state instead of only token history

The deeper insight:

Long horizon reasoning may not come from storing more tokens.
It may very well come from better state dynamics.

pathway_com retweeted

15 days ago

The full Transformer vs Post-Transformer debate is live. 80 minutes. Seven rounds. No slides. Real disagreement.

@lukaszkaiser came to defend the Transformer. @adrian_pathway, @YesThisIsLion, and @mlech26l made the case for what comes next.

00:00 Contenders enter the ring
06:30 Lukasz Kaiser defends the Transformer
10:08 Adrian Kosowski on BDH and the PageRank Moment for AI
17:35 Llion Jones: Why Transformers aren't the final architecture
29:50 Mathias Lechner on Liquid AI’s approach, Fast Weights, and Self-Replacing AI
40:28 Reasoning Beyond Language
44:15 Scaling Laws: Transformer vs Post Transformer
50:31 Benchmarks, Coding Models, and Perplexity
1:04:00 Continual Learning and Dynamic Weights

This is the ultimate source of truth on the subject.

12 days ago

@zuzanna_pathway @AWSstartups Reasoning ➕ Long-term memory = your context, compounding.

Probability and Statistics

13 days ago

@probnstat Appreciate the writeup, @probnstat! The 80 minutes flew by because every one of those open questions deserves its own deep dive. Well, the Post-Transformer era is here, and we'll have lots of opportunities in the near future.

pathway_com retweeted

@probnstat

13 days ago

One deep learning debate every AI researcher should care about: Transformers vs Post-Transformers.

On the surface, it sounds like an architecture fight. Mathematically, it is about scaling laws, memory, online learning in frontier models, and hardware limits.

That is what made the recent debate interesting. It featured @lukaszkaiser, @adrian_pathway, @YesThisIsLion, and @mlech26l, hosted by @zuzanna_pathway.

Transformers won the last era because multi-head self-attention scales empirically and fits the hardware ecosystem extremely well. But the next bottleneck may be different.

Full self-attention has O(n²) compute cost in the sequence length n. Transformer LLMs do not natively have persistent long-term memory. RAG retrieves. Longer context conditions. Neither necessarily forms new reasoning patterns inside the model.
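A quick back-of-the-envelope sketch of the O(n²) point, counting only query-key score computations and ignoring head count, hidden size, and any sparse or windowed attention variants. The function names and the sample sequence lengths are illustrative assumptions.

```python
def full_score_count(seq_len):
    """Full self-attention scores every token against every token: n * n."""
    return seq_len * seq_len


def causal_score_count(seq_len):
    """With a causal mask each token sees itself and earlier tokens: n(n+1)/2."""
    return seq_len * (seq_len + 1) // 2


for n in (1_000, 2_000, 4_000):
    print(n, full_score_count(n), causal_score_count(n))
```

Doubling the sequence length roughly quadruples the number of scores in both cases, which is the pressure the thread refers to.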

That is why continual learning is becoming central, recently covered by @a16z.

The open questions:

– How can models learn after deployment without catastrophic forgetting?
– How can long term memory become part of the architecture?
– How can models reason over longer horizons without paying infinite context costs?
– How can hardware and AI architectures co-evolve more efficiently?
– And, are we chasing the right benchmarks with these goals in mind?

These questions were tackled head on, with counters from @lukaszkaiser, Transformer co-inventor and core contributor to ChatGPT and GPT models.

The image below summarizes some notes from the 80 minute debate.

pathway_com retweeted

NYU Tandon @nyutandon

13 days ago

Computer Science & Engineering Department Chair Martín Farach-Colton hit the red carpet alongside his mentee @zuzanna_pathway, whose company @pathway_com was recognized as one of @FastCompany's Most Innovative Companies.

#NYUTandonMade https://t.co/PtPPAZFzMX

pathway_com retweeted

Sakana AI

@SakanaAILabs

14 days ago

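Llion Jones (@YesThisIsLion), co-founder and CTO of Sakana AI, recently took the stage at the “Transformers vs Post-Transformers” debate held in San Francisco.

The event brought together four researchers, including co-authors of the original paper, to discuss the Transformer, the architecture driving today's AI field. They split into a pro-Transformer side and a “Post-Transformer” side that looks to what comes next, armed with continual learning and reasoning in latent space, and debated in depth which of the two will shape the future of AI.

Although Llion is a co-author of the original Transformer paper and fully acknowledges the usefulness of today's Transformers, he deliberately took the Post-Transformer side and argued for the potential of the architectures that lie beyond.

Llion's analysis was that the current success of the Transformer owes less to the structure itself than to “brute-force compute,” made possible because the architecture adapted well to hardware that excels at parallel processing (GPUs/TPUs). He stressed the importance of exploring, in parallel, architectures built on entirely different premises.

He also urged the research community to free itself from the constraints of existing benchmarks and current hardware. “The next breakthrough architecture may at first be slower and less accurate than the Transformer. We should nonetheless explore systems built on entirely different premises without fear,” he said, calling for a change in research attitude itself.

Alongside its Transformer-based research and development, Sakana AI is also pursuing next-generation architectures. One example is the Continuous Thought Machine (CTM), a new architecture modeled on the biological brain, in which Llion himself is involved.

We sincerely thank the organizers for providing such a stimulating forum, and all of the speakers.

You can watch the debate here:
https://t.co/k2cWjAkO8w
🐟 @zuzanna_pathway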

pathway_com retweeted

Llion Jones @YesThisIsLion

15 days ago

This was so much fun!!

pathway_com retweeted

dex

@dexhorthy

15 days ago

never has the ai research world encountered so much whimsy. great hanging with @zuzanna_pathway @adrian_pathway @YesThisIsLion @mlech26l @lukaszkaiser and learning about what comes after transformers https://t.co/JqWPyF3KFk

15 days ago

The conversation that anchored last week's debate is now public. Not blog posts. Not Twitter threads. Four researchers who wrote the foundational papers, making the case for what shapes the Post-Transformer era. 📍Transformer vs Post-Transformer: The Deciding Round | May 2026 | San Francisco

pathway_com retweeted

AWS Startups

@AWSstartups

15 days ago

Transformers unlocked massive productivity gains for startups and enterprises alike. But bigger models and more compute aren't solving the 95% failure rate in enterprise AI. The problem? An architecture with no memory. Pathway's BDH is rethinking the foundation, building a post-transformer approach to enterprise AI on AWS that learns and adapts over time.
