Mudit Srivastava

4 days ago

200K+ views and counting! The Transformer vs Post-Transformer debate, convened by @pathway_com Ft @lukaszkaiser, @adrian_pathway, @YesThisIsLion, @mlech26l, @dexhorthy, and me. Watch on YouTube: https://t.co/Yh2pbH1l7C Follow along for the next one.

muditjps retweeted

Rohan Paul

@rohanpaul_ai

7 days ago

This is probably the most entertaining way to understand one of AI’s hardest debates. Transformer vs Post-Transformer, argued by leading researchers, inside a real physical boxing ring. Both technically deep and genuinely entertaining. I was glued for the entire 1 hour 20 minutes. So many super cool points to learn.

🥊 Transformers

- Transformers still own the present because they work at scale. They are simple, trainable, hardware-friendly, and already power the strongest AI systems we use today.
- The Transformer is basically a memory machine. It stores information as keys and values, then uses attention to pull back the most useful parts when answering.
- The real Transformer advantage is not just “attention.” The bigger advantage is that it fits modern hardware extremely well, so it can process huge batches of tokens fast.
- Scaling is still the brutal rule. If you give Transformers more compute, more data, and more parameters, they usually keep getting better. Any Post-Transformer architecture has to scale just as well, or better.
- It is not enough to look clever on small tests, because the real question is whether it improves faster than Transformers when scaled up.
- A replacement cannot be slightly better. Because the whole AI stack is already built around Transformers, the next architecture may need to be around 10x better to force everyone to switch.
- Transformers are powerful, but they may be brute force. A human does not need to read the entire internet many times to become smart, but current LLMs need enormous data and compute.

🥊 Post-Transformer

- Post-Transformer people are not saying Transformers are bad. They are saying Transformers may be the best current tool, not the final form of machine intelligence.
- The biggest Post-Transformer target is native reasoning and continual learning. Today’s LLM reasoning often feels like text-based step-by-step work added on top, instead of thinking happening naturally inside the model.
- Latent reasoning is one possible next step. That means the model reasons inside its own hidden internal space, instead of writing every thought out as words.
- Continual learning is still a major weakness. Humans keep learning from experience, but most Transformer-based models are trained, frozen, and then only adapt inside the prompt.
- Long context is not the same as real memory. A model can read a huge prompt, but that is different from building a life history, learning from mistakes, and updating beliefs over time.
- The future may be hybrid, not a clean replacement. Transformers may stay as one building block while newer systems add better memory, better reasoning, and better learning loops.
- The most interesting possibility is that Transformers may help discover their own successor. AI agents are already getting better at research and coding, so the next architecture may come from AI-assisted architecture search.

-------

- Benchmarks are a problem. Many public benchmarks are easy to game, so they may show leaderboard strength without proving deeper intelligence.
- Perplexity is still probably a great metric to evaluate frontier models, because it tests prediction quality.

---

Overall, Transformers continue to dominate, but the frontier is clearly widening. Pathway’s BDH (Dragon Hatchling, a brain-inspired reasoning architecture), Sakana AI’s CTMs (Continuous Thought Machines, models that think over time), and Liquid AI’s LFMs (Liquid Foundation Models, efficient multimodal foundation models) all show how the frontier is expanding.

---

From “Pathway (pathway[.]com)” YouTube channel (link in comment) @zuzanna_pathway
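The perplexity point above can be made concrete with a small sketch. It assumes the standard definition (the exponential of the mean negative log-probability a model assigns to the true next tokens); the function name and the sample probabilities are hypothetical and are not taken from the debate.

```python
import math


def perplexity(true_token_probs):
    """Perplexity = exp(mean negative log-probability of the true tokens).

    `true_token_probs` holds the probability the model gave to each token
    that actually occurred. Lower perplexity means better prediction.
    """
    if not true_token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities two models assign to the same four true tokens.
sharper_model = [0.5, 0.5, 0.5, 0.5]
vaguer_model = [0.25, 0.25, 0.25, 0.25]

print(perplexity(sharper_model))
print(perplexity(vaguer_model))
```

A model that gives probability 0.5 to every true token has perplexity 2, and one that gives 0.25 has perplexity 4, so the number can be read roughly as "how many equally likely options the model was choosing between".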


Probability and Statistics

7 days ago

@vincent_koc @openclaw @steipete @nvidia @Microsoft This is great news! Congratulations @vincent_koc! 😀


muditjps retweeted

@probnstat

9 days ago


Last week’s Post-Transformer debate post raised one question: Can long term memory become part of the architecture?

It points to one promising mathematical idea behind Post Transformer AI: Linear attention in high dimension with persistent state.

In a standard Transformer, memory is handled through caching context.

The model keeps previous keys and values in small dimension d, then attends over them. But this is still token history.

BDH (Dragon Hatchling), one of the Post-Transformer architectures, takes a different route.

The paper describes BDH's state space as fixed and large, with the macro interpretation of associative memory, like KV cache, but organized differently.

Each layer has a persistent state matrix: ρₗ ∈ Rⁿˣᵈ

Here:

n = neuronal or concept dimension
d = low rank synaptic dimension
d << n

The key idea is that state is aligned to neurons, in a high-dimensional space (n on the order of billions).

A Transformer stores token history, whereas BDH-GPU (a tensor-friendly version of the BDH architecture) evolves state, similar to State-Space Models.

This is where the brain analogy becomes useful. The brain does not append every experience into a longer transcript. It has a large bounded substrate of neurons and synapses, where experience changes connections sparsely and with high parallelism.

BDH-GPU expresses a related idea computationally:

not memory as a longer context window,
but memory as a large, evolving internal state.

Why it matters:

– no Transformer-style hard context window, practically enabling an infinite context window in a reasoning model
– linear attention in a large neuronal dimension
– sparse positive activations
– persistent state instead of only token history

The deeper insight:

Long horizon reasoning may not come from storing more tokens.
It may very well come from better state dynamics.
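The contrast between "token history" and "evolving state" can be sketched in a few lines. This is a toy illustration of generic linear attention with a fixed n × d state (write: S += k vᵀ, read: Sᵀ q), assuming sparse non-negative keys produced by a ReLU. It is not BDH's actual update rule, and the class and function names are hypothetical.

```python
class KVCache:
    """Transformer-style memory: one (key, value) pair is appended per token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def write(self, key, value):
        self.keys.append(list(key))
        self.values.append(list(value))

    def num_floats(self):
        # Storage grows with the number of tokens seen so far.
        return sum(len(k) for k in self.keys) + sum(len(v) for v in self.values)


class PersistentState:
    """Fixed n x d state matrix, updated in place by each token."""

    def __init__(self, n, d):
        self.n, self.d = n, d
        self.state = [[0.0] * d for _ in range(n)]

    def write(self, key, value):
        # Rank-one update S += outer(key, value); only active (positive)
        # key entries touch the state, so sparse keys give sparse updates.
        for i, k_i in enumerate(key):
            if k_i > 0.0:
                row = self.state[i]
                for j, v_j in enumerate(value):
                    row[j] += k_i * v_j

    def read(self, query):
        # Linear-attention read: out = S^T query (no softmax over history).
        out = [0.0] * self.d
        for i, q_i in enumerate(query):
            if q_i > 0.0:
                for j in range(self.d):
                    out[j] += q_i * self.state[i][j]
        return out

    def num_floats(self):
        # Storage is n * d no matter how many tokens were written.
        return self.n * self.d


def relu(xs):
    return [max(0.0, x) for x in xs]


def run_demo(num_tokens=100, n=4, d=2):
    """Feed the same toy tokens to both memories and report their sizes."""
    cache, memory = KVCache(), PersistentState(n, d)
    for t in range(num_tokens):
        key = relu([1.0 if i == t % n else -1.0 for i in range(n)])
        value = [float(t), 1.0]
        cache.write(key, value)
        memory.write(key, value)
    return cache.num_floats(), memory.num_floats()


print(run_demo(100))
```

With the default toy sizes, 100 tokens leave the cache holding 600 numbers while the state still holds 8. The state's size depends only on n and d, which is the sense in which there is no hard context window, though a bounded state must overwrite or blend old information rather than keep every token.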


muditjps retweeted

Pathway (www.pathway.com) @pathway_com

13 days ago


Bigger models won't fix enterprise AI. Smarter architectures will.

Follow @zuzanna_pathway, CEO of Pathway, shaping the Post-Transformer shift toward continual learning and long-horizon reasoning.

Her article in Express Computer:


Probability and Statistics

15 days ago

@Mithil27360 @probnstat The architecture discussions are just catching on. There's progress already but mainstream frontier labs have to double down on their Transformer-scale moat as @YesThisIsLion said. It'll be the neolabs which will ship breakthroughs in production soon enough.

muditjps retweeted

@probnstat

15 days ago


One deep learning debate every AI researcher should care about: Transformers vs Post Transformers.

At the surface, it sounds like an architecture fight. Mathematically, it is about scaling laws, memory, online learning in frontier models, and hardware limits.

That is what made the recent debate interesting. It featured @lukaszkaiser, @adrian_pathway, @YesThisIsLion, and @mlech26l, hosted by @zuzanna_pathway.

Transformers won the last era because multi head self attention scales empirically and fits the hardware ecosystem extremely well. But the next bottleneck may be different.

Full self attention has O(n²) compute pressure with sequence length. Transformer LLMs do not natively have persistent long-term memory. RAG retrieves. Longer context conditions. Neither necessarily forms new reasoning patterns inside the model.

That is why continual learning is becoming central, recently covered by @a16z.

The open questions:

– How can models learn after deployment without catastrophic forgetting?
– How can long term memory become part of the architecture?
– How can models reason over longer horizons without paying infinite context costs?
– How can hardware and AI architectures co-evolve more efficiently?
– And, are we chasing the right benchmarks with these goals in mind?

These questions were tackled head on, with counters from @lukaszkaiser, Transformer co-inventor and core contributor to ChatGPT and GPT models.

The image below summarizes some notes from the 80 minute debate.
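The O(n²) remark can be checked with simple counting. This is a back-of-the-envelope sketch that counts only how many query-key pairs get scored, ignoring head dimension, batching, and kernel-level optimizations; the function names are hypothetical.

```python
def full_attention_pairs(seq_len):
    """Query-key pairs scored when every token attends to every token."""
    return seq_len * seq_len


def causal_attention_pairs(seq_len):
    """Pairs under a causal mask: token t attends to tokens 1..t."""
    return seq_len * (seq_len + 1) // 2


def recurrent_state_updates(seq_len):
    """A fixed-size recurrent state is updated once per token."""
    return seq_len


for seq_len in (1_000, 10_000, 100_000):
    print(
        seq_len,
        full_attention_pairs(seq_len),
        causal_attention_pairs(seq_len),
        recurrent_state_updates(seq_len),
    )
```

Making the sequence 10x longer multiplies the full-attention pair count by 100, while the per-token state update count grows only 10x. Causal masking roughly halves the pair count but remains quadratic.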

15 days ago

@selfawareatom @VinayakGavariya Wohoo, nice one @VinayakGavariya 🔥😃

muditjps retweeted

AWS Startups

@AWSstartups

17 days ago

Transformers unlocked massive productivity gains for startups and enterprises alike. But bigger models and more compute aren't solving the 95% failure rate in enterprise AI. The problem? An architecture with no memory. Pathway's BDH is rethinking the foundation, building a post-transformer approach to enterprise AI on AWS that learns and adapts over time.

@zuzanna_pathway @lukaszkaiser @adrian_pathway @YesThisIsLion @mlech26l

17 days ago


muditjps retweeted

Pathway (www.pathway.com) @pathway_com

18 days ago

What comes after the Transformer? Zuzanna Stamirowska puts the debate out in the open, with the very inventors of Transformer and Post-Transformer architectures! Watch the 5-minute highlights. Follow @zuzanna_pathway and hit the bell, full fight drops tomorrow.


18 days ago

@zuzanna_pathway @lukaszkaiser @adrian_pathway @mlech26l @YesThisIsLion Can't wait for the full video!

muditjps retweeted

18 days ago

Transformer vs Post-Transformer: The 5-minute KO compilation is live now. 🥊

@lukaszkaiser (co-invented Transformer & co-created ChatGPT)
@adrian_pathway (invented BDH and is CSO of Pathway)
@mlech26l (co-invented LNNs & is CTO of Liquid AI)
@YesThisIsLion (co-invented Transformer with Łukasz, now CTO of Sakana AI)

Moderated by @dexhorthy (CEO, HumanLayer) and me.

Full debate drops soon. Turn on notifications to catch the complete fight. This is the ultimate source of truth on the subject.


18 days ago

@vivoplt @karpathy @fchollet @ylecun @AndrewYNg @rasbt @dair_ai @lilianweng @jeremyphoward @simonw @_akhaliq @ID_AA_Carmack @gwern @goodside @drfeifei @demishassabis @zuzanna_pathway, she's our CEO (Pathway) and a co-author of the BDH architecture. She mostly focuses on Post-Transformer developments and interactions with frontier model makers.


25 days ago

@zuzanna_pathway @YesThisIsLion Well said! In startups too, we obsess over leading indicators and foundational shifts instead of just chasing short-term metric wins. Same logic applies here. Understanding the Post-Transformer era involves resetting focus towards what will take us to the next frontier!

30 days ago

@zuzanna_pathway It doesn't get better than this! 🔥

about 1 month ago

@zuzanna_pathway @pathway_com @lukaszkaiser @adrian_pathway @mlech26l @YesThisIsLion @dexhorthy Nothing short of legendary! 🔥

muditjps retweeted

about 1 month ago

Tonight in SF – @pathway_com brings the inventors to settle it in the ring. 🥊 Transformer vs Post-Transformer: The Deciding Round. Featuring: @lukaszkaiser, @adrian_pathway, @mlech26l, @YesThisIsLion, @dexhorthy and myself!


muditjps retweeted

about 1 month ago

But the future of AI shouldn’t be declared in a press release. It should be argued in public by the people actually building it. The decision will rest with an audience full of AI builders. May 5, 5PM - San Francisco
