Sean McLeish @SeanMcleish - Twitter Profile

Pinned Tweet

8 days ago

Offline recurrence can improve inference accuracy by iteratively refining fast weights, giving the model an adaptation mechanism at test time.

Sangyun Lee

@sang_yun_lee

8 days ago

Almost all animals sleep. Why don’t LMs? Introducing our new work on language model sleep. tl;dr: A periodic, recurrent “sleep” phase allows LMs to digest their context and transfer it into their weights, improving recall and reasoning on challenging tasks.

38

1K

113

996

119K

0

6

1

2

456

SeanMcleish retweeted

Tilde

@tilderesearch

3 days ago

Introducing Wall Attention. Diagonal forget gates enable RoPE-free attention with exceptional length generalization.

Wall outperforms the dominant method RoPE and sophisticated data-dependent methods like Forgetting Attention (FoX). We trained models with Wall on 4k sequence length and they generalized without further training to 200k+ tokens.

Wall generalizes diagonal forget gates from linear RNNs (KDA, RWKV 7, GLA) to softmax attention through a principled induced action framework. It enables transformers to selectively remember or forget per-channel within the attention head, dramatically boosting expressivity.

Wall is production-ready. Wall retains the parallel structure of vanilla attention, is compatible with GQA & MLA, and we open-source reference Triton kernels for training and decoding. Our WallDecode kernel achieves SOTA-level decode throughput.

Continual learning over long-context is fundamentally about selective forgetting → and Wall attention is all about selective forgetting.

5

224

33

144

25K

SeanMcleish retweeted

Tilde

@tilderesearch

3 days ago

https://t.co/rmTk8GMkir

7

360

41

357

87K

SeanMcleish retweeted

The AI Timeline

@TheAITimeline

4 days ago

🚨This week's top AI/ML research papers:

- DiffusionBlocks
- A Bitter Lesson for Data Filtering
- Neural Weight Norm = Kolmogorov Complexity
- When Does LeJEPA Learn a World Model?
- Do Language Models Need Sleep?
- Parallax
- Gemini Embedding 2
- Qwen-VLA
- The MiniMax-M2 Series
- Looped Diffusion Language Models
- LocateAnything
- Learn from your own latents and not from tokens

overview for each + authors' explanations
read this in thread mode for the best experience

6

254

47

189

13K

SeanMcleish retweeted

DAIR.AI

@dair_ai

5 days ago

The Top AI Papers of the Week (May 24 - May 31)

- SkillOpt
- AutoScientists
- The Efficiency Frontier
- Language Models Need Sleep
- Adapting the Interface, Not the Model
- Forecasting Scientific Progress with AI
- Compiling Agentic Workflows into Weights

Read on for more:

17

472

81

461

66K

SeanMcleish retweeted

Tilde

@tilderesearch

7 days ago

~1/7~Introducing Parallax → a stronger attention variant that achieves a Pareto improvement over vanilla attention at 0.6B and 1.7B scales.

Parallax has better perplexity, better downstream accuracy, and a decode kernel that matches or beats FlashAttention.

🧵 https://t.co/9MOf9QpTrl

7

509

63

422

90K

SeanMcleish retweeted

Giulia Fanti @giuliacfanti

7 days ago

Long-context memory management is a big problem for LLMs. Our new paper shows that "sleep" (recurrent fast weight learning) helps models learn context in a way that improves reasoning and recall. Fun project with awesome collaborators @sang_yun_lee @SeanMcleish @tomgoldsteincs

1

22

4

18

5K

SeanMcleish retweeted

DAIR.AI

@dair_ai

10 days ago

// Language Models Need Sleep //

Let your agents "sleep", folks.

On a serious note, this is a fascinating paper on getting the most from long-horizon agents.

Here is the problem with agents today: Attention scales badly with context length, so long-horizon agents keep paying a quadratic tax at inference time.

This work proposes a sleep-like consolidation step instead. The model periodically does N offline recurrent passes over recent context, writes the result into persistent fast weights in its state-space blocks, then clears the KV cache.

The effect is that extra compute moves to sleep while wake-time prediction stays low latency. On cellular automata, multi-hop graph retrieval, and a math reasoning task where a plain transformer and SSM-attention hybrids fail, longer sleep durations improve performance, with the biggest gains on examples that need deeper reasoning.

Why does it matter?

It points at an alternative to ever-larger KV caches for agents that run for a long time. Consolidate, then forget the raw tokens.

Paper: https://t.co/FfDTbhl98M

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

10

241

45

185

84K
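The consolidation step summarized in the tweet above (N offline passes over recent context, write into fast weights, clear the KV cache) can be illustrated with a toy. This is only a sketch under loud assumptions: the fast weights here are a single linear associative memory, the write rule is a plain delta rule swapped in for illustration, and every name (`sleep`, `kv_cache`, `fast_weights`, `run`) is hypothetical. It is not the paper's architecture or training procedure.

```python
import random


def make_pairs(num_pairs, dim, seed=0):
    """Build a toy 'recent context': unit-norm keys paired with random values."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        key = [rng.gauss(0, 1) for _ in range(dim)]
        norm = sum(x * x for x in key) ** 0.5
        key = [x / norm for x in key]
        value = [rng.gauss(0, 1) for _ in range(dim)]
        pairs.append((key, value))
    return pairs


def read(fast_weights, key):
    """Wake-time read: one matrix-vector product, no cache lookup."""
    return [sum(w * k for w, k in zip(row, key)) for row in fast_weights]


def sleep(fast_weights, kv_cache, num_passes, lr=0.5):
    """Assumed stand-in for sleep: repeat delta-rule writes, then drop the cache."""
    for _ in range(num_passes):
        for key, value in kv_cache:
            pred = read(fast_weights, key)
            for i, row in enumerate(fast_weights):
                err = value[i] - pred[i]
                for j in range(len(row)):
                    row[j] += lr * err * key[j]
    kv_cache.clear()  # raw tokens are forgotten after consolidation


def recall_error(fast_weights, pairs):
    """Mean squared error when recalling each value from its key."""
    total = 0.0
    for key, value in pairs:
        pred = read(fast_weights, key)
        total += sum((v - p) ** 2 for v, p in zip(value, pred))
    return total / len(pairs)


def run(num_passes, dim=16, num_pairs=8):
    """Consolidate a fresh context for num_passes and report recall error."""
    pairs = make_pairs(num_pairs, dim)
    kv_cache = list(pairs)
    fast_weights = [[0.0] * dim for _ in range(dim)]
    sleep(fast_weights, kv_cache, num_passes)
    return recall_error(fast_weights, pairs), len(kv_cache)
```

Calling `run(1)` and `run(50)` compares a short sleep with a long one on the same toy context. In this setup the longer sleep leaves a smaller recall error, and the cache is empty afterwards in both cases. That only mirrors the qualitative claim in the tweet and says nothing about the paper's actual results.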

SeanMcleish retweeted

elvis

@omarsar0

10 days ago

Language models need "sleep"

10

77

9

61

15K

SeanMcleish retweeted

Grigory Sapunov

@che_shr_cat

9 days ago

1/
Most long-context models are actually terrible at reasoning over past data once it leaves their active attention cache.

Even if an SSM can store the tokens, a single forward pass lacks the computational depth to think about them.

But what if we let them sleep? 🧵 https://t.co/6FoLplj4Ex

1

37

5

21

2K

SeanMcleish retweeted

Nishant Balepur @NishantBalepur

9 days ago

New paper! Have you or a loved one been harmed by a bad multiple-choice benchmark? 😔

You may be entitled to a more reliable evaluation 🩺

At #ACL2026, we'll present BenchMarker: a toolkit to diagnose common flaws in MCQA benchmarks, inspired by best practices in education 🧑‍🏫🧵 https://t.co/pNxlAQsdi9

2

51

8

7

3K

SeanMcleish retweeted

himanshu

@himanshustwts

10 days ago

very cool research (and nomenclature)

18

861

79

505

55K

SeanMcleish retweeted

John Kirchenbauer @jwkirchenbauer

11 days ago

Earlier this month I successfully defended my PhD and graduated from UMD! 🐢 Thanks to everyone who played a large or small role in making these last 5 years an amazing experience. Next stop, Toronto!

Defense recording: https://t.co/d24HV6HsPV https://t.co/2Z3NKAgJPu

5

57

7

2

3K

SeanMcleish retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

10 days ago

Language Models Need Sleep

"Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache."

"increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning."

32

909

146

713

66K

SeanMcleish retweeted

Petar Veličković

@PetarV_93

18 days ago

i'm delighted to share that CLRS-Text has been accepted in @DMLRJournal! 🎉🎉🎉 check out our latest results + scaling analysis at: https://t.co/1NvSTxxx8Y fantastic work by the whole team, but especially @re_rayne for putting it over the finish line!

0

38

2

12

6K

SeanMcleish retweeted

Nishant Balepur @NishantBalepur

18 days ago

🚨 New Paper! 🚨

One of my first Ph.D. papers found that LLMs can answer multiple-choice questions without seeing the question 🤔

At #ACL2026, I'm presenting a follow-up showing that current reasoning LLMs can still do this! And quite similarly to a clever test-taker 🧑‍🎓🧵 https://t.co/X25UnlSJY2

50

2K

111

826

1M

SeanMcleish retweeted

Jonas Geiping

@jonasgeiping

23 days ago

We’re training models wrong and it’s due to ChatGPT.

Even the modern coding agents used daily still use message-based exchanges: They send messages to users, to themselves (CoT) and to tools, and receive messages in turn. This bottlenecks even very intelligent agents to a single stream. The models cannot read while writing, cannot act while thinking and cannot think while processing information.

In our new paper, see below, we discuss LLMs with parallel streams. We show that multi-stream LLMs can …

🔵 Be created by instruction-tuning for the stream format
🔵 Simplify user and tool use UX, removing many pain points with agents and chat models (such as having to interrupt the model to get a word in)
🔵 Multi-Stream LLMs are fast, they can predict+read tokens in all streams in parallel in each forward pass, improving latency
🔵 LLMs with multiple streams have an easier time encoding a separation of concerns, improving security
🔵 LLMs with many internal streams provide a legible form of parallel/cont. reasoning. Even if the main CoT stream is accidentally pressured or too focused on a particular task to voice concerns, other internal streams can subvocalize concerns that would otherwise not be verbalized.

Does this sound related to a recent thinky post :) - Yes, but I don’t feel so bad about being outshipped with such a cool report on their side by 23 hours. I’ll link a 2nd thread below with a more direct comparison. I actually think both are complementary in interesting ways.

42

1K

168

1K

156K

SeanMcleish retweeted

Tilde

@tilderesearch

about 1 month ago

Distillation (especially on-policy) has become a pivotal component of the post-training stack.

☕ To dramatically accelerate distillation at scale, we open-source Nitrobrew, a communication-efficient, fused strategy for logit distillation. It’s built for both on- and off-policy distillation with:

100x faster loss computation
50% peak memory savings
3x faster on-policy distillation
and more!

A 🧵 (1/8)

6

283

40

235

28K

SeanMcleish retweeted

Hugh Blayney @HughBlayney

about 2 months ago

1/8 Taking a look inside looped language models! We've released a new preprint on looped LLMs, an exciting new direction for scaling test-time compute. 🔁 Thanks to wonderful collaborators @arroyo_alvr @johanobandoc @pcastr @AaronCourville @mmbronstein @epomqo 🧵

1

200

26

178

24K

Sean McLeish

@SeanMcleish

about 2 months ago

@hayden_prairie @eliebakouch I usually see the same in terms of loss, but normally the accuracy for the looped models is better on tasks like math and code

0

1

0

39
