Almost all animals sleep. Why don’t LMs?
Introducing our new work on language model sleep.
tl;dr: A periodic, recurrent “sleep” phase allows LMs to digest their context and transfer it into their weights, improving recall and reasoning on challenging tasks.
Introducing Wall Attention. Diagonal forget gates enable RoPE-free attention with exceptional length generalization.
Wall outperforms the dominant method RoPE and sophisticated data-dependent methods like Forgetting Attention (FoX). We trained models with Wall on 4k sequence length and they generalized without further training to 200k+ tokens.
Wall generalizes diagonal forget gates from linear RNNs (KDA, RWKV 7, GLA) to softmax attention through a principled induced action framework. It enables transformers to selectively remember or forget per-channel within the attention head, dramatically boosting expressivity.
Wall is production-ready. Wall retains the parallel structure of vanilla attention, is compatible with GQA & MLA, and we open-source reference Triton kernels for training and decoding. Our WallDecode kernel achieves SOTA-level decode throughput.
Continual learning over long contexts is fundamentally about selective forgetting, and Wall attention is all about selective forgetting.
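To make "per-channel forgetting inside softmax attention" concrete, here is a minimal sketch. It is not the Wall formulation, which this thread does not spell out. It assumes the decay pattern of gated linear RNNs (each channel of a key is scaled by the product of the forget gates that came after it) and simply applies that to the logits before the softmax. The names `decayed_scores`, `softmax`, and `gates` are made up for illustration.

```python
import math

def decayed_scores(q, keys, gates):
    """Logits of the LAST query against every key (illustrative assumption).

    gates[l][c] in (0, 1] is the forget gate for channel c at step l.
    Key j is scaled, per channel, by the product of gates over steps j+1..t-1.
    """
    t, dim = len(keys), len(q)
    decay = [1.0] * dim          # empty product for the most recent key
    scores = [0.0] * t
    for j in range(t - 1, -1, -1):
        scores[j] = sum(q[c] * decay[c] * keys[j][c] for c in range(dim))
        # Older keys also pass through the gate at step j.
        decay = [decay[c] * gates[j][c] for c in range(dim)]
    return scores

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Channel 0 never forgets, channel 1 halves at every step.
q = [1.0, 1.0]
keys = [[1.0, 1.0]] * 3
gates = [[1.0, 1.0], [1.0, 0.5], [1.0, 0.5]]
weights = softmax(decayed_scores(q, keys, gates))
```

With all gates set to 1 this reduces to plain dot-product attention. A single scalar gate per step would decay every channel equally, while the per-channel version lets one channel keep old information as another drops it.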
🚨This week's top AI/ML research papers:
- DiffusionBlocks
- A Bitter Lesson for Data Filtering
- Neural Weight Norm = Kolmogorov Complexity
- When Does LeJEPA Learn a World Model?
- Do Language Models Need Sleep?
- Parallax
- Gemini Embedding 2
- Qwen-VLA
- The MiniMax-M2 Series
- Looped Diffusion Language Models
- LocateAnything
- Learn from your own latents and not from tokens
overview for each + authors' explanations
read this in thread mode for the best experience
The Top AI Papers of the Week (May 24 - May 31)
- SkillOpt
- AutoScientists
- The Efficiency Frontier
- Language Models Need Sleep
- Adapting the Interface, Not the Model
- Forecasting Scientific Progress with AI
- Compiling Agentic Workflows into Weights
Read on for more:
1/7 Introducing Parallax → a stronger attention variant that achieves a Pareto improvement over vanilla attention at 0.6B and 1.7B scales.
Parallax has better perplexity, better downstream accuracy, and a decode kernel that matches or beats FlashAttention.
🧵
Long-context memory management is a big problem for LLMs. Our new paper shows that "sleep" (recurrent fast weight learning) helps models learn context in a way that improves reasoning and recall. Fun project with awesome collaborators @sang_yun_lee @SeanMcleish @tomgoldsteincs
// Language Models Need Sleep //
Let your agents "sleep", folks.
On a serious note, this is a fascinating paper on getting the most from long-horizon agents.
Here is the problem with agents today: Attention scales badly with context length, so long-horizon agents keep paying a quadratic tax at inference time.
This work proposes a sleep-like consolidation step instead. The model periodically does N offline recurrent passes over recent context, writes the result into persistent fast weights in its state-space blocks, then clears the KV cache.
The effect is that extra compute moves to sleep while wake-time prediction stays low latency. On cellular automata, multi-hop graph retrieval, and a math reasoning task where a plain transformer and SSM-attention hybrids fail, longer sleep durations improve performance, with the biggest gains on examples that need deeper reasoning.
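The wake/sleep loop described above can be sketched with a toy model. This is an illustration under heavy assumptions, not the paper's architecture: the "fast weights" here are a small associative matrix trained with a delta rule, and `ToySleeper`, `observe`, `read`, and `sleep` are hypothetical names. It only shows the control flow: cache context while awake, make N passes over it during sleep, write the result into persistent weights, then clear the cache.

```python
from dataclasses import dataclass, field

@dataclass
class ToySleeper:
    dim: int
    lr: float = 0.5
    fast_weights: list = field(default_factory=list)  # persistent dim x dim memory
    kv_cache: list = field(default_factory=list)      # wake-time (key, value) pairs

    def __post_init__(self):
        self.fast_weights = [[0.0] * self.dim for _ in range(self.dim)]

    def observe(self, key, value):
        """Wake phase: just append to the cache (cheap, low latency)."""
        self.kv_cache.append((key, value))

    def read(self, key):
        """Recall from fast weights only: returns W @ key."""
        return [sum(w * k for w, k in zip(row, key)) for row in self.fast_weights]

    def sleep(self, n_passes):
        """Sleep phase: N passes over the cache, then forget the raw tokens."""
        for _ in range(n_passes):
            for key, value in self.kv_cache:
                pred = self.read(key)
                for i in range(self.dim):
                    err = value[i] - pred[i]
                    for j in range(self.dim):
                        # Delta-rule update (an assumption, chosen for simplicity).
                        self.fast_weights[i][j] += self.lr * err * key[j]
        self.kv_cache.clear()

model = ToySleeper(dim=2)
model.observe([1.0, 0.0], [3.0, -1.0])
model.observe([0.0, 1.0], [0.5, 2.0])
model.sleep(n_passes=8)
recalled = model.read([1.0, 0.0])  # close to [3.0, -1.0], with an empty cache
```

In this toy, with orthonormal keys and `lr=0.5`, the recall error halves on every pass, so a longer sleep gives better recall after the cache is gone. The gains reported in the post come from the paper's own models and tasks, not from anything this sketch demonstrates.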
Why does it matter?
It points at an alternative to ever-larger KV caches for agents that run for a long time. Consolidate, then forget the raw tokens.
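A back-of-the-envelope sketch of the "quadratic tax": it counts query-key pairs for causal attention over one growing cache versus a cache that is cleared every `window` tokens. The function names are hypothetical, and the count deliberately ignores the compute spent during sleep, which the post says is moved offline rather than eliminated.

```python
def causal_pairs(tokens):
    """Query-key pairs for causal attention over `tokens` tokens."""
    return tokens * (tokens + 1) // 2

def pairs_with_periodic_clear(tokens, window):
    """Same count when the KV cache is cleared every `window` tokens."""
    full, rest = divmod(tokens, window)
    return full * causal_pairs(window) + causal_pairs(rest)

total = 200_000
print(causal_pairs(total))                      # grows quadratically in total
print(pairs_with_periodic_clear(total, 4_000))  # grows linearly in total
```

Doubling `total` roughly quadruples the first number but only doubles the second, which is why clearing the cache after consolidation keeps wake-time cost flat per token.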
Paper: https://t.co/FfDTbhl98M
1/
Most long-context models are actually terrible at reasoning over past data once it leaves their active attention cache.
Even if an SSM can store the tokens, a single forward pass lacks the computational depth to think about them.
But what if we let them sleep? 🧵
New paper! Have you or a loved one been harmed by a bad multiple-choice benchmark? 😔
You may be entitled to a more reliable evaluation 🩺
At #ACL2026, we'll present BenchMarker: a toolkit to diagnose common flaws in MCQA benchmarks, inspired by best practices in education 🧑‍🏫🧵
Earlier this month I successfully defended my PhD and graduated from UMD! 🐢 Thanks to everyone who played a large or small role in making these last 5 years an amazing experience. Next stop, Toronto!
Defense recording: https://t.co/d24HV6HsPV
Language Models Need Sleep
"Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache."
"increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning."
i'm delighted to share that CLRS-Text has been accepted in @DMLRJournal! 🎉🎉🎉
check out our latest results + scaling analysis at:
https://t.co/1NvSTxxx8Y
fantastic work by the whole team, but especially @re_rayne for putting it over the finish line!
🚨 New Paper! 🚨
One of my first Ph.D. papers found that LLMs can answer multiple-choice questions without seeing the question 🤔
At #ACL2026, I'm presenting a follow-up showing that current reasoning LLMs can still do this! And quite similarly to a clever test-taker 🧑‍🎓🧵
We’re training models wrong and it’s due to ChatGPT. Even the modern coding agents used daily still rely on message-based exchanges: they send messages to users, to themselves (CoT) and to tools, and receive messages in turn.
This bottlenecks even very intelligent agents to a single stream. The models cannot read while writing, cannot act while thinking and cannot think while processing information.
In our new paper, see below, we discuss LLMs with parallel streams. We show that multi-stream LLMs can …
🔵Be created by instruction-tuning for the stream format
🔵Simplify user and tool-use UX, removing many pain points with agents and chat models (such as having to interrupt the model to get a word in)
🔵Be fast: they predict and read tokens in all streams in parallel in each forward pass, improving latency
🔵Encode a separation of concerns more easily, improving security
🔵Provide a legible form of parallel/cont. reasoning through their internal streams. Even if the main CoT stream is accidentally pressured or too focused on a particular task to voice concerns, other internal streams can subvocalize concerns that would otherwise not be verbalized.
Does this sound related to a recent thinky post :) - Yes, but I don’t feel so bad about being outshipped with such a cool report on their side by 23 hours. I’ll link a 2nd thread below with a more direct comparison. I actually think both are complementary in interesting ways.
Distillation (especially on-policy) has become a pivotal component of the post-training stack.
☕ To dramatically accelerate distillation at scale, we open-source Nitrobrew, a communication-efficient, fused strategy for logit distillation. It’s built for both on- and off-policy distillation with:
100x faster loss computation
50% peak memory savings
3x faster on-policy distillation
and more!
A 🧵 (1/8)
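For readers new to the term, here is a minimal sketch of the loss that logit distillation usually optimizes: the KL divergence from the teacher's next-token distribution to the student's, computed from raw logits. It is the plain textbook form for a single token position and says nothing about how Nitrobrew fuses or distributes the computation. `log_softmax`, `distill_kl`, and the `temperature` parameter are illustrative names, not that library's API.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def distill_kl(teacher_logits, student_logits, temperature=1.0):
    """Forward KL(teacher || student) for one token position."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

loss = distill_kl([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # positive: the two disagree
```

At real vocabulary sizes the two log-softmax vectors dominate memory, which is presumably why a fused implementation can save so much. That reading is an inference from the numbers in the post, not a description of the implementation.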
@hayden_prairie @eliebakouch I usually see the same in terms of loss, but normally the accuracy for the looped models is better on tasks like math and code