🚀 Can RLVR models find their own frontier?
In our #ICML paper, we prove that mixed-difficulty RL can induce an implicit curriculum: easier tasks become learnable first, then pull harder tasks into reach.
(1/n)
Large Language Models (LLMs) exhibit “slash patterns” in attention maps — a key mechanism behind prefilling acceleration.
We take a first step toward understanding why they emerge.
Main findings:
▶️ Slash patterns are OOD-generalizable.
▶️ Queries and keys on these heads are near rank-one and carry little contextual information.
▶️ RoPE is the primary source of the slash pattern.
Blog link:
https://t.co/uhE3y7i5xW
A thread 🧵
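A toy sketch of why the last two findings fit together (not the paper's analysis). It assumes an idealized head that uses the same query vector `q` and the same key vector `k` at every position, which is a stronger condition than "near rank-one", and it skips the causal mask and softmax. Under those assumptions RoPE makes each logit depend only on the offset between positions, so the map is constant along diagonals. The names `rope_rotate` and `rope_logits` are hypothetical helpers written for this illustration.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate each 2-D pair of x by an angle proportional to pos (standard RoPE)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per 2-D pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def rope_logits(q, k, n):
    """Pre-softmax logits when every position shares the same q and k (assumption)."""
    Q = np.stack([rope_rotate(q, i) for i in range(n)])
    K = np.stack([rope_rotate(k, j) for j in range(n)])
    return Q @ K.T

rng = np.random.default_rng(0)
d, n = 16, 12
q = rng.standard_normal(d)
k = rng.standard_normal(d)
logits = rope_logits(q, k, n)

# Shifting both indices by one leaves every entry unchanged,
# i.e. the map is constant along each diagonal ("slash").
is_slash = np.allclose(logits[1:, 1:], logits[:-1, :-1])
print("constant along diagonals:", is_slash)
```

With position-dependent, context-rich queries and keys this check would generally fail, which is consistent with the finding that slash heads carry little contextual information.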
This new work shows how RoPE induces slash patterns in attention that are tied to in-context learning, supported by both empirical and theoretical analysis. Very cool work!
🔗 https://t.co/Dm22i3D6YR
Excited to share our recent work! We provide a mechanistic understanding of long CoT reasoning in state-tracking: when transformers length-generalize strongly, when they stall, and how recursive self-training pushes the boundary. 🧵 (1/8)
Why does Muon outperform Adam—and how?
🚀 Answer: Muon Outperforms Adam in Tail-End Associative Memory Learning
Three Key Findings:
> Associative memory parameters are the main beneficiaries of Muon, compared to Adam.
> Muon yields more isotropic weights than Adam.
> In heavy-tailed tasks, Muon significantly improves tail-class learning compared to Adam.
Paper Link:
https://t.co/cStSwWDdPE
A thread 🧵
🚨 Excited to share our theoretical exploration of the in-context learning dynamics of the one-layer transformer!
We introduce new techniques to analyze how softmax drives attention weights to converge globally through distinct training phases.
🔍: https://t.co/JejFL6IAeG
Joint work w/ @YuanC233 & Yingbin Liang