Yuan Cheng @YuanC233 - Twitter Profile

YuanC233 retweeted

about 22 hours ago

🚀 Can RLVR models find their own frontier? In our #ICML paper, we prove that mixed-difficulty RL can induce an implicit curriculum: easier tasks become learnable first, then pull harder tasks into reach. (1/n)

Yuan Cheng @YuanC233

6 months ago

Excited to share our new work!

Fengzhuo Zhang

@FengzhuoZhang

6 months ago

Large Language Models (LLMs) exhibit “slash patterns” in attention maps, a key mechanism behind prefilling acceleration. We take a first step toward understanding why they emerge.

Main findings:
▶️ Slash patterns are OOD-generalizable.
▶️ Queries and keys on these heads are near rank-one and carry little contextual information.
▶️ RoPE is the primary source of the slash pattern.

Blog link: https://t.co/uhE3y7i5xW

A thread 🧵

YuanC233 retweeted

Francesco Bertolotti @f14bertolotti

6 months ago

This new work shows how RoPE induces slash patterns in attention that are tied to in-context learning, supported by both empirical and theoretical analysis. Very cool work! 🔗 https://t.co/Dm22i3D6YR

YuanC233 retweeted

Yu Huang

@yuhuang42

8 months ago

Excited to share our recent work! We provide a mechanistic understanding of long CoT reasoning in state-tracking: when transformers length-generalize strongly, when they stall, and how recursive self-training pushes the boundary. 🧵(1/8)

YuanC233 retweeted

Fengzhuo Zhang

@FengzhuoZhang

9 months ago

Why does Muon outperform Adam, and how?

🚀 Answer: Muon Outperforms Adam in Tail-End Associative Memory Learning

Three key findings:
> Associative memory parameters are the main beneficiaries of Muon, compared to Adam.
> Muon yields more isotropic weights than Adam.
> In heavy-tailed tasks, Muon significantly improves tail-class learning compared to Adam.

Paper link: https://t.co/cStSwWDdPE

A thread 🧵

YuanC233 retweeted

Yu Huang

@yuhuang42

11 months ago

New theoretical results on training multi-head transformers for multi-step reasoning!

Yuan Cheng @YuanC233

over 2 years ago

Excited about our new work. Thanks to my collaborators for their efforts!

Yu Huang

@yuhuang42

over 2 years ago

🚨 Excited to share our theoretical exploration of the in-context learning dynamics of the one-layer transformer! Introduced new techniques to analyze how softmax drives attention weights to converge globally via different training phases.
🔍: https://t.co/JejFL6IAeG

Joint work w/ @YuanC233 & Yingbin Liang
