Wu Lin @LinYorker - Twitter Profile

Pinned Tweet

almost 2 years ago

#ICML2024 Can We Remove the Square-Root in Adaptive Methods? https://t.co/hD604GmB0N
Root-free (RF) methods are better on CNNs and competitive on Transformers compared to root-based methods (AdamW).
Removing the root makes matrix methods faster: Root-free Shampoo in BFloat16 /1

Wu Lin @LinYorker

about 8 hours ago

@MarktHart125849 I mean this repo. https://t.co/2XWz4yYJgC

Wu Lin @LinYorker

about 17 hours ago

On one hand, it is essential to tune baseline methods well on a model. On the other hand, it may be better to avoid using a model/architecture that has been modified and optimized for a single method for 1.5 years.

Konstantin Mishchenko

@konstmish

1 day ago

I just submitted a PR to modded-nanogpt with better hyperparams. With them, Muon can reach the target loss after 3250 steps instead of 3325. Always tune your baseline well when doing research. Weak baselines can make any idea look promising.

Wu Lin @LinYorker

16 days ago

Some initial steps to make Shampoo and SOAP faster: https://t.co/wRGcI92w3W We are working on further improvements.

Wu Lin @LinYorker

about 1 month ago

We will make Shampoo/SOAP, including KL-Shampoo/KL-SOAP, faster. Our goal is to match Muon's runtime while maintaining Shampoo/SOAP's strong per-step performance. Stay tuned for new updates.

Lucas Nestler

@Clashluke

about 1 month ago

KL Shampoo and KL SOAP outperform their non-KL counterparts by learning the preconditioners compositionally, so that each stage corrects what remains after the last. Available in HeavyBall 3.1.1, with major PSGD stability backports.

Wu Lin @LinYorker

16 days ago

@wen_kaiyue You may want to have a look at this paper for further improvement https://t.co/wRGcI92w3W

Wu Lin @LinYorker

about 1 month ago

LinYorker's tweet photo. https://t.co/2GzefQaHlw

Wu Lin @LinYorker

about 1 month ago

@_arohan_ Muon

Wu Lin @LinYorker

about 1 month ago

Also, short-sided KL-Shampoo = short-sided Shampoo^2 = Muon

Wu Lin @LinYorker

about 2 months ago

@weijie444 Looks like a KFAC-based method with modern clipping? G(ZZ^T)^{-1} is known as the FOOF update https://t.co/Bgh2m79ZTK while msgn() can be interpreted as "generalized (preconditioned) gradient norm clipping" https://t.co/DPyGmHecz1

Weijie Su

@weijie444

about 2 months ago

We released "The Newton-Muon Optimizer". We show that Muon is secretly an implicit Newton method, and use this insight to build a better one. 1/n Paper: https://t.co/Ua54426bWB

Wu Lin @LinYorker

6 months ago

@MarkSchmidtUBC @EmtiyazKhan This work is a joint effort with Scott C. Lowe, @f_dangel, @runame_, Zikun Xu, and @RogerGrosse. Stay tuned for more updates coming soon.

Wu Lin @LinYorker

6 months ago

Within an information-geometric framework, we reconnect Shampoo/SOAP with both classical quasi-Newton ideas and Gaussian whitening, and develop practical methods that naturally handle tensor-valued weights in language model pre-training. https://t.co/PJ4AVxPgRC opt-ml workshop

Wu Lin @LinYorker

6 months ago

This work builds on my ICML 2019 paper (with @MarkSchmidtUBC and @EmtiyazKhan), extending a variational Bayes-based geometric framework to modern NN optimization. It can be used to design methods for Bayesian inference, numerical optimization, and gradient-free optimization.

LinYorker retweeted

Runa Eschenhagen @runame_

7 months ago

1/9 In practice, the Shampoo optimizer crucially relies on several heuristics. In our NeurIPS 2025 spotlight paper, we investigate the role of learning rate grafting and infrequent preconditioner updates in Shampoo by decomposing its preconditioner. https://t.co/TfI1gwMrFs
