Runa Eschenhagen @runame_ - Twitter Profile

Pinned Tweet

4 months ago

1/14 Is Muon “better” than Shampoo? We argue that their relationship parallels Adam's relationship with Signum. Analogous to @lukas_balles and Hennig’s (2018) decomposition of Adam into element-wise scaled Signum, we can decompose Shampoo as left- and right-adapted Muon. https://t.co/XoaDFainkd

3

264

45

256

32K
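For readers new to the Shampoo and Muon connection in the pinned tweet, here is a short worked identity. It is only the standard one-step special case, not the decomposition from the thread. It assumes a single gradient matrix G with reduced SVD G = U Σ V^T, no accumulation or momentum, and inverse roots taken as pseudo-inverses when G is rank-deficient. The symbols L and R (Shampoo's left and right factors) are notation introduced here for illustration.

```latex
\begin{align*}
G &= U \Sigma V^\top, \qquad U^\top U = V^\top V = I, \\
L &= G G^\top = U \Sigma^{2} U^\top, \qquad R = G^\top G = V \Sigma^{2} V^\top, \\
L^{-1/4} \, G \, R^{-1/4}
  &= \bigl(U \Sigma^{-1/2} U^\top\bigr) \bigl(U \Sigma V^\top\bigr) \bigl(V \Sigma^{-1/2} V^\top\bigr)
   = U V^\top .
\end{align*}
```

Under these assumptions the Shampoo-style update reduces to the orthogonalized gradient U V^T, which is the quantity Muon approximates for its momentum matrix. Once L and R are accumulated over many steps, the identity no longer holds exactly.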

runame_ retweeted

Bruno Mlodozeniec

@brunorganised

3 days ago

Awesome to see this work from Marin. One detail I’m especially pleased about: their final recipe adapts ideas from our Complete(d)P work, especially horizon and batch-size hyperparameter transfer, and they verify it scales to very large scales.

1

12

4

5

3K

runame_ retweeted

Bruno Mlodozeniec

@brunorganised

3 days ago

Guys, you need London training data. Trust me, the model won’t generalise. We clean on the left. Pls

0

2

1

0

215

runame_ retweeted

Aaron Defazio

@aaron_defazio

18 days ago

🚨 New Paper 🚨
ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models
A few modifications to Schedule-Free Learning make it completely LR tuning free, and allow it to greatly outperform schedules for long duration training!
https://t.co/LzjIIsOlG8

7

421

56

303

85K


runame_ retweeted

19 days ago

Ownership is the biggest factor in quality research: do the drivers of the work actually feel a stake in its success, a sense of responsibility and autonomy, resilience, and a desire to make a contribution that transcends a conference publication or an immediate reward signal?

5

128

9

40

13K

runame_ retweeted

Zakhar Shumaylov @Zakobian

19 days ago

Many believe that optimizers like Muon perform well because of their connection to spectral geometry. But this is not the case! In fact, replacing the spectrum of the update with random or even inverted singular values performs remarkably similar! https://t.co/qxGQBvEuml https://t.co/OQhvjRbVuR

4

154

11

138

13K

runame_ retweeted

typedfemale

@typedfemale

24 days ago

they just created a million muon variants

6

123

6

10K

runame_ retweeted

Thomas Pethick @tmpethick

23 days ago

1/ We introduce SODA: a simple optimizer wrapper that improves a base optimizer, adds no hyperparameters, and removes the need to tune weight decay. The wrapper provides consistent improvement. Most notably, SODA(Muon) beats Muon even when Muon gets a tuned weight decay sweep. https://t.co/15obqAAuYa

2

128

21

112

9K

runame_ retweeted

Zachary Nado @zacharynado

23 days ago

zacharynado's tweet photo. https://t.co/GaRoW82I45

1

94

6

12

11K

runame_ retweeted

Zachary Nado @zacharynado

24 days ago

the sloptimizer field is just getting started with shampoo and muon gen algorithms, the graveyard of adam variants got so bad you can't list them all on a page https://t.co/p9ZJB1e4L6

19

317

37

109

43K

Runa Eschenhagen @runame_

25 days ago

@bilaltwovec special guests 🤝

0

1

0

97

runame_ retweeted

Bruno Mlodozeniec

@brunorganised

about 1 month ago

I've been really impressed with Typst for a while, but today I discovered their document compiler is so fast someone built a 3d game that runs in the online IDE. You move by typing and the results are instantly rendered live in the document preview.

1

7

1

3

310

runame_ retweeted

Kit Fraser-Taliente @KitF_T

about 1 month ago

trained the first natural language autoencoder on gpt-2 almost a year ago, now we have one on mythos.🥲 do read the paper/play with the live demo! so excited it's finally out.

12

209

12

53

13K

runame_ retweeted

Lucas Nestler

@Clashluke

about 1 month ago

KL Shampoo and KL SOAP outperform their non-KL counterparts by learning the preconditioners compositionally, so that each stage corrects what remains after the last. Available in HeavyBall 3.1.1, with major PSGD stability backports. https://t.co/DbTJKXVYZV

3

129

18

102

31K

runame_ retweeted

Jamie Simon @learning_mech

about 1 month ago

1/ Deep learning is going to have a scientific theory. We can see the pieces starting to come together, and it's looking a lot like physics! We're releasing a paper pulling together these emerging threads and giving them a name: learning mechanics. 🔨 https://t.co/92nSIHameW 🔧 https://t.co/3cshMD33bl

53

2K

293

2K

304K

runame_ retweeted

Bruno Mlodozeniec

@brunorganised

about 2 months ago

It’s such a beautifully diverse, yet cohesive book. There is even a section with hot-takes on causality that still feels pertinent decades later https://t.co/6zKz66kYWZ

1

50

2

43

4K

runame_ retweeted

Emtiyaz Khan @EmtiyazKhan

about 2 months ago

If you want to work with our group while living in New-York and also spend time in Tokyo and Germany!, check out the new positions available at the NYU and Flatiron Institute by @cosmo_shirley post-doc: https://t.co/TvVTBiAiol research-scientist: https://t.co/ivqess9uKb

8

173

23

117

22K

runame_ retweeted

Bruno Mlodozeniec

@brunorganised

2 months ago

I'm surprised this flew under my radar, seems like a very cool paper
You can extend µP almost trivially to the Mixture of Expert setting if you keep the number of experts fixed. A proper handling of hyperparameters in the MoE setting through large number of expert asymptotics has seemed elusive, so it's awesome to see it done!

2

105

20

90

8K

Runa Eschenhagen @runame_

2 months ago

@breskanu We use grafting from Adam here, so the update’s scale is determined by Adam’s update scale.

1

0

85
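The grafting mentioned in the reply to @breskanu above can be sketched roughly as follows. This is a minimal illustration, assuming per-layer grafting on the L2 norm: the step keeps the direction of some base optimizer and takes its magnitude from Adam's step. The function names, the `eps` guard, and the example vectors are hypothetical, and the base optimizer is not named in the tweet.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def graft(direction_step, magnitude_step, eps=1e-12):
    """Rescale direction_step so that its norm matches magnitude_step.

    direction_step: step proposed by the base optimizer (direction only is kept).
    magnitude_step: step proposed by Adam (only its norm is used).
    eps: small constant guarding against division by zero (an assumption here).
    """
    scale = l2_norm(magnitude_step) / (l2_norm(direction_step) + eps)
    return [scale * d for d in direction_step]


# Made-up steps for one flattened layer, for illustration only.
base_step = [3.0, 4.0]   # direction source
adam_step = [0.0, 0.1]   # magnitude source
grafted = graft(base_step, adam_step)
```

Here `grafted` points along `base_step` but has (up to `eps`) the same norm as `adam_step`, which is the sense in which the update's scale is determined by Adam.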

runame_ retweeted

Juno KIM @junokim_ai

2 months ago

Excited to share our new paper on sharp capacity scaling of the Muon optimizer! Joint work with @EshaanNichani Denny Wu @albertobietti @jasondeanlee: https://t.co/v1k1B4mSkG (1/7)

4

125

31

71

21K

Runa Eschenhagen @runame_

2 months ago

GPA is as general as its name implies!

Hao-Jun Michael Shi @hjmshi

2 months ago

1/10 Are DiLoCo and Schedule-Free actually related? A brief history and unusually late advertisement for our work: Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs (see https://t.co/ESaSU8kwpx). https://t.co/6JXlucv3iC

1

37

12

21

10K

0

1

0

2

330
