Robert M. Gower

@gowerrobert

Often found scribbling down math with intermittent bursts of bashing out code.

New York City, USA

Joined June 2011

348 Following

1.7K Followers

571 Posts

Pinned Tweet

Robert M. Gower @gowerrobert

over 1 year ago

Do you want to do a postdoc developing new methods and theory in optimization for deep learning/ML? Do you enjoy blue-sky open research and discussions at blackboards? Then apply to the Flatiron Fellowship in the Center for Computational Mathematics https://t.co/ydXX28xmAd 1/3


Robert M. Gower @gowerrobert

3 days ago

@HeMuyu0327 @SonglinYang4 This is interesting! Does the Ev table still have a learnable weight matrix? You mention it’s normalized somehow, how so? Thanks


Robert M. Gower @gowerrobert

4 days ago

@weijie444 Brilliant news, and very well deserved! I’ve become a big fan of your work and approach to research.


gowerrobert retweeted

Fabian Schaipp @FSchaipp

8 days ago

"It's easier to tune the LR for method A than for B." We tried to formalize this for model-based stochastic optimization methods. We find a key quantity, called stability index, that describes how stable a (weakly) convex bound is as a function of LR. 📚https://t.co/JIrG0gXqXL



Robert M. Gower @gowerrobert

20 days ago

@ruuustem_10 Yes good point! It still irks me that we don't fully understand non-Euclidean methods on quadratics. This is a must if we are to rely on smoothness assumptions to understand Muon


Robert M. Gower @gowerrobert

27 days ago

@Ji_Ha_Kim @YouJiacheng @noahamsel @ejarlebring What problem are you referring to? This example just shows that the optimal polynomial approximation to sign under the L2 norm does not satisfy the equioscillation theorem. The equioscillation theorem is about the L-infinity norm.
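A toy illustration of that point, not the example from the thread: approximate sign on [ell, 1] (where it equals 1) by the simplest odd polynomial p(x) = a*x. The interval endpoint `ell` and the helper name `endpoint_errors` are assumptions made for this sketch. Both optimal slopes have closed forms, and because the error a*x - 1 is linear in x, its extremes sit at the two endpoints.

```python
def endpoint_errors(a, ell):
    """Signed error of p(x) = a*x against the target 1 at x = ell and x = 1."""
    return a * ell - 1.0, a - 1.0


ell = 0.1  # assumed left endpoint, chosen only for illustration

# L-infinity optimum: balance the two endpoint errors, a*ell - 1 = -(a - 1).
a_linf = 2.0 / (1.0 + ell)

# L2 optimum: minimize the integral of (a*x - 1)^2 over [ell, 1],
# which gives a = (integral of x) / (integral of x^2).
a_l2 = 3.0 * (1.0 - ell**2) / (2.0 * (1.0 - ell**3))

linf_left, linf_right = endpoint_errors(a_linf, ell)
l2_left, l2_right = endpoint_errors(a_l2, ell)

print("L-infinity fit errors:", linf_left, linf_right)  # equal size, opposite sign
print("L2 fit errors:        ", l2_left, l2_right)      # unequal sizes
```

The L-infinity fit has errors of equal magnitude and alternating sign at the endpoints, which is the equioscillation pattern. The L2 fit trades a larger error near `ell` for a smaller one near 1, so it does not equioscillate.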


Robert M. Gower @gowerrobert

29 days ago

@elon_lit Nice work, this looks very interesting. Curiously, we showed that Adam explicitly tracks this same centered gradient variance, and this SNR threshold looks very similar to the square of the Adam update, see here https://t.co/H27RIg2nqK Does this mean Adam is tracking this noise?

Robert M. Gower @gowerrobert

12 months ago

When β_1=β_2, we can first rewrite Adam as below, where instead of the standard uncentered second moment, we have something that looks like a weird variance estimator. Fun fact, it is an online estimate of variance! Let me explain ...
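The rewritten update lives in the tweet's image, which is not reproduced here. As a hedged sketch of the idea, the standard exponential-moving-average identity below can be checked numerically: with a shared β, zero initialization, and no bias correction (all assumptions of this sketch), the uncentered second moment v_t equals m_t² plus a centered term that is updated online from (g_t - m_{t-1})². The function names are hypothetical.

```python
import random


def adam_moments(grads, beta):
    """Adam's two moving averages with beta_1 = beta_2 = beta."""
    m = v = 0.0
    history = []
    for g in grads:
        m = beta * m + (1 - beta) * g
        v = beta * v + (1 - beta) * g * g
        history.append((m, v))
    return history


def centered_moments(grads, beta):
    """Same mean, plus an online variance built from deviations to the old mean."""
    m = var = 0.0
    history = []
    for g in grads:
        # The variance update uses the mean from the previous step.
        var = beta * var + beta * (1 - beta) * (g - m) ** 2
        m = beta * m + (1 - beta) * g
        history.append((m, var))
    return history


random.seed(0)
grads = [random.gauss(1.0, 2.0) for _ in range(200)]  # synthetic scalar gradients
beta = 0.9

max_gap = max(
    abs(v - (m * m + var))
    for (m, v), (_, var) in zip(adam_moments(grads, beta), centered_moments(grads, beta))
)
print("largest deviation from v = m^2 + var:", max_gap)
```

The deviation stays at floating-point rounding level, so under these assumptions Adam's denominator can be read as a squared mean plus an online variance estimate.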


Robert M. Gower @gowerrobert

29 days ago

And now we are very proud and humbled to have received the ICLR 2026 Honorable Mention award for this work https://t.co/RgsVc3iv0G Very fun to have found this useful math nugget that can actually speed up LLM training.

Robert M. Gower @gowerrobert

12 months ago

Are you interested in the new Muon/Scion/Gluon method for training LLMs? To run Muon, you need to approximate the matrix sign (or polar factor) of the momentum matrix. We've developed an optimal method *The PolarExpress* just for this! If you're interested, climb aboard 1/x
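For readers unfamiliar with what "approximating the polar factor" involves, here is a minimal sketch using the classical cubic Newton-Schulz iteration X ← 1.5·X − 0.5·X·Xᵀ·X. This is the textbook baseline, not the optimal polynomial sequence of PolarExpress, and the helper names and the small test matrix are assumptions made for illustration. It uses plain Python lists so it runs without extra libraries.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_norm(A):
    return sum(x * x for row in A for x in row) ** 0.5


def newton_schulz_polar(M, steps=30):
    """Approximate the polar factor of a full-column-rank matrix M."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which lies inside the convergence region of the cubic iteration.
    norm = frobenius_norm(M)
    X = [[x / norm for x in row] for row in M]
    for _ in range(steps):
        XXtX = matmul(X, matmul(transpose(X), X))
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X


M = [[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]]  # hypothetical stand-in for a momentum matrix
U = newton_schulz_polar(M)
gram = matmul(transpose(U), U)  # should be close to the 2x2 identity
H = matmul(transpose(U), M)     # symmetric factor in M = U H
print(gram)
print(H)
```

Each step pushes every singular value toward 1 while leaving the singular vectors unchanged, so the iterate approaches a matrix with orthonormal columns. Replacing the fixed cubic with better-chosen polynomials per step is the kind of change the tweet describes.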


Robert M. Gower @gowerrobert

about 1 month ago

@_arohan_ @tonysilveti Same here 🙃 maybe we should chat


Robert M. Gower @gowerrobert

about 1 month ago

@_arohan_ @tonysilveti Let me save you some time. If you keep following this logic of a closed-form prox and a regularized secant equation, you get a new quasi-Newton method that works for non-convex problems. But it turns out this was already done here: https://t.co/6OKrCYAhyO


Robert M. Gower @gowerrobert

about 1 month ago

@tonysilveti I thought only criteria 1 and 2 were directly motivated by the secant equation. I don't see any such direct link between criterion 3 and the secant equation. In any case, it's simply E|| P dg - d\theta ||_{P^{-1}}^2.


Robert M. Gower @gowerrobert

about 1 month ago

Very happy that this has now been accepted to ICML 2026! Great, systematic work done by @CrichaelMawshaw

Robert M. Gower @gowerrobert

7 months ago

We've just finished some work on improving the sensitivity of Muon to the learning rate, and exploring a lot of design choices. If you want to see how we did this, follow me ... 1/x (Work led by the amazing @CrichaelMawshaw)


Robert M. Gower @gowerrobert

about 1 month ago

@jeffreycider @CV_novel_plume Using the optimal polynomials instead would improve exactly the iteration complexity, that is, it would require slightly fewer iterations to reach a desired loss


Robert M. Gower @gowerrobert

about 1 month ago

@CV_novel_plume I agree with this statement. Over-tuning an optimizer to one problem doesn't really teach us anything. This is also why I find the AlgoPerf benchmark interesting for comparing optimizers https://t.co/5ATa0kJxXL especially the self-tuning track


Robert M. Gower @gowerrobert

about 1 month ago

@FengzhuoZhang About their Hybrid Newton-Schulz in the v4 report, I understand they change the polynomials after 8 steps to ensure convergence. But it would converge even faster if they just used the *optimal* sequence of 10 polynomials, as we proposed here: https://t.co/PssM4mX0CH


Robert M. Gower @gowerrobert

about 1 month ago

@torchcompiled Nice idea! For the retraction map, you may want to try the optimal polynomials instead. For instance, you could just apply the first two optimal polynomials to correct the approximation. We had an ICLR paper on this: https://t.co/LezL38cPua


Robert M. Gower @gowerrobert

about 1 month ago

And now we got an Honorable Mention at ICLR 2026 for our work on Muon+PolarExpress!


Robert M. Gower @gowerrobert

3 months ago

@bozavlado @giffmana Yeah, this is mad, and it's always the same issue with ADMM methods applied in this way. Unless these copies of the model are distributed across different machines, it makes no sense!


Robert M. Gower @gowerrobert

4 months ago

@mher_safaryan @LancasterUni Brilliant news, congrats on your new position!

