Do you want to do a postdoc developing new methods/theory in optimization for deep learning/ML? Do you enjoy blue-sky open research and discussions at the blackboard? Then apply to the Flatiron Fellowship in the Center for Computational Mathematics https://t.co/ydXX28xmAd 1/3
@HeMuyu0327 @SonglinYang4 This is interesting! Does the Ev table still have a learnable weight matrix? You mention it’s normalized somehow. How so? Thanks
"It's easier to tune the LR for method A than for B."
We tried to formalize this for model-based stochastic optimization methods.
We find a key quantity, called the stability index, that describes how stable a (weakly) convex bound is as a function of the LR.
📚https://t.co/JIrG0gXqXL
@ruuustem_10 Yes, good point! It still irks me that we don't fully understand non-Euclidean methods on quadratics. This is a must if we are to rely on smoothness assumptions to understand Muon.
@Ji_Ha_Kim @YouJiacheng @noahamsel @ejarlebring What problem are you referring to? This example just shows that the optimal polynomial approximation to sign under the L2 norm does not satisfy the equioscillation theorem. The equioscillation theorem is about the L-infinity norm.
@elon_lit Nice work, this looks very interesting. Curiously, we showed that Adam explicitly tracks this same centered gradient variance, and this SNR threshold looks very similar to the square of the Adam update, see here https://t.co/H27RIg2nqK
Does this mean Adam is tracking this noise?
When β_1=β_2, we can first rewrite Adam as below, where instead of the standard uncentered 2nd moment, we have something that looks like a weird variance estimator. Fun fact, it is an online estimate of the variance! Let me explain ...
And now we are very proud and humbled to have received the ICLR 2026 Honorable Mention award for this work https://t.co/RgsVc3iv0G Very fun to have found this useful math nugget that can actually speed up LLM training.
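A quick numerical check of this rewrite, as a minimal sketch. Assumptions that are mine and not from the thread: no bias correction, zero initialization, scalar gradients, and the names `m`, `v`, `var`. With a shared β, the uncentered second moment `v` splits as `m**2 + var`, where `var` is an exponential moving average of the squared deviation of each gradient from the previous momentum.

```python
import random

def adam_moments(grads, beta):
    """Track Adam's EMAs with beta1 = beta2 = beta (no bias correction)."""
    m, v, var = 0.0, 0.0, 0.0
    history = []
    for g in grads:
        # Online variance estimate: deviation is measured from the OLD momentum.
        var = beta * var + beta * (1 - beta) * (g - m) ** 2
        m = beta * m + (1 - beta) * g        # first moment (momentum)
        v = beta * v + (1 - beta) * g * g    # uncentered second moment
        history.append((m, v, var))
    return history

random.seed(0)
grads = [random.gauss(1.0, 2.0) for _ in range(200)]
history = adam_moments(grads, beta=0.95)

# Largest violation of v = m^2 + var over the whole run.
max_gap = max(abs(v - (m * m + var)) for m, v, var in history)
```

So under these assumptions the Adam direction `m / sqrt(v)` equals `m / sqrt(m**2 + var)`, which is where the signal-to-noise reading comes from.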
Are you interested in the new Muon/Scion/Gluon method for training LLMs?
To run Muon, you need to approximate the matrix sign (or polar factor) of the momentum matrix. We've developed an optimal method *The PolarExpress* just for this! If you're interested, climb aboard 1/x
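For readers new to the polar factor, here is a minimal sketch of the classical cubic Newton-Schulz iteration. To be clear, this is the textbook baseline and not the optimal polynomials of The PolarExpress; the function name, step count, and test matrix are my own choices for illustration.

```python
import numpy as np

def newton_schulz_polar(G, steps=30):
    """Approximate the polar factor U V^T of a full-column-rank G = U S V^T."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1].
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which pushes it to 1.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Build a test matrix with known singular vectors and singular values 3, 2, 1.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((4, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
G = U @ np.diag([3.0, 2.0, 1.0]) @ V.T

X = newton_schulz_polar(G)
polar_error = np.abs(X - U @ V.T).max()  # distance to the exact polar factor
```

The fixed cubic is slow on small singular values, since early steps only grow them by roughly 1.5x. That is the kind of inefficiency that choosing better polynomials per step is meant to address.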
@_arohan_ @tonysilveti Let me save you some time. If you keep following this logic of a closed-form prox and a regularized secant equation, you get a new quasi-Newton method that works for non-convex problems. But it turns out this was already done here: https://t.co/6OKrCYAhyO
@tonysilveti I thought only criteria 1 and 2 were directly motivated through the secant equation. I don't see any such direct link between criterion 3 and the secant equation. In any case, it's simply E||P dg - d\theta||_{P^{-1}}^2.
We've just finished some work on improving the sensitivity of Muon to the learning rate, and exploring a lot of design choices. If you want to see how we did this, follow me ... 1/x (Work led by the amazing @CrichaelMawshaw)
@jeffreycider @CV_novel_plume Using the optimal polynomials instead would improve exactly the iteration complexity, that is, it would require slightly fewer iterations to reach a desired loss.
@CV_novel_plume I agree with this statement. Overtuning an optimizer to one problem doesn't really teach us anything. This is also why I find the AlgoPerf benchmark interesting for comparing optimizers https://t.co/5ATa0kJxXL, especially the self-tuning track.
@FengzhuoZhang About their Hybrid Newton-Schulz in the v4 report, I understand they change the polynomials after 8 steps to ensure convergence. But it would converge even faster if they just use the *optimal* sequence of 10 polynomials, as we proposed here: https://t.co/PssM4mX0CH
@torchcompiled Nice idea! For the retraction map, you may want to try the optimal polynomials instead. For instance, you could just apply the first two optimal polynomials to correct the approximation. We had an ICLR paper on this: https://t.co/LezL38cPua
@bozavlado @giffmana Yeah, this is mad, and it's always the same issue with ADMM methods applied in this way. Unless these copies of the model are distributed across different machines, it makes no sense!