@JohnCLangford New updates to the Dion codebase (https://t.co/jVz7Fxv6B1). Please check them out!
- Dion2 (https://t.co/VtAoq2eH01), which has much simpler math than Dion.
- NorMuon (https://t.co/ZHaNLWQO2y) thanks to @li_zichong.
A new improvement in Dion yields a speedup that makes orthonormal updates (e.g., Muon) more scalable for larger matrices. The trick: carefully using Newton-Schulz (on smaller matrices) as Dion's backend. Updates to our microsoft/dion codebase are coming soon, so stay tuned!
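For readers unfamiliar with Newton-Schulz, here is a minimal pure-Python sketch of the classical cubic iteration X <- 1.5 X - 0.5 X X^T X, which pushes every singular value of a matrix toward 1. It is not the Dion or Muon code: Muon-style optimizers typically use a tuned higher-order polynomial on the GPU, and how Dion applies the iteration to smaller matrices is not shown here. The helper names (`matmul`, `transpose`, `newton_schulz`) and the example matrix are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=30):
    """Approximate the orthogonal polar factor of a full-rank square matrix g.

    Assumption: classical cubic Newton-Schulz, not the tuned polynomial
    used by production Muon implementations.
    """
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # X <- 1.5 * X - 0.5 * (X X^T X)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)]
             for r1, r2 in zip(x, xxt_x)]
    return x


# Hypothetical 2x2 "gradient" matrix, chosen only for illustration.
g = [[2.0, -1.0], [1.0, 3.0]]
q = newton_schulz(g)
qtq = matmul(transpose(q), q)  # should be close to the identity
print(qtq)
```

The iteration uses only matrix multiplications, which is why it maps well to accelerators. Its cost grows with the matrix dimensions, so running it on smaller matrices is cheaper.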
Join us on Sept 24 at 8 AM PT for Microsoft Research Forum Season 2 – a virtual series highlighting purposeful research and its real-world impact, from fundamental exploration to advancing AI responsibly, scaling innovation through products and open source, and driving positive change for society. Register now: https://t.co/eWh5h1NZ7N
@jxbz love the repo! clean code, good practices but still not over-engineered, triton kernels, well documented, simple reference implementations alongside optimized code. nice
I had wondered why there was no official Dion implementation by the authors... I guess now we know. This repository looks dope: FSDP Muon and Dion implementations, triton kernels for Newton-Schulz, and lots of practical advice
(1/2)
[1/6] Curious about Muon, but not sure where to start? I wrote a 3-part blog series called “Understanding Muon” designed to get you up to speed—with The Matrix references, annotated source code, and thoughts on where Muon might be going.
Since nobody asked :-), here is my list of papers not to be missed from ICML:
1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it).
2) MARS: Unleashing the Power of Variance Reduction for Training Large Models
3) ...
@orvieto_antonio @micahgoldblum @teodorasrec @jonasgeiping Nice results! One question: wouldn’t a large (global) batch size be more practical for distributed training? Does that mean SGD is still not effective at large scale?
Schedule-Free methods, which forgo cosine/linear schedulers by averaging iterates and computing gradients at interpolated points, yield smoother training curves. It has been unclear why they work well; this paper explains the phenomenon through the river-valley loss landscape.
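To make "averaging iterates and computing gradients at interpolated points" concrete, here is a minimal sketch of Schedule-Free SGD on a one-dimensional quadratic. It is an illustration only, not the code from the paper mentioned above. The function name `schedule_free_sgd`, the toy objective, and the values of `lr` and `beta` are assumptions chosen for the example.

```python
def schedule_free_sgd(grad, w0, lr=0.5, beta=0.9, steps=1000):
    """Schedule-Free SGD sketch with a constant learning rate.

    z: the base SGD iterate
    x: the running uniform average of the z iterates
    y: the interpolation between z and x where the gradient is evaluated
    """
    z = w0
    x = w0
    for t in range(1, steps + 1):
        # Gradient is computed at an interpolated point, not at z or x.
        y = (1.0 - beta) * z + beta * x
        # Plain SGD step on the base iterate, no learning-rate schedule.
        z = z - lr * grad(y)
        # Uniform averaging of iterates: x becomes the mean of the z sequence.
        c = 1.0 / (t + 1)
        x = (1.0 - c) * x + c * z
    return x


# Toy objective f(w) = 0.5 * (w - 3)^2, so the gradient is (w - 3).
# This objective is a hypothetical example, not from the source.
x_final = schedule_free_sgd(lambda w: w - 3.0, w0=0.0)
print(x_final)
```

The averaged iterate `x` is the one that would be evaluated or deployed. Here the averaging stands in for the learning-rate decay that a cosine or linear schedule would otherwise provide.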