Introducing HTMuon, which adds a spectral correction to Muon and thereby preserves more of the variance in magnitude across spectral directions. The exponent p can be approximated efficiently with iterations similar to Newton-Schulz steps. Thanks @TianyuPang327 and wonderful collaborators!
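For readers unfamiliar with Newton-Schulz steps, here is a minimal NumPy sketch of the classic cubic iteration for the plain orthogonalization case (p = 0, i.e. UV^T). It is only an illustration: the function name and step count are made up for this example, practical Muon implementations commonly use a tuned higher-order polynomial with only a handful of steps, and this is not the HTMuon approximation for general p.

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 40) -> np.ndarray:
    """Approximate U @ V^T for g = U @ diag(s) @ V^T without computing an SVD.

    Hypothetical helper for illustration; not taken from the HTMuon code.
    """
    # Frobenius normalization puts every singular value in [0, 1],
    # which is inside the convergence region of the cubic iteration.
    x = g / np.linalg.norm(g)
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3,
        # which pushes nonzero singular values toward 1.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

rng = np.random.default_rng(0)
g = rng.standard_normal((6, 4))

# Reference answer from an explicit SVD, used only to check the iteration.
u, _, vt = np.linalg.svd(g, full_matrices=False)
approx = newton_schulz_orthogonalize(g)
err = np.abs(approx - u @ vt).max()
print("max abs deviation from U V^T:", err)
```

The iteration uses only matrix multiplications, which is why this family of methods is attractive on accelerators compared with an explicit SVD.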
Excited to share our recent Findings of ACL 2026 paper, HTMuon!
Muon has recently shown promising results in LLM training. But can we further improve its update rule? In our new work, we study Muon from the perspective of Heavy-Tailed Self-Regularization (HT-SR) theory and introduce HTMuon, a simple yet effective spectral correction for Muon.
Our key contributions are:
1. Understanding a limitation of Muon. Muon's orthogonalized update rule can over-emphasize noise-dominated directions and suppress the emergence of heavy-tailed eigenspectral distributions in the model's weight matrices, potentially limiting performance under HT-SR theory.
2. Introducing HTMuon. While Muon uses the orthogonalized update UV^T, HTMuon considers the more general form U\Sigma^pV^T, introducing a spectral correction. This enables HTMuon to produce heavier-tailed updates while preserving Muon's strength in capturing parameter interdependencies. Across LLM pretraining and image classification, HTMuon consistently improves over Muon and other strong optimizers. It can also be used as a plug-in correction for existing Muon variants. For example, HTMuon reduces perplexity by up to 0.98 over Muon in LLaMA pretraining on C4. We further develop accelerated implementations and demonstrate improvements over Muon on LLaMA-1B.
3. Providing a theoretical characterization. We show that HTMuon is equivalent to steepest descent under a Schatten-q norm constraint and provide a convergence analysis in smooth non-convex settings. The results show that HTMuon retains competitive convergence guarantees while improving practical training performance.
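To make the U\Sigma^pV^T form in point 2 concrete, here is a minimal NumPy sketch that builds it from an explicit SVD. This is an illustration under assumptions, not the released implementation: the function name, the matrix shape, and the value p = 0.25 are hypothetical, and momentum, scaling, and the accelerated approximation are all omitted.

```python
import numpy as np

def spectral_power_update(grad: np.ndarray, p: float) -> np.ndarray:
    """Return U @ diag(s**p) @ V^T for the thin SVD grad = U @ diag(s) @ V^T.

    Hypothetical helper for illustration; assumes p >= 0.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    # Scale each column of U by the matching powered singular value.
    return (u * s**p) @ vt

def singular_values(m: np.ndarray) -> np.ndarray:
    """Singular values of m in descending order."""
    return np.linalg.svd(m, compute_uv=False)

rng = np.random.default_rng(0)
g = rng.standard_normal((6, 4))  # stand-in for a gradient or momentum matrix

muon_like = spectral_power_update(g, p=0.0)  # p = 0: orthogonalized update U V^T
ht_like = spectral_power_update(g, p=0.25)   # 0 < p < 1: illustrative value only
plain = spectral_power_update(g, p=1.0)      # p = 1: recovers g itself

print("p=0    singular values:", singular_values(muon_like))
print("p=0.25 singular values:", singular_values(ht_like))
print("p=1    singular values:", singular_values(plain))
```

With p = 0 every singular value becomes 1, so all spectral directions get equal weight; with p between 0 and 1 the update keeps the same singular vectors but retains part of the spread in magnitudes. Under the standard Hölder duality for Schatten norms, the steepest-descent direction for a Schatten-q constraint has exponent p = 1/(q-1) up to normalization (q = ∞ gives p = 0, q = 2 gives p = 1); the paper's own parameterization may differ.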
Paper: https://t.co/7yqov5p3jP
Code: https://t.co/iWVtOBspcS
Many thanks to my collaborators Yujie Fang, @HenryLiu0820, @DengShenyang24, @twweeb, Shuhua Yu and @nsfzyzz!
Thanks Shenyang! Check out rmnp, a simple yet effective preconditioner for LLM optimization. We have an asymptotic theory that demonstrates the diagonal dominance.
1/n Please stop by. This is not just another ICML 2026 optimizer paper. We have rich intuition to share on why simple preconditioners like orthogonalization and row-normalization specifically benefit NN optimization. Quick overview below 🧵
@MaxWeichart Hi Max, I am using your Tetris environment for an RL study. The problem with the grouped action space is that it misses some actions. In the attached example, it missed three legitimate placements: the piece placed vertically in the leftmost column (counted twice, once for each of the two rotations) and once in the second-leftmost column.