Shuhua Yu

@paddlepaddle_

working on large-scale training of generative AI and optimization. research scientist at Meta FAIR, PhD from Carnegie Mellon.

Joined October 2023

18 Following

8 Followers

3 Posts

Shuhua Yu @paddlepaddle_

about 1 month ago

Introducing HTMuon, which adds a spectral correction to Muon and thereby preserves more of the variation in magnitude across spectral directions. The exponent p can be approximated efficiently with iterations similar to Newton-Schulz steps. Thanks @TianyuPang327 and wonderful collaborators!
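For readers unfamiliar with Newton-Schulz steps, here is a minimal sketch of the classical cubic iteration, which approximates the orthogonal factor UV^T of a matrix without computing an SVD. It covers only the plain orthogonalization case (no exponent p). It is not HTMuon's approximation scheme, which the post does not spell out, and the function name and step count are illustrative assumptions.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Approximate the polar factor U V^T of G (hypothetical helper, classical cubic iteration)."""
    # Frobenius scaling puts every singular value in (0, 1], inside the convergence region.
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        # Each step maps every singular value s to 1.5*s - 0.5*s**3, which pushes it toward 1.
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

# Example: a small full-rank matrix standing in for a gradient or momentum buffer.
G = np.array([[2.0, 1.0], [1.0, 3.0], [0.0, 1.0]])
X = newton_schulz_orthogonalize(G)
```

The iteration uses only matrix multiplications, which is why this family of methods suits accelerators. Practical optimizers typically use tuned polynomial coefficients and far fewer steps than this sketch.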

Tianyu Pang

@TianyuPang327

about 1 month ago

🎉 Excited to share our recent Findings of ACL 2026 paper, HTMuon!

Muon has recently shown promising results in LLM training. But can we further improve its update rule? In our new work, we study Muon from the perspective of Heavy-Tailed Self-Regularization (HT-SR) theory and introduce HTMuon, a simple yet effective spectral correction for Muon.

Our key contributions are:

1. Understanding a limitation of Muon. Muon’s orthogonalized update rule can over-emphasize noise-dominated directions and suppress the emergence of heavy-tailed eigenspectral distributions in the model’s weight matrices, potentially limiting performance under HT-SR theory.

2. Introducing HTMuon. While Muon uses the orthogonalized update UV^T, HTMuon considers the more general form U\Sigma^pV^T, introducing a spectral correction. This enables HTMuon to produce heavier-tailed updates while preserving Muon’s strength in capturing parameter interdependencies. Across LLM pretraining and image classification, HTMuon consistently improves over Muon and other strong optimizers. It can also be used as a plug-in correction for existing Muon variants. For example, HTMuon reduces perplexity by up to 0.98 over Muon in LLaMA pretraining on C4. We further develop accelerated implementations and demonstrate improvements over Muon on LLaMA-1B.

3. Providing a theoretical characterization. We show that HTMuon is equivalent to steepest descent under a Schatten-q norm constraint and provide a convergence analysis in smooth non-convex settings. The results show that HTMuon retains competitive convergence guarantees while improving practical training performance.

📄 Paper: https://t.co/7yqov5p3jP
💻 Code: https://t.co/iWVtOBspcS

Many thanks to my collaborators Yujie Fang, @HenryLiu0820, @DengShenyang24, @twweeb, Shuhua Yu and @nsfzyzz!
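To make the update form U\Sigma^pV^T concrete, here is a minimal sketch that computes it with an explicit SVD. This is not the authors' implementation (the post mentions accelerated implementations that are not shown here). The function name, the example matrix, and the value p = 0.5 are assumptions chosen purely for illustration.

```python
import numpy as np

def spectral_power_update(G, p):
    """Return U diag(S**p) V^T for the thin SVD G = U diag(S) V^T (hypothetical helper)."""
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    # Scaling the columns of U by S**p is equivalent to U @ diag(S**p).
    return (U * S**p) @ Vt

# A small matrix standing in for a gradient or momentum buffer.
G = np.array([[2.0, 1.0], [1.0, 3.0], [0.0, 1.0]])

muon_like = spectral_power_update(G, 0.0)  # p = 0: all singular values become 1, i.e. U V^T
ht_like = spectral_power_update(G, 0.5)    # 0 < p < 1: keeps part of the spread in singular values
```

With p = 0 the function reduces to the orthogonalized update UV^T, and with p = 1 it returns G unchanged, so the exponent interpolates between the two.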


14K

313

Shuhua Yu @paddlepaddle_

about 2 months ago

Thanks Shenyang! Check out rmnp, a simple yet effective preconditioner for LLM optimization. We have an asymptotic theory that demonstrates the diagonal dominance.

Shenyang Deng ✈️ ICML2026

@DengShenyang24

about 2 months ago

1/n Please stop by👋. This is not just another ICML 2026 optimizer paper. We have rich intuition to share on why simple preconditioners like orthogonalization and row-normalization specifically benefit NN optimization. Quick overview below 🧵

118

113

20K

Shuhua Yu @paddlepaddle_

over 1 year ago

@MaxWeichart Hi Max, I am using your Tetris environment for an RL study. The problem with the grouped action space is that it misses some actions. In the attached example, it missed three legitimate placements: the piece placed vertically in the leftmost column (twice, once for each of two rotations) and once in the second-leftmost column.
