Minhak Song @MinhakSong - Twitter Profile

about 1 month ago

Our new optimizer AMUSE: Muon + Schedule-Free + time-varying SF momentum. No LR schedule needed, beats tuned scheduled baselines.

Two concurrent works converging on similar ideas:
• ScheduleFree+ (@aaron_defazio): SF-AdamW + time-varying SF momentum
• SF-NorMuon (@jlylekim)

Jueun Kim @jueunkim_0525

about 1 month ago

🚨New Optimizer Paper
AMUSE: Anytime MUon with Stable gradient Evaluation

AMUSE combines Muon with Schedule-Free-style gradient evaluation for stable anytime training without LR decay.

• Stronger 124M / 720M / 1B pretraining
• Strong ImageNet / ViT fine-tuning performance. https://t.co/Y1qQnpDt2n

MinhakSong retweeted

Scientific Methods for Understanding Deep Learning @scifordl

2 months ago

Minhak Song from KAIST is telling us about "Zeroth-Order Optimization at the Edge of Stability"

Minhak Song @MinhakSong

4 months ago

I'm happy to share that our paper (led by @DengShenyang24) won the Best Student Paper Award at ALT 2026! Paper: https://t.co/lAAXGcSCnG

Shenyang Deng ✈️ ICML2026

@DengShenyang24

4 months ago

It's an honor to receive the Best Student Paper Award at #ALT2026 (37th Algorithmic Learning Theory)! 🏆

Huge thanks to my amazing collaborators Boyao, @Collapsar0000, @Tianyu0628, @MinhakSong, @nsfzyzz!

Had a great time at the Fields Institute in Toronto. 🇨🇦 Looking forward to attending ALT again next time! ✨

MinhakSong retweeted

Jeremy Cohen @deepcohen

9 months ago

Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.

MinhakSong retweeted

Jason Lee @jasondeanlee

12 months ago · Kensington

Really nice use of the central flow framework!

MinhakSong retweeted

Konstantin Mishchenko

@konstmish

12 months ago

Schedule-Free methods, which forgo cosine/linear schedulers by averaging iterates and computing gradients at interpolated points, yield smoother training curves. It's still unclear why they work well, and this paper explains the phenomenon through the river-valley loss landscape. https://t.co/gZKaivYAu8
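
For readers unfamiliar with the mechanism mentioned in the tweet above ("averaging iterates and computing gradients at interpolated points"), here is a minimal sketch. It assumes the plain Schedule-Free SGD form with uniform averaging weights c = 1/(t+1); the function name `schedule_free_sgd`, the toy quadratic, and the hyperparameters are illustrative choices, not taken from the paper or from any library.

```python
def schedule_free_sgd(grad, x0, lr=0.2, beta=0.9, steps=1000):
    """Sketch of Schedule-Free SGD on a list of floats (illustrative only)."""
    z = list(x0)  # base SGD sequence
    x = list(x0)  # running average of z; this is the point one would evaluate
    for t in range(1, steps + 1):
        # Gradient is computed at an interpolation between z and x.
        y = [(1 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad(y)
        # Constant learning rate: no cosine/linear decay schedule.
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        # Uniform averaging weight (an assumption of this sketch).
        c = 1.0 / (t + 1)
        x = [(1 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x


def quad_grad(w):
    # Gradient of the toy loss f(w) = 0.5*(w0 - 3)^2 + 2*(w1 + 1)^2,
    # whose minimizer is (3, -1).
    return [w[0] - 3.0, 4.0 * (w[1] + 1.0)]


x_final = schedule_free_sgd(quad_grad, [0.0, 0.0])
print(x_final)
```

On this deterministic toy problem the averaged iterate `x_final` should land close to the minimizer (3, -1); it says nothing about behavior on real networks, which is what the paper addresses.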

Minhak Song @MinhakSong

12 months ago

@konstmish Thanks for sharing Konstantin!

MinhakSong retweeted

Chanwoo Park

@chanwoopark20

about 1 year ago

Interesting perspective. Misspecification does matter.

MinhakSong retweeted

Gokul Swamy @g_k_swamy

about 1 year ago

Very clear paper fleshing out different extensions of the story we outlined in https://t.co/G7XfEtcD6u!

MinhakSong retweeted

Simon Shaolei Du

@SimonShaoleiDu

about 1 year ago

PPO vs. DPO? 🤔 Our new paper proves that it depends on whether your models can represent the optimal policy and/or reward. Paper: https://t.co/qNWwWhQQpA Led by @smellycat_ZZZ @MinhakSong

Minhak Song @MinhakSong

about 1 year ago

RLHF vs DPO under reward and/or policy model misspecification—when does each method succeed? Our new paper provides a fine-grained theoretical comparison. 📄 https://t.co/vdpAiQHu5l

Ruizhe Shi @smellycat_ZZZ

about 1 year ago

Two-stage RLHF or one-stage DPO: Which one is better for learning from preferences?

Equal under strong assumptions, but representation differences break the tie. Our paper reveals their fine-grained performance gaps under various conditions.

paper: https://t.co/B3OD6YRAts https://t.co/ZsxqdxlgNp
