Thomas Pethick

@tmpethick

Joined July 2011

91 Following

268 Followers

155 Posts

Pinned Tweet

Thomas Pethick @tmpethick

23 days ago

1/ We introduce SODA: a simple optimizer wrapper that improves a base optimizer, adds no hyperparameters, and removes the need to tune weight decay. The wrapper provides consistent improvement. Most notably, SODA(Muon) beats Muon even when Muon gets a tuned weight decay sweep.

https://t.co/15obqAAuYa

2

128

21

112

9K

tmpethick retweeted

11 days ago

SODA-AMUSE+Gram+PMuon is the recipe that wins consistently on 4/4 codex GPU instances. PMuon is a new idea: https://t.co/Zh6QCYFwGo AMUSE is Muon + ScheduleFree: https://t.co/ZBEFoRWdkA SODA is a wrapper that's more popular: https://t.co/OWuWxNFYl0 Gram is a lesser-known Gram-Newton-Schulz optimization: https://t.co/qin6o8SexN

1

10

3

19

2K

Thomas Pethick @tmpethick

13 days ago

@jlylekim Looks interesting! We’re using schedule-free to get rid of weight decay tuning in https://t.co/f25xr2KEMX - I’m curious if you can somehow combine to get rid of both

1

7

1

5

580

Thomas Pethick @tmpethick

23 days ago

@tonysilveti I got a bit delayed but here we are: https://t.co/14B3mBzG4Z

0

1

0

1

51

Thomas Pethick @tmpethick

25 days ago

@tonysilveti Thank you for sharing! You beat me to it 😂 I'm really excited about this direction - will share some more thoughts tomorrow

3

5

0

0

422

Thomas Pethick @tmpethick

23 days ago

7/ Thanks to @CevherLIONS for supporting it and getting Roman Macháček on board, and @WanyunXie for scaling up the experiments and debugging runs together – it’s always a joy

0

6

1

1

640

Thomas Pethick @tmpethick

23 days ago

6/ There are a lot of interesting questions one can ask from this perspective — please check out the paper! Paper: https://t.co/f25xr2KEMX Code: https://t.co/lqZiHiqD0t

2

8

1

1

739

Thomas Pethick @tmpethick

24 days ago

@EIFY @tonysilveti One reason I find the appearance of z0 interesting is finetuning, where weight decay is otherwise typically not used (I added a comment on this in the conclusion) - but that’s a story in itself

0

2

0

0

45

Thomas Pethick @tmpethick

24 days ago

@EIFY @tonysilveti We haven’t ablated this beyond 1x Chinchilla with a 124M model (the main setting where I tested things before we extrapolated across horizon and model size)

1

2

0

0

51

Thomas Pethick @tmpethick

24 days ago

@CV_novel_plume With that said I think the ODA perspective is interesting in itself and there's more to extract

0

1

0

0

68

Thomas Pethick @tmpethick

24 days ago

@CV_novel_plume Yes, this is exactly the SODA wrapper! I wanted to extract something concrete from the perspective, and at the time I was trying to understand weight decay - surprisingly, the first thing I tried (just using params from theory) worked without any retuning of lr etc. of the base optimizer

1

1

0

0

108

Thomas Pethick @tmpethick

24 days ago

@_arohan_ Yes, non-constant is interesting to explore. The GPA paper by @aaron_defazio has a very nice perspective on DiLoCo, also through schedule-free, which makes the delta/diff from SODA easier to understand (I've compared in the related work section)

https://t.co/MMCNyus0MI

0

1

0

2

54

Thomas Pethick @tmpethick

25 days ago

@wen_kaiyue @tonysilveti Your comment about optimism is interesting. I mainly focused on extracting a schedule for weight decay in this work, but there is an interesting question on how to schedule optimism and weight decay in tandem to better exploit smoothness

0

1

0

0

34

Thomas Pethick @tmpethick

25 days ago

@wen_kaiyue @tonysilveti It's actually not quite batch-size independent - if you squint, the convergence theorem suggests a constant and then 1/√k (the changepoint will depend on noise), but we didn't investigate this much empirically yet

1

2

0

0

61
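The "constant, then 1/√k" shape mentioned in the reply above can be sketched as a simple step-indexed schedule. This is only an illustration: the reply does not say which quantity follows the schedule or how the changepoint is chosen (only that it depends on noise), so `base`, `changepoint`, and the function name are hypothetical, and making the curve continuous at the changepoint is a choice made here, not something taken from the paper.

```python
import math


def constant_then_inv_sqrt(k: int, base: float = 1.0, changepoint: int = 100) -> float:
    """Hypothetical schedule: constant up to `changepoint`, then ~ 1/sqrt(k).

    k is the 1-indexed step. The decay branch is scaled so that the value
    is continuous at k == changepoint (an assumption of this sketch).
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    if k <= changepoint:
        return base
    # After the changepoint the value decays proportionally to 1/sqrt(k).
    return base * math.sqrt(changepoint / k)


if __name__ == "__main__":
    for step in (1, 100, 400, 10_000):
        print(step, constant_then_inv_sqrt(step))
```

With the defaults, the value stays at `base` for the first 100 steps and has halved by step 400, since √(100/400) = 0.5.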

Thomas Pethick @tmpethick

25 days ago

@tonysilveti Yeah, except that even for MLP blocks RMSNorm(W2σ(W1)) it's ok for both matrices, as long as the activation function is positively homogeneous (e.g., true for ReLU and ReLU^2)

0

1

0

0

84
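One way to read the MLP-block remark above is as a scale-invariance statement: if σ is positively homogeneous, rescaling either W1 or W2 by a positive constant only rescales the pre-normalization output, and RMSNorm removes that factor. The check below is a minimal numerical sketch of that reading, not code from the linked note. It assumes an RMSNorm with no learnable gain and no epsilon, an explicit input vector `x` (the tweet omits it), and small hand-picked matrices; all names are illustrative.

```python
import math


def matvec(W, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, t) for t in v]


def relu_squared(v):
    return [max(0.0, t) ** 2 for t in v]


def rms_norm(v):
    """RMSNorm without gain or epsilon (an assumption of this sketch)."""
    rms = math.sqrt(sum(t * t for t in v) / len(v))
    return [t / rms for t in v]


def block(W1, W2, x, act):
    """RMSNorm(W2 act(W1 x))."""
    return rms_norm(matvec(W2, act(matvec(W1, x))))


def scaled(W, c):
    return [[c * w for w in row] for row in W]


def invariance_gap(act, c1=3.0, c2=0.5):
    """Largest output change when W1 is scaled by c1 and W2 by c2 (both > 0)."""
    x = [1.0, -2.0, 0.5]
    W1 = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5], [-1.0, 0.5, 1.0], [2.0, 1.0, 0.0]]
    W2 = [[1.0, -2.0, 0.5, 1.0], [0.5, 1.0, -1.0, 2.0], [-1.5, 0.5, 2.0, 0.0]]
    ref = block(W1, W2, x, act)
    out = block(scaled(W1, c1), scaled(W2, c2), x, act)
    return max(abs(a - b) for a, b in zip(ref, out))


if __name__ == "__main__":
    for act in (relu, relu_squared):
        print(act.__name__, invariance_gap(act))
```

Both gaps come out at floating-point rounding level. With a nonzero epsilon inside the norm, the invariance would hold only approximately.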

Thomas Pethick @tmpethick

25 days ago

Why is Frobenius weight normalization ok when combined with non-Euclidean steepest descent methods? A short note: https://t.co/Yuw6wK7uAt

0

21

2

15

2K
