Fabian Schaipp

@FSchaipp

working on optimization for machine learning. currently postdoc @inria_paris.

Paris, France

Joined July 2020

755 Following

1.3K Followers

522 Posts

Pinned Tweet

Fabian Schaipp @FSchaipp

over 1 year ago

Learning rate schedules seem mysterious? Turns out that their behaviour can be described with a bound from *convex, nonsmooth* optimization. Short thread on our latest paper 🚇 https://t.co/DGHoG1FS3f

over 1 year ago

The sudden loss drop when annealing the learning rate at the end of a WSD (warmup-stable-decay) schedule can be explained without relying on non-convexity or even smoothness, a new paper shows that it can be precisely predicted by theory in the convex, non-smooth setting! 1/2

(Quoted tweet by @aaron_defazio, with photo: https://t.co/7oYWqWpn9J)
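A minimal sketch of a WSD (warmup-stable-decay) schedule as mentioned in the quoted tweet, for readers who have not seen one. The function name `wsd_lr`, its parameters, the linear warmup, and the linear cooldown towards zero are illustrative assumptions, not taken from the paper or the thread.

```python
def wsd_lr(step, total_steps, peak_lr=1.0, warmup_steps=0, cooldown_frac=0.2):
    """Learning rate at `step` (0-indexed) for a hypothetical WSD schedule.

    Assumed shape: linear warmup, constant plateau, linear cooldown.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")

    cooldown_steps = int(cooldown_frac * total_steps)
    cooldown_start = total_steps - cooldown_steps

    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < cooldown_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: anneal linearly; the sudden loss drop discussed in the
    # tweet is observed during this phase.
    return peak_lr * (total_steps - step) / cooldown_steps


# Example: 100 steps, 10 warmup steps, last 20% used for cooldown.
schedule = [wsd_lr(t, 100, peak_lr=1.0, warmup_steps=10, cooldown_frac=0.2)
            for t in range(100)]
```

The cooldown length is given as a fraction of the total number of steps, so it stays proportional to the training horizon.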

Fabian Schaipp @FSchaipp

7 days ago

@ruuustem_10 @orvieto_antonio @n_ajroldi yeah, agree! i guess you can get sth similar for the smooth case too, but probably need to adapt the alpha < 2/L step

Fabian Schaipp @FSchaipp

8 days ago

"It's easier to tune the LR for method A than for B." We tried to formalize this for model-based stochastic optimization methods. We find a key quantity, called stability index, that describes how stable a (weakly) convex bound is as a function of LR. 📚https://t.co/JIrG0gXqXL

(Tweet photo: https://t.co/9YM5R7b1QN)

Fabian Schaipp @FSchaipp

8 days ago

The theory is with SGD as base update step. But all these adaptive step sizes can be used to obtain practical methods. For example, Polyak step sizes in combination with
- Muon (@CrichaelMawshaw): https://t.co/ztFcHNWzmL
- ScheduleFree (@aaron_defazio): https://t.co/HBsVJ20RjJ
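For reference, a minimal sketch of the classical Polyak step size on a plain (sub)gradient step. It assumes the optimal value `f_star` is known, uses a deterministic toy objective, and all names are hypothetical. It is not the Muon or ScheduleFree variant from the linked posts.

```python
def polyak_subgradient(f, subgrad, x0, f_star, num_steps=100):
    """Subgradient method with the Polyak step size
    step = (f(x) - f_star) / ||g||^2, where g is a subgradient at x."""
    x = list(x0)
    for _ in range(num_steps):
        g = subgrad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        gap = f(x) - f_star
        if g_norm_sq == 0.0 or gap <= 0.0:
            break  # zero subgradient or optimal value reached
        step = gap / g_norm_sq
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Toy convex, nonsmooth objective (an assumption for illustration):
# f(x) = |x_0 - 1| + |x_1 + 2|, minimized at (1, -2) with f_star = 0.
def f(x):
    return abs(x[0] - 1.0) + abs(x[1] + 2.0)


def subgrad(x):
    def sign(t):
        return float((t > 0) - (t < 0))
    return [sign(x[0] - 1.0), sign(x[1] + 2.0)]


x_final = polyak_subgradient(f, subgrad, [5.0, 5.0], f_star=0.0)
```

No learning rate appears in this sketch: the step is set from the current suboptimality gap, which is why the rule needs `f_star` or an estimate of it.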

Fabian Schaipp @FSchaipp

8 days ago

Joint work with @gowerrobert and @TaylorAdrien , and just accepted at #ICML2026.

Fabian Schaipp @FSchaipp

13 days ago

Polyak step size is back!

14 days ago

🚨 New Paper 🚨
ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

A few modifications to Schedule-Free Learning make it completely LR tuning free, and allow it to greatly outperform schedules for long duration training!
https://t.co/LzjIIsOlG8

(Quoted tweet by @aaron_defazio, with photo.)

Fabian Schaipp @FSchaipp

19 days ago

@maxzimmerberlin @spokutta Congratulations, Dr. Zimmer!

Fabian Schaipp @FSchaipp

23 days ago

NeurIPS 2025 Proceedings are finally online!

(Tweet photo: https://t.co/FV8qNhm9nP)

Fabian Schaipp @FSchaipp

about 1 month ago

@konstmish when I tested CWD, it looked much better than the baseline for a long time, then lost all its advantage during cooldown. 😢 Was for a relatively short run though (D~20N).

Fabian Schaipp @FSchaipp

about 2 months ago

Nice result! (from https://t.co/O8sr1pvZhe) no anytime-schedule can obtain the optimal rate for (S)GD. to my knowledge, WSD is the closest candidate we know of, as it removes the log-factor in the rate for any cooldown length proportional to T.

(Tweet photo: https://t.co/J4VUSXkzlz)

Fabian Schaipp @FSchaipp

2 months ago

Going to Zurich for a couple of days. I will give a talk on recent optimization stuff @zurichnlp. Always happy to chat 🍫

(Tweet photo: https://t.co/djUvdxFiMf)

Fabian Schaipp @FSchaipp

2 months ago

@ruuustem_10 @YouJiacheng @CevherLIONS not sure i follow. the steplaw is not restricted to a fixed TPP.

Fabian Schaipp @FSchaipp

2 months ago

@CevherLIONS @YouJiacheng thanks for clarifying! so the batch size scaling is also only applicable to Scion?

Fabian Schaipp @FSchaipp

3 months ago

@JFPuget yes! https://t.co/HoNlIwhv1Q

Fabian Schaipp @FSchaipp

4 months ago

@rishabh16_ Muon for material foundation model: https://t.co/vNXpOOtkEU Not Bio, but closely related: https://t.co/q87mHqymyj

Fabian Schaipp @FSchaipp

4 months ago

After LLMs and diffusion, Muon also shines on tabular foundation models! Also nice to see they used cautious weight decay 🥌

David Holzmüller @DHolzmueller

4 months ago

Super excited that TabICLv2 is out 🎉 🚀Beats RealTabPFN-2.5 with no tuning and purely synthetic pre-training data. 👉Introduces QASSMax for long-context generalization, early target embedding, repeated feature grouping, Muon, etc., and a much diversified synthetic data prior.

Fabian Schaipp @FSchaipp

4 months ago

@ADarmouni that's not how research works

Fabian Schaipp @FSchaipp

4 months ago

not to offend anyone, but how tf do these papers get through review when not even the LR of the baseline is properly tuned?

4 months ago

Learning rate matters more than your LoRA variant.

In this study they sweep LR hard across LoRA variants (DoRA, Init[AB], PiSSA, MiLoRA) and find:

> If you tune LR properly, they all converge to approx the same peak perf.
> Rank still matters and can flip which variant looks best depending on dataset.
> Optimal learning rate is a function of how steeply curved the loss is:
>> more curved → smaller steps (lower LR)
>> less curved → larger steps (higher LR)

(Quoted tweet by @ZainHasan6, with photo.)
