Learning rate schedules seem mysterious?
Turns out that their behaviour can be described with a bound from *convex, nonsmooth* optimization.
Short thread on our latest paper
https://t.co/DGHoG1FS3f
The sudden loss drop when annealing the learning rate at the end of a WSD (warmup-stable-decay) schedule can be explained without relying on non-convexity or even smoothness. A new paper shows that it can be precisely predicted by theory in the convex, non-smooth setting!
1/2
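For readers who haven't seen the schedule written down, here is a minimal sketch of a WSD learning rate, not taken from the paper. The function name, the default fractions, the linear warmup, and the linear cooldown to zero are all illustrative assumptions.

```python
def wsd_lr(step, total_steps, base_lr=1.0, warmup_frac=0.05, cooldown_frac=0.2):
    """Learning rate at `step` (0-indexed) for a warmup-stable-decay schedule.

    Assumed shape: linear warmup, constant plateau, linear cooldown to zero.
    """
    warmup_steps = round(warmup_frac * total_steps)
    cooldown_steps = round(cooldown_frac * total_steps)
    cooldown_start = total_steps - cooldown_steps
    if step < warmup_steps:
        # Warmup: ramp linearly up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    if step < cooldown_start:
        # Stable phase: hold the learning rate constant.
        return base_lr
    # Cooldown: anneal linearly, hitting zero right after the last step.
    return base_lr * (total_steps - step) / cooldown_steps


schedule = [wsd_lr(t, total_steps=100) for t in range(100)]
```

The loss drop discussed above happens during the final cooldown segment, i.e. the last `cooldown_frac` of training in this sketch.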
@ruuustem_10 @orvieto_antonio @n_ajroldi yeah, agree! i guess you can get sth similar for the smooth case too, but probably need to adapt the alpha < 2/L step
"It's easier to tune the LR for method A than for B."
We tried to formalize this for model-based stochastic optimization methods.
We find a key quantity, called stability index, that describes how stable a (weakly) convex bound is as a function of LR.
https://t.co/JIrG0gXqXL
The theory is with SGD as base update step. But all these adaptive step sizes can be used to obtain practical methods. For example, Polyak step sizes in combination with
- Muon (@CrichaelMawshaw): https://t.co/ztFcHNWzmL
- ScheduleFree (@aaron_defazio): https://t.co/HBsVJ20RjJ
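To make "Polyak step sizes" concrete, here is a minimal sketch of plain SGD with the textbook stochastic Polyak step on a toy least-squares problem. It is not the method from the linked papers (no Muon, no ScheduleFree). The function `polyak_sgd`, the toy data, and the assumption that each per-sample optimum is 0 (a consistent linear system) are illustrative choices.

```python
import random


def polyak_sgd(A, b, steps=2000, seed=0):
    """SGD with a stochastic Polyak step on f_i(x) = 0.5 * (a_i . x - b_i)**2.

    Assumes every per-sample optimum f_i^* is 0, i.e. A x = b is solvable.
    """
    rng = random.Random(seed)
    x = [0.0] * len(A[0])
    for _ in range(steps):
        i = rng.randrange(len(A))
        a = A[i]
        residual = sum(aj * xj for aj, xj in zip(a, x)) - b[i]
        loss = 0.5 * residual ** 2
        grad = [residual * aj for aj in a]
        grad_sq = sum(g * g for g in grad)
        if grad_sq == 0.0:
            continue  # this sample is already fit exactly
        # Polyak step: (f_i(x) - f_i^*) / ||grad||^2, with f_i^* = 0 assumed.
        step = loss / grad_sq
        x = [xj - step * g for xj, g in zip(x, grad)]
    return x


# Toy consistent system (hypothetical data, for illustration only).
data_rng = random.Random(1)
x_true = [1.0, -2.0, 0.5]
A = [[data_rng.gauss(0, 1) for _ in x_true] for _ in range(20)]
b = [sum(aj * xj for aj, xj in zip(a, x_true)) for a in A]
x_hat = polyak_sgd(A, b)
```

No learning rate is passed in: the step is computed from the current loss and gradient norm, which is what makes this family attractive for tuning-free methods.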
🚨 New Paper 🚨
ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models
A few modifications to Schedule-Free Learning make it completely LR tuning free, and allow it to greatly outperform schedules for long duration training!
https://t.co/LzjIIsOlG8
@konstmish when I tested CWD, it looked much better than the baseline for a long time, then lost all its advantage during cooldown. 😢
Was for a relatively short run though (D~20N).
Nice result! (from https://t.co/O8sr1pvZhe)
no anytime-schedule can obtain the optimal rate for (S)GD.
to my knowledge, WSD is the closest candidate we know of, as it removes the log-factor in the rate for any cooldown length proportional to T.
Super excited that TabICLv2 is out
- Beats RealTabPFN-2.5 with no tuning and purely synthetic pre-training data.
- Introduces QASSMax for long-context generalization, early target embedding, repeated feature grouping, Muon, etc., and a much more diverse synthetic data prior.
Learning rate matters more than your LoRA variant.
In this study they sweep LR hard across LoRA variants (DoRA, Init[AB], PiSSA, MiLoRA) and find:
> If you tune LR properly, they all converge to approx the same peak perf.
> Rank still matters and can flip which variant looks best depending on dataset.
> Optimal learning rate is a function of how steeply curved the loss is:
>> more curved → smaller steps (lower LR)
>> less curved → larger steps (higher LR)
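A toy illustration of the curvature point, not taken from the study: gradient descent on the 1-D quadratic f(x) = 0.5 * h * x^2 multiplies x by (1 - lr * h) each step, so it only converges when lr < 2 / h. The function name and the specific curvatures are hypothetical choices for the sketch.

```python
def gd_on_quadratic(curvature, lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2; return final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x  # the gradient of f is curvature * x
    return abs(x)


flat, sharp = 1.0, 100.0

ok_flat = gd_on_quadratic(flat, lr=0.5)        # lr < 2 / 1: shrinks toward 0
diverged = gd_on_quadratic(sharp, lr=0.5)      # lr > 2 / 100: blows up
ok_sharp = gd_on_quadratic(sharp, lr=0.005)    # lr scaled down by the curvature
```

The same LR that works on the flat problem diverges on the sharp one, and dividing it by the curvature ratio restores the same contraction per step.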