Optimizers aren't all created equal. ⚡
@teodorasrec dives into Adam vs. SGD, the hidden role of batch size, and why our assumptions about training transformers might need rethinking.
Watch the full session here: https://t.co/Kf4c9DhBp7
Join our ML Theory group on Thursday, August 28th as they host @teodorasrec for a session exploring the performance gap between Adam and SGD in language models. The key finding: SGD with momentum can match Adam's performance in small-batch settings when properly tuned.
Thanks to @itsmaddox_j, @aniervs and @ThangChu77 for organizing this session 🔥
Learn more: https://t.co/wxIYlrGDrU