It's been a delight to provide small amounts of advice and suggestions to people working on the Decoupled DiLoCo training system. This approach enables graceful handling of failures in large scale training jobs, by allowing (N-1) / N units to proceed when one fails.
Thread ⬇️
The DiLoCo team at Google DeepMind and Google Research is proud to release Decoupled DiLoCo, the next frontier for resilient AI pre-training.
Decoupled DiLoCo enables training with datacenters across the world, using heterogeneous hardware, and never halting the system despite hardware failures.
To clarify: I think the paper is cool! But the idea of applying JL transformer and random rotations for compression is a very well understood mechanism
Compression for gradients (or models in distributed training) have been using these techniques for decades.
I'm sure there are nuances to what properties you want for KV cache compression but all of these techniques are basically well trodden.
@yoavgo@BlackHC Sorry, this entire discourse is a bit laughable, because 100s of papers uses random projection followed by sign quantization. There is no need to call it q JL, because there are important differences and this goes by the name of 1-bit compressed sensing, hyperplane hashing etc
@yoavgo@BlackHC Sorry, this entire discourse is a bit laughable, because 100s of papers uses random projection followed by sign quantization. There is no need to call it q JL, because there are important differences and this goes by the name of 1-bit compressed sensing, hyperplane hashing etc
After 4yrs, today is my last day at @allen_ai
It was an honor to work on Olmo, Dolma, olmOCR, Tulu, Molmo & other fully-open artifacts 🫡 Reception has been amazing & their adoption makes me SO PROUD 🥹
Team is super committed to open recipes; can't wait to see what's next!!!!
Fun fact: you can use this to easily show that for any finite field F and degree n, there is an irreducible polynomial of degree n with coefficients in F.
probably more than any other field of mathematics, linear algebra strikes me as a bunch of trivial (ie obvious) facts, yet the end result is impressively powerful.
This is a very good paper, highly recommend.
My one gripe - the single learner DiLoCo case (called SNOO in the meta paper) has been known to improve AdamW well-before it was called SNOO, e.g. see:
* https://t.co/DMWYyc6cQA - Table 3
* https://t.co/9iCIk8oZEb - Figure 2 & 3 (and more)
I guess we should've given it a fun name to take credit for this?
1/10 Are DiLoCo and Schedule-Free actually related? A brief history and unusually late advertisement for our work: Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs (see https://t.co/ESaSU8kwpx).
@anpaure@Romy_Holland I think it unfortunately gets worse the more specialized a topic is. Wikipedia is famously bad for many math concepts, and in other fields is often used to drum up attention to people's (bad) papers.
@natalienkhalil@arxiv Why is it imperative to have survey papers allowed on arXiv? There are so many ways to disseminate such things, whereas in many fields of math (for example), new submissions to arXiv are used as a signal for what to read in a given day.