Zachary Charles

Verified account

@MatharyCharles

distributed machine learning @ google | sometimes mathematician

Seattle

Joined September 2012

430 Following

1.4K Followers

922 Posts

MatharyCharles retweeted

about 1 month ago

It's been a delight to provide small amounts of advice and suggestions to people working on the Decoupled DiLoCo training system. This approach enables graceful handling of failures in large scale training jobs, by allowing (N-1) / N units to proceed when one fails. Thread ⬇️

23

684

75

242

109K

MatharyCharles retweeted

Arthur Douillard

about 1 month ago

The DiLoCo team at Google DeepMind and Google Research is proud to release Decoupled DiLoCo, the next frontier for resilient AI pre-training. Decoupled DiLoCo enables training with datacenters across the world, using heterogeneous hardware, and never halting the system despite hardware failures.

33

604

85

301

3M

Zachary Charles

@MatharyCharles

2 months ago

How do people write math papers with AI these days? What scaffolding/agents/editors do people find helpful?

0

1

1

0

314

Zachary Charles

@MatharyCharles

2 months ago

To clarify: I think the paper is cool! But applying the JL transform and random rotations for compression is a very well understood mechanism.

0

2

0

0

188

Zachary Charles

@MatharyCharles

2 months ago

Compression methods for gradients (or models in distributed training) have been using these techniques for decades. I'm sure there are nuances to what properties you want for KV cache compression, but all of these techniques are basically well trodden.

Arya Mazumdar @MountainOfMoon

2 months ago

@yoavgo @BlackHC Sorry, this entire discourse is a bit laughable, because 100s of papers use random projection followed by sign quantization. There is no need to call it q JL, because there are important differences and this goes by the name of 1-bit compressed sensing, hyperplane hashing, etc.

1

21

1

6

4K

1

12

1

11

2K

MatharyCharles retweeted

Arya Mazumdar @MountainOfMoon

2 months ago

@yoavgo @BlackHC Sorry, this entire discourse is a bit laughable, because 100s of papers use random projection followed by sign quantization. There is no need to call it q JL, because there are important differences and this goes by the name of 1-bit compressed sensing, hyperplane hashing, etc.

1

21

1

6

4K

Zachary Charles

@MatharyCharles

2 months ago

This exodus has been brutal, so many people leaving whose work I have thoroughly appreciated over the years

Luca Soldaini 🎀

2 months ago

After 4yrs, today is my last day at @allen_ai It was an honor to work on Olmo, Dolma, olmOCR, Tulu, Molmo & other fully-open artifacts 🫡 Reception has been amazing & their adoption makes me SO PROUD 🥹 Team is super committed to open recipes; can't wait to see what's next!!!!

68

581

9

21

33K

0

8

0

0

1K

Zachary Charles

@MatharyCharles

2 months ago

@dlwh Oh very cool to see hyperball in action! Did you also normalize the updates?

0

0

0

0

60

Zachary Charles

@MatharyCharles

2 months ago

@Dorialexander Tbh I might even bet that one could make RNNs + SOTA data better than SOTA transformers + old data.

1

7

0

0

3K

Zachary Charles

@MatharyCharles

2 months ago

@Anthony_Bonato To be fair this is a very important step in the mathematical process

0

0

0

0

53

Zachary Charles

@MatharyCharles

2 months ago

Fun fact: you can use this to easily show that for any finite field F and degree n, there is an irreducible polynomial of degree n with coefficients in F.

Algebra Etc. @AlgebraFact

2 months ago

For every field F, there is an algebraically closed field K that contains F.

2

62

2

4

5K

0

1

0

0

298

Zachary Charles

@MatharyCharles

2 months ago

@kalomaze Baba is AGI is honestly not the worst benchmark I can think of

1

17

0

1

795

Zachary Charles

@MatharyCharles

2 months ago

This is just not true. Crack open a copy of Horn & Johnson and behold the many, many non-obvious facts about linear algebra.

little grey mouse 🐭 @mouse_math

2 months ago

probably more than any other field of mathematics, linear algebra strikes me as a bunch of trivial (ie obvious) facts, yet the end result is impressively powerful.

24

644

38

210

49K

0

17

0

9

2K

Zachary Charles

@MatharyCharles

2 months ago

This is a very good paper, highly recommend. My one gripe: the single-learner DiLoCo case (called SNOO in the Meta paper) was known to improve AdamW well before it was called SNOO, e.g. see:
* https://t.co/DMWYyc6cQA - Table 3
* https://t.co/9iCIk8oZEb - Figures 2 & 3 (and more)
I guess we should've given it a fun name to take credit for this?

Hao-Jun Michael Shi @hjmshi

2 months ago

1/10 Are DiLoCo and Schedule-Free actually related? A brief history and unusually late advertisement for our work: Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs (see https://t.co/ESaSU8kwpx).

1

37

12

21

10K

0

19

2

11

2K

Zachary Charles

@MatharyCharles

2 months ago

@anpaure @Romy_Holland I think it unfortunately gets worse the more specialized a topic is. Wikipedia is famously bad for many math concepts, and in other fields is often used to drum up attention to people's (bad) papers.

0

1

0

0

102

Zachary Charles

@MatharyCharles

2 months ago

@jxbz @yuxiangw_cs Very cool! Can you point me to anything that talks about what you referenced regarding signSGD and compute optimality for LLMs?

1

2

0

0

142

Zachary Charles

@MatharyCharles

2 months ago

@eliebakouch My job as an AI researcher is preventing me from spending time learning about AI research. Many such cases.

0

5

0

0

235

Zachary Charles

@MatharyCharles

2 months ago

@natalienkhalil @arxiv Why is it imperative to have survey papers allowed on arXiv? There are so many ways to disseminate such things, whereas in many fields of math (for example), new submissions to arXiv are used as a signal for what to read in a given day.

1

1

0

0

53
