Piyush Sao @piyusch - Twitter Profile

Pinned Tweet

3 months ago

Existing DL literature on training loss and instability looks at linear and quadratic coefficients to analyze convergence, when the Taylor series actually diverges: https://t.co/01NCG8kmgT

3

4

0

1

502

Piyush Sao

@piyusch

2 days ago

@boknilev The numerous appendices could be due to the original paper being very long, with the NeurIPS submission format forcing them to place all the proofs and setup details in the appendix.

2

10

0

1

2K

Piyush Sao

@piyusch

9 days ago

Bottle gourd is my favorite. How you cut it makes all the difference; slices that are not too thin work best. Cook it with chana dal and tomato to a slightly watery consistency, use the usual Indian spices (but keep them light so they don't overwhelm its taste), and add crushed fresh coriander at the end.

0

1

0

189

Piyush Sao

@piyusch

19 days ago

@_arohan_ What is the measured cost of a Shampoo step relative to Muon or AdamW?

0

308

Piyush Sao

@piyusch

19 days ago

@zdeborova I wonder if analyzing the complex zeros of your high-dimensional attention model's attention partition functions along SGD directions could connect staged specialization to step-size safety meaningfully.

0

20

Piyush Sao

@piyusch

19 days ago

@zdeborova Paper: https://t.co/yX56cjrIEB. Complex zeros of the softmax partition function (the “ghosts of softmax”) create singularities that limit the Taylor convergence radius of the cross-entropy loss.

1

0

2

33
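A minimal sketch of the idea in the tweet above, for the two-class case only. It is an illustration under stated assumptions, not the linked paper's method: the logits are assumed to move linearly along a direction, z_k(t) = a_k + t*b_k, and the names taylor_radius, nearest_zero, and partition are hypothetical. The partition function Z(t) = exp(z_0(t)) + exp(z_1(t)) vanishes where (a_0 - a_1) + t*(b_0 - b_1) = i*pi*(2m+1), so the Taylor series of log Z(t) around t = 0 cannot converge beyond the distance to the nearest such zero.

```python
import cmath
import math


def partition(a, b, t):
    """Softmax partition function Z(t) = sum_k exp(a_k + t*b_k), t may be complex."""
    return sum(cmath.exp(ak + t * bk) for ak, bk in zip(a, b))


def nearest_zero(a, b):
    """Complex zero of the two-class Z(t) closest to t = 0 (m = 0 branch).

    Returns None when both logits move at the same rate, in which case
    Z(t) has no zeros along this direction.
    """
    da = a[0] - a[1]  # logit gap at t = 0
    db = b[0] - b[1]  # rate at which the gap changes with t
    if db == 0:
        return None
    return complex(-da, math.pi) / db


def taylor_radius(a, b):
    """Distance from t = 0 to the nearest zero of Z: sqrt(da^2 + pi^2) / |db|."""
    da = a[0] - a[1]
    db = b[0] - b[1]
    if db == 0:
        return math.inf
    return math.hypot(da, math.pi) / abs(db)


if __name__ == "__main__":
    a, b = (2.0, 0.0), (1.0, -1.0)  # illustrative values only
    t0 = nearest_zero(a, b)
    print("nearest zero:", t0)
    print("|Z(t0)|:", abs(partition(a, b, t0)))
    print("Taylor radius:", taylor_radius(a, b))
```

In this toy setting the radius shrinks as the logits separate faster along the step direction (larger |b_0 - b_1|), which is one way to read the claim that these zeros bound how far a local polynomial model of the loss can be trusted.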

Piyush Sao

@piyusch

22 days ago

@mister_whistler @billy_boi6 You don't get to choose

0

77

Piyush Sao

@piyusch

25 days ago

@thegautamkamath Congratulations! Well deserved 👏

0

1

0

353

Piyush Sao

@piyusch

27 days ago

@yxy2168 are you sure about the proof? https://t.co/gkz3y5aANd

Piyush Sao

@piyusch

about 1 month ago

Really interesting paper. I’m confused by Appendix H: if the number of (W)-sparse networks is (W^{O(W)}), then a (2^{-W}) weight-decay prior seems too weak to normalize over variable architectures, since (W^{O(W)}2^{-W}) grows like (2^{O(W\log W)-W}). Is there an implicit canonical encoding or fixed architecture restriction I’m missing?

0

819

0

67
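The arithmetic behind the Appendix H question can be spelled out as follows. This is only a worked restatement of the tweet, assuming the hidden constant in the O(W) is some c > 0, so that the count of W-sparse networks is taken to be W^{cW}:

```latex
\[
\sum_{W} W^{cW}\, 2^{-W}
  \;=\; \sum_{W} 2^{\,cW\log_2 W \,-\, W}
  \;=\; \sum_{W} 2^{\,W\,(c\log_2 W - 1)}
\]
```

The exponent W(c log_2 W - 1) is positive once W > 2^{1/c}, so under this assumption the terms grow without bound and the sum over architectures cannot be normalized by the 2^{-W} factor alone.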

piyusch retweeted

Mao Keji | मुखर्जी

@kejimao

28 days ago

The diagram below visualizes what brain drain looks like in India's case. There's a strong positive relationship between the PISA scores of natives in countries and the scores of second-generation immigrants whose ancestry is from those countries. That India is the most striking outlier in the diagram tellingly demonstrates how the West harvests and siphons off the brightest Indian talent, as measured by average PISA scores, much more aggressively than it does to other countries.

28

153

29

56

35K

Piyush Sao

@piyusch

about 1 month ago

@scottnarmstrong I hope it's sarcasm (it doesn't read that way to me).

1

2

0

1K

Piyush Sao

@piyusch

about 1 month ago

@Clashluke @HessianFree GN-CG (assuming it's Gauss-Newton with CG) is heavier: it chooses the direction via repeated (Jv/J^\top v) products, not just the scalar step size. So it doesn't give you an optimal LR for an Adam/PSGD direction. But the GN-CG step is still bounded by the Taylor radius.

0

1

0

77

Piyush Sao

@piyusch

about 2 months ago

@Math_files People back then had a lot of free time

4

463

2

0

20K

Piyush Sao

@piyusch

about 2 months ago

@tokenbender I don't think the paper's step sizing would work for softmax CE, but do give it a try and share your findings.

0

1

0

48

Piyush Sao

@piyusch

about 2 months ago

@f14bertolotti I agree that Muon is not special. But the paper's quadratic loss model is exponentially wrong for softmax CE, so its LR theory won't translate to real-world LLM/transformer training.

0

1

0

416

Piyush Sao

@piyusch

about 2 months ago

@fly51fly The top NTK eigenvalue will exponentially overestimate the safe step size for softmax CE, so while the theory is elegant, it doesn't apply to LLM/transformer training: https://t.co/yX56cjrIEB

0

112