Rishab Parthasarathy @rishab_partha - Twitter Profile

We made Muon run up to 2x faster for free! Introducing Gram Newton-Schulz: a mathematically equivalent but computationally faster Newton-Schulz algorithm for polar decomposition. Gram Newton-Schulz rewrites Newton-Schulz such that instead of iterating on the expensive rectangular X matrix, we iterate on the small, square, symmetric XX^T Gram matrix to reduce FLOPs. This allows us to make more use of fast symmetric GEMM kernels on Hopper and Blackwell, halving the FLOPs of each of those GEMMs. Gram Newton-Schulz is a drop-in replacement of Newton-Schulz for your Muon use case: we see validation perplexity preserved within 0.01, and share our (long!) journey stabilizing this algorithm and ensuring that training quality is preserved above all else. This was a super fun project with @noahamsel, @berlinchen, and @tri_dao that spanned theory, numerical analysis, and ML systems! Blog and codebase linked below 🧵
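The rewrite described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: it uses the textbook cubic Newton-Schulz map X <- (1.5 I - 0.5 X X^T) X rather than whatever tuned coefficients or symmetric GEMM kernels the linked codebase uses. The helper names (`matmul`, `step_matrix`, `normalize`, `newton_schulz`, `gram_newton_schulz`) are hypothetical. The Gram variant tracks G_k = X_k X_k^T and an accumulated m x m factor P_k, so the loop never touches the rectangular matrix.

```python
# Illustrative sketch only: classic cubic Newton-Schulz, plain Python lists.
# Function names are hypothetical and not taken from the announced codebase.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def step_matrix(g):
    """Return Q = 1.5*I - 0.5*G for a square matrix G."""
    n = len(g)
    return [[(1.5 if i == j else 0.0) - 0.5 * g[i][j] for j in range(n)]
            for i in range(n)]

def normalize(x):
    """Scale by the Frobenius norm so all singular values are at most 1."""
    scale = sum(v * v for row in x for v in row) ** 0.5
    return [[v / scale for v in row] for row in x]

def newton_schulz(x, steps):
    """Baseline: every step multiplies by the rectangular m x n iterate."""
    for _ in range(steps):
        q = step_matrix(matmul(x, transpose(x)))  # m x m
        x = matmul(q, x)                          # m x n product each step
    return x

def gram_newton_schulz(x, steps):
    """Same iterates in exact arithmetic; the loop only uses m x m matrices."""
    m = len(x)
    g = matmul(x, transpose(x))  # Gram matrix G_0 = X X^T, formed once
    p = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]  # P_0 = I
    for _ in range(steps):
        q = step_matrix(g)
        p = matmul(q, p)             # P_{k+1} = Q_k P_k
        g = matmul(matmul(q, g), q)  # G_{k+1} = Q_k G_k Q_k (Q_k is symmetric)
    return matmul(p, x)              # one rectangular product at the end
```

Calling both functions on `normalize(x)` for a wide matrix `x` (more columns than rows) should give matching results up to floating-point error. The saving comes from the loop working on m x m symmetric products, which is where symmetric GEMM kernels can apply.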

jcz42's tweet photo.

17

1K

164

665

218K

Rishab Parthasarathy @rishab_partha

over 1 year ago

Had a great time working with the ARC Prize team testing r1! r1 seems to get o1-preview perf and is cheap too — another great OSS model in the ecosystem thanks to @mikeknoop @GregKamradt @ishanit5 for this opportunity!

ARC Prize

@arcprize

over 1 year ago

Verified DeepSeek performance on ARC-AGI's Public Eval (400 tasks) + Semi-Private (100 tasks)

DeepSeek V3:
* Semi-Private: 7.3% ($.002)
* Public Eval: 14% ($.002)

DeepSeek Reasoner:
* Semi-Private: 15.8% ($.06)
* Public Eval: 20.5% ($.05)

(Avg $ per task)

19

1K

104

281

292K

3

38

1

4K

rishab_partha retweeted

Mathew Jacob @mat_jacob1002

over 1 year ago

It's time to revisit common assumptions in IR! Embeddings have improved drastically, but mainstream IR evals have stagnated since MSMARCO and BEIR. We ask: on private or tricky IR tasks, are current rerankers even better? Surely, reranking as many docs as you can afford is best? https://t.co/sLwCHMH2x6

5

150

42

74

46K

rishab_partha retweeted

Shashank Rajput @shashank_r12

over 1 year ago

Want to train inference-friendly models that use less memory and have higher throughput? We show that KV cache sharing between layers and adding sliding window layers can speed up inference while maintaining model quality. https://t.co/x83VqACq2h https://t.co/c0ROwZwo2I

5

110

19

55

16K

Rishab Parthasarathy @rishab_partha

over 1 year ago

@vitaliychiley MIT :)

0

2

0

101

Rishab Parthasarathy @rishab_partha

almost 2 years ago

@sumanthrh @maxisawesome538 we try our best

0

2

0

67

Rishab Parthasarathy @rishab_partha

almost 2 years ago

@1thousandfaces_ @maxisawesome538 that one we can make

0

5

0

356

Rishab Parthasarathy @rishab_partha

almost 2 years ago

@bilaltwovec @maxisawesome538 actually not shutterstock but just base sdxl :( might be too boomer for brat summer

0

3

0

90

Rishab Parthasarathy @rishab_partha

almost 2 years ago

trained a bufo lora for fun on the side .... it's now on twitter the team at Mosaic is actually so fun

Max ⛅

@maxisawesome538

almost 2 years ago

credit to @rishab_partha !!! including some custom generated bufos here. disco bufo. bufo in amsterdam. bufo goes to yank sing. bufo lifts weights https://t.co/w9lu0COAyr

1

9

0

1

4K

0

27

4

0

4K

rishab_partha retweeted

Zack Ankner

@ZackAnkner

almost 2 years ago

Excited to announce our new work: Critique-out-Loud (CLoud) reward models. CLoud reward models first produce a chain of thought critique of the input before predicting a scalar reward, allowing reward models to reason explicitly instead of implicitly! https://t.co/CnYEDM36no

14

261

59

154

71K

rishab_partha retweeted

Dan Biderman

@dan_biderman

almost 2 years ago

*LoRA Learns Less and Forgets Less* is now out in its definitive edition in TMLR🚀 Checkout the latest numbers fresh from the @DbrxMosaicAI oven 👨‍🍳

5

82

20

45

36K

rishab_partha retweeted

Mansheej Paul

@mansiege

almost 2 years ago

Pretraining data ablations are expensive: how can we measure data quality fast and cheap? If you're at ICML, come find out at the ES-FoMo poster session today in Lehar 2 at 1 pm: https://t.co/pAMIWU7n0S

0

41

13

8

6K

Rishab Parthasarathy @rishab_partha

almost 2 years ago

I’m going to be at ICML 2024 presenting a workshop paper on 3D video modeling! Down to talk about anything LLMs, multimodality / new CV modalities, and ML systems!

0

6

0

334

rishab_partha retweeted

Zack Ankner

@ZackAnkner

almost 2 years ago

Hydra was accepted to COLM! Going to be dropping some new perf improvements and batched decoding support as well soon 😁

2

61

5

8

6K

Rishab Parthasarathy @rishab_partha

almost 2 years ago

Alternate view of a jumping fox!

0

3

0

178

Rishab Parthasarathy @rishab_partha

almost 2 years ago

We are excited to announce Vid3D, a technique for generating 3D video using only 2D video diffusion models and Gaussian splatting! Paper: https://t.co/RnbnyRZHJU Github: https://t.co/ZmYJEe6hOb Project Page: https://t.co/gYQXnb9xkX

3

31

11

4

8K

Rishab Parthasarathy @rishab_partha

almost 2 years ago

Main view of a jumping fox!

1

3

0

207

