Eric Schreiber

@schreiberic

Faster models, bigger questions

Joined January 2021

194 Following

226 Followers

44 Posts

Pinned Tweet

Eric Schreiber @schreiberic

4 months ago

NVIDIA’s CuTe layouts are gaining traction. I wanted to see why everyone loves them. The basics were easy, but intermediate resources beyond matrix algebra were scarce. So I wrote a blog post sharing my journey, building up to a GEMM kernel that can beat cuBLAS 🧵

https://t.co/bE7SBfnMlR

6

117

11

152

13K

Eric Schreiber @schreiberic

about 1 month ago

@tugot17 @UniofOxford So cool! congrats 🙌🏻

1

0

0

0

34

Eric Schreiber @schreiberic

about 1 month ago

@TomaszSternal Huge congrats 🙌🏻

0

1

0

0

30

Eric Schreiber @schreiberic

2 months ago

At ICLR. Let’s connect and chat: hardware, CUDA, architecture, pre/post-training, and whatever’s got you excited.

0

4

0

0

155

Eric Schreiber @schreiberic

2 months ago

@tugot17 🫡 Great as always. Why do you think so few models use continued pre-training to sparsify their attention?

1

1

0

0

189

Eric Schreiber @schreiberic

3 months ago

Nice one! I can tell you put a lot of effort into this post. I’ve started reading it and will need some time to go through it all :) I did it the opposite way: I started with CuTe (bit of self-promotion: I also have a blog post on it) and am now looking into the mxfp8 and fp4 stuff.

0

0

0

0

167

Eric Schreiber @schreiberic

3 months ago

@yacinelearning @jonashubotter Increasing reasoning trace length is usually a good proxy for a healthy GRPO run. This method, to me, produces a strong instruct model that performs well without verbose reasoning (which is fantastic). However, for out-of-distribution (OOD) cases, preserving backtracking likely still matters.

1

1

0

0

41

Eric Schreiber @schreiberic

3 months ago

@yacinelearning @jonashubotter Don’t get me wrong, I love the paper. However, I see some weaknesses in the method. Generalization may be challenging because reasoning traces are heavily reduced, pushing the model to jump straight to the correct answer (https://t.co/TcXOG7bruq).

2

2

0

0

77

Eric Schreiber @schreiberic

3 months ago

@SzymonOzog_ you realise just how much you’re standing on the shoulders of giants

0

1

0

0

27

Eric Schreiber @schreiberic

3 months ago

@willccbb Thought this recent work from @jonashuebotter was pretty cool: For SFT: https://t.co/fHiGjDCsTl For RL: https://t.co/RxNo5TjxGP

0

2

0

1

146

Eric Schreiber @schreiberic

3 months ago

@karpathy @maxbittker For me it helped to create a human-opus-interaction.txt file for outputs and interaction, telling the model an absurd and unrealistic goal (kernel duration, target loss …) and not to come back to me until it has achieved it. Prolonged the loop significantly

0

0

0

0

22

Eric Schreiber @schreiberic

4 months ago

@maharshii When I started trying it out last summer it was awful. Since this year, given some initial ideas, it’s been pretty neat. Running it in a loop to improve an implementation also works quite well.

0

0

0

0

195

Eric Schreiber @schreiberic

4 months ago

@tri_dao Once I got the hang of CuTe I'm loving it as well. The compile time is amazing! But the entry barrier feels huge. Feels like you need to know CUDA and have a PhD in math before you can even begin

0

0

0

0

70

Eric Schreiber @schreiberic

4 months ago

Feedback very welcome. Blog: https://t.co/lnIdhrpujO Code & Profiles: https://t.co/Nsd9hOHOIP

0

2

0

0

247

Eric Schreiber @schreiberic

4 months ago

Last week we explored NVIDIA's CuTe layouts. Today, we put that theory into practice. Part 2 is out now! Most CuTe examples skip straight to highly optimized code without explaining the reasoning. Join me as we build a matrix-multiplication kernel with CuTe that can beat cuBLAS in certain cases 🧵

https://t.co/HIWEz0vmOJ

3

190

22

189

8K

Eric Schreiber @schreiberic

4 months ago

The final kernel reaches 116% of cuBLAS on A100s for 2048×2048 BF16 matrices. Though to be fair, the lead at 2048 is more likely due to cuBLAS underperforming at that size than to my ingenuity.

1

1

0

0

262

Eric Schreiber @schreiberic

4 months ago

@elliotarledge check out https://t.co/aq1Lfzc6Fp

0

0

0

0

17
