NVIDIA’s CuTe layouts are gaining traction. I wanted to see why everyone loves them. The basics were easy, but intermediate resources beyond matrix algebra were scarce. So I wrote a blog post sharing my journey, building up to a GEMM kernel that can beat cuBLAS 🧵
Nice one! I can tell you put a lot of effort into this post. I’ve started reading it and will need some time to go through it all :)
I did it the opposite way: I started with CuTe (a bit of self-promotion: I also have a blog post) and am now looking into the mxfp8 and fp4 stuff.
@yacinelearning @jonashubotter Increasing reasoning trace length is usually a good proxy for a healthy GRPO run. This method, to me, produces a strong instruct model that performs well without verbose reasoning (which is fantastic). However, for OOD cases, preserving backtracking likely still matters.
@yacinelearning @jonashubotter Don’t get me wrong, I love the paper. However, I see some weaknesses in the method. Generalization may be challenging because reasoning traces are heavily reduced, pushing the model to jump straight to the correct answer (https://t.co/TcXOG7bruq).
@karpathy @maxbittker For me, it helped to create a human-opus-interaction.txt file for outputs and interaction, give the model an absurd and unrealistic goal (kernel duration, target loss …), and tell it not to come back to me until it had achieved it. That prolonged the loop significantly.
@maharshii When I started trying it out last summer, it was awful. Since this year, given some initial ideas, it’s been pretty neat. Running it in a loop to improve an implementation also works quite well.
@tri_dao Once I got the hang of CuTe I'm loving it as well. The compile time is amazing!
But the entry barrier feels huge. Feels like you need to know CUDA and have a PhD in math before you can even begin
Last week we explored NVIDIA's CuTe layouts. Today, we put that theory into practice. Part 2 is out now! Most CuTe examples skip straight to highly optimized code without explaining the reasoning. Join me as we build a matrix multiplication kernel with CuTe that can beat cuBLAS in certain cases 🧵
The final kernel reaches 116% of cuBLAS performance on A100s for 2048×2048 BF16 matrices. Though to be fair, the specific lead at 2048 is more likely due to cuBLAS underperforming at that size than to my ingenuity.