Excited to share new ThunderKittens attention kernels that match or outperform Flash Attention 4 on Blackwell GPUs! They currently support only QK192/V128 shapes, with more coming soon.
Check out the code here: https://t.co/cYtVw0x34r
Shoutout to the FA4 team for the algorithmic innovations and to @stuart_sul for the helpful discussions.
(1/7) We're releasing ThunderKittens 2.0! Faster kernels, cleaner code, industry contributions, and new state-of-the-art BF16 / MXFP8 / NVFP4 GEMMs that match or surpass cuBLAS!
Alongside this release, we’re equally excited to share some insights we gained while squeezing every last TFLOP out of Blackwell:
(with @hazyresearch & generously supported by @cursor_ai)
my first blogpost related to GPUs! this one looks at pyutils, a small but important part of the ThunderKittens library that allows kernels to be launched with PyTorch. https://t.co/igBv4WRDlk
The @ilyasut episode
0:00:00 – Explaining model jaggedness
0:09:39 – Emotions and value functions
0:18:49 – What are we scaling?
0:25:13 – Why humans generalize better than models
0:35:45 – Straight-shotting superintelligence
0:46:47 – SSI’s model will learn from deployment
0:55:07 – Alignment
1:18:13 – “We are squarely an age of research company”
1:29:23 – Self-play and multi-agent
1:32:42 – Research taste
Look up Dwarkesh Podcast on YouTube, Apple Podcasts, or Spotify. Enjoy!