This repository is a public log of my learning, experiments, and projects as I dive deep into:
- GPU architecture
- CUDA programming
- Memory hierarchies
- Parallelism
- Acceleration for deep learning and scientific computing
https://t.co/MzcmIPOBwe
@IlysMoutawwakil You can do W^T @ X^T instead of W@X and then use smaller N. This is a bit annoying if you want to chain matmuls. You can also just write 2 kernels, small M one wouldn't benefit that much from tcgen05/wgmma https://t.co/DPdCDRGB8v
This article is full of practical, tested techniques for high-performance kernel engineering on AMD.
Here's one of my favorites.
None of this would have been possible without @HotAisle 🫶
cc: @simran_s_arora@_williamhu@realDanFu@Neoblizzz@Drewwad - HK is an inspiration
Finally wrote a blog on the Blackwell 2CTA GEMM kernel I’ve been working on.
It covers the kernel pipeline, buffering, TMA loads, and TMEM-based accumulation.
Feedback is welcome :)
https://t.co/tRcv0X7kxI
Personally hiring Distributed Systems Engineers in San Francisco.
Rust and Python.
One of the fastest growing companies, to date. With the leader in the vertical.
Fearless Concurrency on the GPU
For those interested @melibol just posted a paper on building a safe Rust kernel programming abstraction on top of Tile IR.
https://t.co/MMPxi4oOEg
A short teaser: but the safety is effectively free. On a B200, the safe GEMM is competitive with cuBLAS: about 2 PFlop/s 92% of the GPU's dense f16 roofline.
Read more in the paper or Melih's LinkedIn post (https://t.co/jyyfdC2Vc8)
He will also be giving a talk at RustConf in September, hopefully he will see you there!
Here is a tutorial for uninitiated for QR for those wanting to enter the competition.
All you need to know is a little bit of algebra. Here @ means matrix multiplication.
Let A be a 3 x 3 matrix and task is given A can you find Q and R. A special property of Q is that Q @ Q.T = I and R is a upper triangular matrix:
A find Q and R.
Q @ R = A
[ [0, 1, 0], [ [2, 3, 4], [ [0, 5, 6],
[1, 0, 0], @ [0, 5, 6], = [2, 3, 4],
[0, 0, 1] ] [0, 0, 7] ] [0, 0, 7] ]
The R matrix has a special property:
[ [2, 3, 4],
[0, 5, 6],
[0, 0, 7] ]
We are generally fine with any R matrix with this structure, see the zeros on the lower triangle.
[ [0, 5, 6],
A = [2, 3, 4],
[0, 0, 7] ]
First column of A:
[ [0],
x = [2],
[0] ]
Householder matrix:
[ [0, 1, 0],
H = [1, 0, 0],
[0, 0, 1] ]
Now compute Hx:
[ [0, 1, 0], [ [0],
H x = [1, 0, 0], @ [2],
[0, 0, 1] ] [0] ]
[ [2],
= [0],
[0] ]
Now how do we get H? Well it turns out these are called reflectors. The householder reflectors. Think of putting a mirror on your vector.
fun fact cuBLAS and cuDNN specialize on alignment too and for user-managed heuristics caches like for cuDNN it’s a cache miss if you have bad alignment
the only thing preventing this is default allocator alignment and the implicit agreement that you will never do tensor[…, 1:]
We worked with @lmsysorg and https://t.co/Cg0JsVomui to
- integrate DFlash spec into @sgl_project
- make it faster with overlap
- train a DFlash drafter for @Alibaba_Qwen 397B-A17B
The result: up to 4.3x greater throughput over baseline and 1.5x over native MTP.
We've released the QR problem, a more robust qr_v2 with a fresh leaderboard so please resubmit!
Thank you to @blelbach, @myainotez and @nikhilbarhate99 for sharing feedback. Sorry if I missed anyone!
I considered automatically backfilling all submissions but the rankings do change quite a bit so I figured a refresh would be better.
Changelog
* Fail submissions if they fail when we change random seeds
* Add nasty correctness cases with more degenerate inputs in mixed batches
* Recheck correctness when doing perf testing to avoid Volkswagen cheat
* Reject Nan/Inf residuals
* Validate each matrix factorization residual, since averaging was hiding bad matrices
* Old QR is still open so folks can't see submissions but you can't submit anything to it
Wontfix
* Stream hacking is still banned via very blunt ban of the word "stream" we don't have a good solution for this
* CUDA graphs are allowed but not particularly interesting to us
Best submissions so far if I resubmit their solutions are
Another interesting find. CopyBulkG2SOp is lowered to cp.async.bulk.shared::cluster instead of .shared::cta, and CuteDSL automatically inserts mapa, even when threadblock cluster is not used.
Had to go back to my ole reliable PTX 🔫
Launching a new kernel competition: Linear Algebra Kernels For The Age Of Research.
First problem: batched QR decomposition on B200. Old math, modern hardware.
Prize: Rare swag and hangout in SF
Testing Mythos for GPU kernel generation. I will test it under 3 kernels: DSA, GDN and MoE routing, let's see how it performs over Opus 4.7 that previously won the contest against humans for DSA track.
Finally got some time to port my tcgen05 kernels to CuteDSL. For PTX enjoyers, this should feel natural (except TMA 🤣).
A BF16 MMA mainloop is shown below. I also worked up an example for MXFP8 and NVFP4.
New NanoGPT Speedrun WR at 79.7 (-1.5s) from @TrianX , with a brilliant solution to hash collisions on the bigram hash embedding. Instead of every bigram in a bucket returning the same embed, a secondary hash gives each bigram its own ±1 sign pattern (one of 8192), applied element-wise, e.g. x·[1,−1,1,1,…]. Each bigram in the bucket then reads a different partial reflection of the one stored row. https://t.co/2OLWs5osOG