Fearless Concurrency on the GPU
For those interested @melibol just posted a paper on building a safe Rust kernel programming abstraction on top of Tile IR.
https://t.co/MMPxi4oOEg
A short teaser: but the safety is effectively free. On a B200, the safe GEMM is competitive with cuBLAS: about 2 PFlop/s 92% of the GPU's dense f16 roofline.
Read more in the paper or Melih's LinkedIn post (https://t.co/jyyfdC2Vc8)
He will also be giving a talk at RustConf in September, hopefully he will see you there!
I implemented @GoogleResearch's TurboQuant as a CUDA-native compression engine on Blackwell B200.
5x KV cache compression on Qwen 2.5-1.5B, near-loseless attention scores, generating live from compressed memory.
5 custom cuTile CUDA kernels ft:
- fused attention (with QJL corrections)
- online softmax
-on-chip cache decompression
- pipelined TMA loads
Try it out: https://t.co/m5vkJxWIY6
s/o @blelbach and the cuTile team at @nvidia for lending me Blackwell GPU access :)
cc @sundeep@GavinSherry
🌅 BASIC is BACK!
In response to overwhelming demand from seasoned developers everywhere, we’re releasing cuTile BASIC for GPUs, bringing CUDA Tile programming to this long-overlooked language.
🧵 👇
Today, NVIDIA is launching the next paradigm shift in GPU programming: cuTile BASIC
Write perf portable BASIC kernels and deploy them at any scale from edge inference devices like your calculator to entire GPU clusters
We're going back to BASIC
https://t.co/meF2T0jUSc
The CUDA Tile roadmap:
- SIMT/Tile interop.
- Comms.
- New frontend languages.
Come to my talk at GTC in 30 minutes to learn more.
https://t.co/bSqhnviRRc
🎉 CUDA 13.2 just dropped, and GPU programming just got simpler.
This release expands CUDA Tile support to Ampere and Ada GPUs while delivering a stronger CUDA Python stack for cluster-scale workloads.
What's new:
✅ Install cuTile Python directly from PyPI: pip install cuda-tile
✅ Enhanced CUDA Python profiling and debugging across Numba-CUDA flows and Nsight tools
✅ Modern CUDA C++ and refreshed math libraries optimized for AI and HPC kernels
Ready to accelerate your workflows?
📝 Read the technical deep dive: https://t.co/pE5UcJZqXU
First we had one child and I thought I knew what children are like. Our second child was completely different; I’d overgeneralized. There are actually two types of children.
@karpathy Creating software might evolve into starting with a “blank slate” app, interacting with it until it’s sufficiently trained and save the “image”, which will continue to be incrementally trained by normal usage.
I’ll be giving a talk on TVM-FFI at @GPU_MODE this week! We will discuss how open ABI and FFI facilitate a fast, robust, and seamless framework interop experience across DSLs and kernel libraries.
I wrote a short proof showing that any self-hosting compiler cannot perform certain legal optimizations.
Would love feedback from compiler folks - does the proof look correct, and is it already well known?
Link: https://t.co/6rg61YhO1r
@oisyn@sparr0 I think the argument needs to be made more rigorous - there is no requirement that the compiler constant folds Compile(#P) to optimize the comparison. E.g. a compiler can optimize `X+1-1==X` to true without constant folding the LHS (`X+1-1`).
@oisyn@sparr0 > As #P includes itself, it needs to evaluate itself for constant folding
This could be an alternate proof to why the compiler cannot optimize the comparison `Compile(#P) == ...`. However, (contd)