At OpenAI, we're continuing to bet on Rust as the future of systems programming.
I'm proud to announce that we're making a $600,000 commitment to the Rust Foundation, which combines our Platinum membership with additional support for maintainer efforts across the Rust ecosystem.
NVIDIA researchers just brought Rust's ownership model to GPU kernels. 🦀
The paper, "Fearless Concurrency on the GPU", introduces cuTile Rust.
The problem: writing custom GPU kernels in Rust meant stepping outside Rust's safety guarantees entirely.
cuTile Rust fixes that: mutable outputs are split into disjoint pieces, and kernel launches preserve ownership rules from host to device, with local opt-outs when you need raw control.
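Why disjoint splits matter can be seen in plain Rust on the CPU. This is only an analogy, not the cuTile Rust API: the hypothetical function `add_tiled` uses the standard library's `chunks_mut` to give each scoped thread an exclusive `&mut` tile of the output, which is the property that lets the borrow checker rule out write races between workers.

```rust
use std::thread;

/// Hypothetical CPU stand-in for an element-wise "kernel" (not cuTile Rust).
/// Each worker gets exclusive (&mut) access to one tile of `out`
/// and shared (&) access to the matching slices of the inputs.
fn add_tiled(a: &[f32], b: &[f32], out: &mut [f32], tile: usize) {
    assert!(tile > 0, "tile size must be non-zero");
    assert!(a.len() == out.len() && b.len() == out.len());
    thread::scope(|s| {
        // chunks_mut yields non-overlapping &mut tiles, so the compiler
        // can prove that no two workers write the same element.
        for (i, out_tile) in out.chunks_mut(tile).enumerate() {
            let start = i * tile;
            let a_tile = &a[start..start + out_tile.len()];
            let b_tile = &b[start..start + out_tile.len()];
            s.spawn(move || {
                for ((o, x), y) in out_tile.iter_mut().zip(a_tile).zip(b_tile) {
                    *o = x + y;
                }
            });
        }
    }); // all scoped threads are joined here
}

fn main() {
    let a: Vec<f32> = (0..10).map(|i| i as f32).collect();
    let b = vec![1.0_f32; 10];
    let mut out = vec![0.0_f32; 10];
    add_tiled(&a, &b, &mut out, 4);
    println!("{:?}", out);
}
```

Trying to hand the same tile to two workers, or reading `out` while the scope is running, is rejected at compile time; that is the kind of guarantee the post describes carrying over to GPU kernels.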
The performance holds up:
- 7 TB/s for element-wise ops on NVIDIA B200
- 2 PFlop/s for GEMM, 96% of cuBLAS
- Matches cuTile Python within measurement noise
They also built Grout, an inference engine on top of cuTile Rust, and ran real models on it:
- 171 tokens/s for Qwen3-4B on RTX 5090
- 82 tokens/s for Qwen3-32B on B200
- Competitive with vLLM and SGLang
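To put the throughput figures in latency terms, assuming they are single-stream decode rates (the post does not state the batch size), average time per token is 1000 / (tokens/s) milliseconds. A small sketch with a hypothetical helper `ms_per_token`:

```rust
/// Converts decode throughput (tokens/s) into average time per token (ms).
/// Assumes a single stream; for batched throughput this is not per-request latency.
fn ms_per_token(tokens_per_s: f64) -> f64 {
    1000.0 / tokens_per_s
}

fn main() {
    for (model, tps) in [("Qwen3-4B on RTX 5090", 171.0), ("Qwen3-32B on B200", 82.0)] {
        println!("{model}: {:.1} ms/token", ms_per_token(tps));
    }
}
```

Under that assumption, this prints about 5.8 ms/token for Qwen3-4B and 12.2 ms/token for Qwen3-32B.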
Safe, idiomatic Rust at full CUDA performance. This is a big step for Rust in ML infra.
https://t.co/A5rN8ULiLl
#Rust #RustLang #GPU #CUDA #MachineLearning #SystemsProgramming #NVIDIA
Excited to share cuTile Rust: bringing Rust's fearless concurrency to GPU kernel programming. Our paper "Fearless Concurrency on the GPU" is now on arXiv.
For me, the best part is how fast you can iterate: high-performance CUDA kernels developed directly in Rust.
Huge thanks to my co-authors Melih Elibol, Jared Roesch, Isaac Gelado, and Michael Garland on this project.
Finally able to talk about what I've been heads-down on for 6 months at @nvidia 🦀⚡
We just open-sourced cuda-oxide, an experimental rustc backend that lets you write CUDA kernels in pure Rust.
No DSLs. No FFI. No source-to-source step. Single source.
Short 🧵👇
Fearless Concurrency on the GPU
For those interested, @melibol just posted a paper on building a safe Rust kernel programming abstraction on top of Tile IR.
https://t.co/MMPxi4oOEg
A short teaser: the safety is effectively free. On a B200, the safe GEMM is competitive with cuBLAS, reaching about 2 PFlop/s, or 92% of the GPU's dense f16 roofline.
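The 96% figure in the earlier post and the 92% figure here are measured against different baselines (cuBLAS versus the dense f16 roofline). Assuming both describe the same GEMM run and taking the rounded 2 PFlop/s at face value, the implied baselines work out as:

```latex
P_{\text{roofline}} \approx \frac{2\ \text{PFlop/s}}{0.92} \approx 2.17\ \text{PFlop/s},
\qquad
P_{\text{cuBLAS}} \approx \frac{2\ \text{PFlop/s}}{0.96} \approx 2.08\ \text{PFlop/s},
\qquad
\frac{P_{\text{cuBLAS}}}{P_{\text{roofline}}} \approx \frac{0.92}{0.96} \approx 0.96
```

On these numbers the two percentages are consistent with each other, with cuBLAS itself at roughly 96% of the roofline; because "about 2 PFlop/s" is rounded, these are rough estimates only.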
Read more in the paper or Melih's LinkedIn post (https://t.co/jyyfdC2Vc8)
He will also be giving a talk at RustConf in September. Hopefully he will see you there!
cuTile Rust: a safe, tile-based kernel programming DSL for the Rust programming language
https://t.co/BfdGEJ969w
It features a safe host-side API for passing tensors to asynchronously executed kernel functions.
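The posts do not show what that host-side API looks like. One way ownership can make asynchronous launches safe is sketched below, assuming a CPU thread stands in for the GPU; `Tensor`, `KernelHandle`, and `launch_scale` are hypothetical names, not the cuTile Rust API. The launch takes the output tensor by value and only hands it back when the caller waits, so the host cannot touch it while the "kernel" is in flight.

```rust
use std::thread::{self, JoinHandle};

/// Hypothetical tensor type (illustration only, not cuTile Rust).
struct Tensor {
    data: Vec<f32>,
}

/// Hypothetical handle to an in-flight "kernel". It owns the output
/// tensor until `wait` is called.
struct KernelHandle {
    join: JoinHandle<Tensor>,
}

impl KernelHandle {
    /// Blocks until the work finishes and returns ownership of the output.
    fn wait(self) -> Tensor {
        self.join.join().expect("kernel thread panicked")
    }
}

/// Launches a scaling "kernel" asynchronously. The output is moved in,
/// so any host access before `wait` is a compile-time error.
fn launch_scale(mut out: Tensor, factor: f32) -> KernelHandle {
    let join = thread::spawn(move || {
        for x in out.data.iter_mut() {
            *x *= factor;
        }
        out // ownership travels back to the host through the handle
    });
    KernelHandle { join }
}

fn main() {
    let t = Tensor { data: vec![1.0, 2.0, 3.0] };
    let handle = launch_scale(t, 2.0);
    // println!("{:?}", t.data); // would not compile: `t` was moved into the launch
    let t = handle.wait();
    println!("{:?}", t.data);
}
```

Moving the tensor rather than borrowing it keeps the sketch simple; a real API might instead tie a `&mut` borrow to the handle's lifetime, which gives the same "no access until completion" guarantee.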