🦀 Rust Tech | @Nvidia | Make general-purpose GPU programming accessible. Disclaimer: The views and opinions expressed are my own (not my employer's)
Finally able to talk about what I've been heads-down on for 6 months at @nvidia 🦀⚡
We just open-sourced cuda-oxide: an experimental rustc backend that lets you write CUDA kernels in pure Rust.
No DSLs. No FFI. No source-to-source step. Single source.
Short 🧵
We can now fully rewrite most software in @leanprover and prove it correct:
- Compiler module rewrite (AI) from Rust to Lean
- Full FFI integration
- All unit and integration tests pass
- Formal spec and proofs!!
- Under 20h wall time (unnoticed pauses)
https://t.co/u601dZ8wph
@Nekrolm @roeschinc @nvidia The index_2d bug fix has landed. The latest vecadd example uses the more ergonomic API, get_mut_indexed.
If you'd like to look at the entire fix: https://t.co/PssDX87Q8I. Feel free to reopen if you have thoughts.
Thanks for flagging this again.
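For readers new to 2D kernel indexing, here is a minimal CPU-side sketch of the arithmetic a helper like index_2d typically performs, assuming a row-major layout and CUDA-style block/thread coordinates. `Dim2`, `global_xy`, and this `index_2d` signature are hypothetical illustrations, not cuda-oxide's API, and the sketch does not reproduce the actual bug or its fix.

```rust
/// Hypothetical stand-in for CUDA's blockIdx / blockDim / threadIdx (x, y only).
/// Not a cuda-oxide type.
#[derive(Clone, Copy)]
struct Dim2 {
    x: usize,
    y: usize,
}

/// Global (column, row) of one thread in a 2D launch grid.
fn global_xy(block_idx: Dim2, block_dim: Dim2, thread_idx: Dim2) -> (usize, usize) {
    (
        block_idx.x * block_dim.x + thread_idx.x,
        block_idx.y * block_dim.y + thread_idx.y,
    )
}

/// Hypothetical helper: row-major linear index of (col, row) in a `width` x `height`
/// matrix. Returns None for threads outside the matrix so the caller can skip them.
fn index_2d(col: usize, row: usize, width: usize, height: usize) -> Option<usize> {
    if col < width && row < height {
        Some(row * width + col)
    } else {
        None
    }
}

fn main() {
    // Simulate a 2x2 grid of 2x2 blocks (4x4 threads) covering a 3x3 matrix.
    let (width, height) = (3, 3);
    let block_dim = Dim2 { x: 2, y: 2 };
    let mut writes = vec![0u32; width * height];
    for by in 0..2 {
        for bx in 0..2 {
            for ty in 0..block_dim.y {
                for tx in 0..block_dim.x {
                    let (col, row) =
                        global_xy(Dim2 { x: bx, y: by }, block_dim, Dim2 { x: tx, y: ty });
                    if let Some(i) = index_2d(col, row, width, height) {
                        writes[i] += 1;
                    }
                }
            }
        }
    }
    // Every element is written exactly once; the 7 out-of-range threads are skipped.
    assert!(writes.iter().all(|&n| n == 1));
    println!("{writes:?}");
}
```

The bounds check returning `Option` is one design choice among several; it keeps out-of-range threads from writing past the end of the buffer when the grid overshoots the matrix.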
@hazle111753854 @nvidia @VaivaswathaN We haven't done an actual comparison yet, but so far the experience is that it's on par with CUDA C++.
PS: It needs thorough validation, though.
I know our community won’t be around come June (as “X” has other plans).
But before we go, I thought I’d drop something I’ve been working on for a while here.
@dr_sensor Just a small correction: kernel code doesn’t support async/await, but we can do async kernel launches. We have a few examples that use the #tokio runtime to launch GPU work asynchronously.
@VaivaswathaN 💯 I can say this after having worked with it over the past few months. Pliron addresses all of cuda-oxide's needs: it's extensible, Rust-native, and a breeze to debug. Love it!
46 worked examples shipped: async MLP, cross-crate kernels, and device FFI between Rust and C++/CCCL.
Still alpha and under active development, so expect bugs, missing features, and API churn. We think it's a good start.
⭐ https://t.co/ongDJBi7Ew
https://t.co/OpYHBUd8YP
@Nekrolm @roeschinc @nvidia That said, I do think it's the more ergonomic (i.e. Rusty) option if perf is not an issue. On another note, I'm working on an oxide-native MIR inliner pass, which should solve the perf problem.
@Nekrolm @roeschinc @nvidia Tried that first. But the perf cost adds up quickly for trivial index arithmetic: at the method-call boundary the no-param `get_mut()` doesn't inline, so the index is computed twice, plus call overhead. So I went with the ThreadIndex version.
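The get_mut() vs ThreadIndex trade-off in the replies to @Nekrolm can be illustrated with a CPU-side simulation. This is a minimal sketch under stated assumptions: `LaunchCoords`, `OutSlot`, this `ThreadIndex` newtype, and the `vecadd_*` functions are hypothetical stand-ins, not cuda-oxide's actual types or API. A counter tallies how often the global index is computed. On a real GPU, whether the duplicate arithmetic survives depends on whether the compiler inlines the `get_mut()` call.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// CPU proxy for "how much index arithmetic the kernel performed".
static INDEX_COMPUTATIONS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical 1D launch coordinates for one simulated GPU thread.
#[derive(Clone, Copy)]
struct LaunchCoords {
    block_idx: usize,
    block_dim: usize,
    thread_idx: usize,
}

/// Global thread index; every evaluation is counted.
fn global_index(c: LaunchCoords) -> usize {
    INDEX_COMPUTATIONS.fetch_add(1, Ordering::Relaxed);
    c.block_idx * c.block_dim + c.thread_idx
}

/// Hypothetical output wrapper whose no-parameter `get_mut()` derives the index
/// itself, repeating arithmetic the kernel body already did.
struct OutSlot<'a> {
    data: &'a mut [f32],
    coords: LaunchCoords,
}

impl<'a> OutSlot<'a> {
    fn get_mut(&mut self) -> Option<&mut f32> {
        let i = global_index(self.coords); // second computation of the same index
        self.data.get_mut(i)
    }
}

/// Hypothetical token: the index is computed once and then reused.
#[derive(Clone, Copy)]
struct ThreadIndex(usize);

impl ThreadIndex {
    fn new(c: LaunchCoords) -> Self {
        ThreadIndex(global_index(c))
    }
}

/// vecadd using the no-parameter accessor.
fn vecadd_get_mut(a: &[f32], b: &[f32], out: &mut [f32], c: LaunchCoords) {
    let i = global_index(c); // first computation, for the input reads
    let mut slot = OutSlot { data: out, coords: c };
    if let (Some(x), Some(y), Some(o)) = (a.get(i), b.get(i), slot.get_mut()) {
        *o = x + y;
    }
}

/// vecadd using one precomputed ThreadIndex for both reads and the write.
fn vecadd_thread_index(a: &[f32], b: &[f32], out: &mut [f32], c: LaunchCoords) {
    let ThreadIndex(i) = ThreadIndex::new(c); // computed exactly once
    if let (Some(x), Some(y), Some(o)) = (a.get(i), b.get(i), out.get_mut(i)) {
        *o = x + y;
    }
}

/// Runs `kernel` once per simulated thread; returns (output, index computations).
fn launch(kernel: fn(&[f32], &[f32], &mut [f32], LaunchCoords), n: usize) -> (Vec<f32>, usize) {
    let block_dim = 4;
    let a: Vec<f32> = (0..n).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..n).map(|i| 2.0 * i as f32).collect();
    let mut out = vec![0.0f32; n];
    INDEX_COMPUTATIONS.store(0, Ordering::Relaxed);
    for block_idx in 0..(n + block_dim - 1) / block_dim {
        for thread_idx in 0..block_dim {
            kernel(&a, &b, &mut out, LaunchCoords { block_idx, block_dim, thread_idx });
        }
    }
    (out, INDEX_COMPUTATIONS.load(Ordering::Relaxed))
}

fn main() {
    let (out1, calls1) = launch(vecadd_get_mut, 8);
    let (out2, calls2) = launch(vecadd_thread_index, 8);
    assert_eq!(out1, out2); // same results, different amount of index arithmetic
    println!("get_mut(): {calls1} index computations, ThreadIndex: {calls2}");
}
```

Over 8 simulated threads this prints 16 index computations for the `get_mut()` style versus 8 for the `ThreadIndex` style. A pass that inlines the accessor would let the compiler deduplicate that arithmetic, which is the gap an inliner is meant to close.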