I was reading the blog post "How we achieved truly serverless GPUs" by Modal (https://t.co/CQ39MppPwG). The engineering is pretty insane. I came across "CUDA checkpoints," which I had somehow never seen before. They can be used alongside CRIU to checkpoint a process's state to disk and restore it later.
So I added it to my Rust CUDA crate.
API design for my NCCL Rust wrapper (not final).
It creates a device group with one rank per device, creates send/recv buffers, and runs an all-reduce (take every rank’s input buffer, sum element-wise across ranks, and write the same result to every rank). It hides direct calls to `ncclCommInitAll`, `ncclGroupStart`, etc., and hopefully makes NCCL easier to use correctly.
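To make the all-reduce semantics concrete, here's a CPU-only sketch in plain Rust. It doesn't touch NCCL or the wrapper at all: `all_reduce_sum` is a made-up name for illustration, and each `Vec<f32>` just stands in for one rank's device buffer.

```rust
/// CPU-only model of a sum all-reduce (illustrative, not the wrapper's API).
/// `inputs[r]` is rank r's input buffer; the result holds one output buffer
/// per rank, and every output is the same element-wise sum.
fn all_reduce_sum(inputs: &[Vec<f32>]) -> Vec<Vec<f32>> {
    // Every rank must contribute a buffer of the same length.
    let len = inputs.first().map_or(0, |buf| buf.len());
    assert!(
        inputs.iter().all(|buf| buf.len() == len),
        "all ranks must pass equal-length buffers"
    );

    // Reduce: sum element-wise across ranks.
    let mut sum = vec![0.0f32; len];
    for buf in inputs {
        for (acc, x) in sum.iter_mut().zip(buf) {
            *acc += *x;
        }
    }

    // Broadcast: every rank gets its own copy of the reduced buffer.
    vec![sum; inputs.len()]
}

fn main() {
    // Three "ranks", each with a three-element input buffer.
    let inputs = vec![
        vec![1.0, 2.0, 3.0],
        vec![10.0, 20.0, 30.0],
        vec![100.0, 200.0, 300.0],
    ];
    let outputs = all_reduce_sum(&inputs);
    for (rank, out) in outputs.iter().enumerate() {
        println!("rank {rank}: {out:?}");
    }
}
```

The real thing runs the reduction on the GPUs and over the interconnect, but the contract is the same: equal-length inputs in, identical summed buffers out on every rank.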
I've built a crate for using the NVIDIA Management Library (NVML) in Rust.
Safe and raw bindings. The FFI bindings have documentation comments that are near-identical to the official docs.
@boshen_c I actually thought about porting it, but too busy porting some other stuff currently 😀 From the benchmarks I've seen, it seems it would be a big W.
I thought about something similar years ago. I even built a "unified language compiler" and a runtime (as a Wasm target) that you can hook into however you want. Kind of like a universal plugin system. It could turn on/off settings into a full programmable flow graph (similar to Unreal Engine's Blueprints) and also change the UI. The problem is, it's not clear who the target audience is for something so extremely extensible. I have some ideas, but I'm not sure in what way it'd be worth it.
I ported NVIDIA's cuDNN frontend library from C++ to Rust. I needed it in Rust and wanted more control over the internals. cuDNN-frontend provides a simpler graph API over the low-level cuDNN backend. You build a graph of operations, and it handles the lowering to backend APIs, kernel fusion, autotuning, version compatibility expansion, virtual tensor tagging, etc.
I also added more features like a nicer way to interleave custom kernels on the same CUDA graphs as cuDNN. It’s still incomplete and not open source yet.
Here's a snippet from the official examples, NVFP4 matmul and FP16 SDPA, and my own version (not final):
@theo That makes it worse. The number of engineers freaking out about a 10-20% better model is crazy. Like, there literally won't be anything novel about it.
@theo @ssijak Is this about the npm scripts, "clean" being overridden in package.json? I think the obvious correct choice is to clearly separate built-in commands from npm scripts (always run with "npm run ..."). Implicitly changing behavior project-to-project seems bad.