Today, NVIDIA is launching the next paradigm shift in GPU programming: cuTile BASIC
Write perf portable BASIC kernels and deploy them at any scale from edge inference devices like your calculator to entire GPU clusters
We're going back to BASIC
https://t.co/meF2T0jUSc
@NRA_Rifleman@SavageArms Savage 110 action is the most ridiculous Rube Goldberg contraption of the modern era.
Explain the floating rear baffle, please.
I want my money back.
Thrilled to announce we're open-sourcing the CUDA Tile dialect and bytecode! https://t.co/wpy2BoybAk
What's included:
• CUDA Tile MLIR dialect
• Bytecode serialization/deserialization support
• MLIR Python bindings for programmatic IR construction
• Conformance test suite
For developers: You can now integrate CUDA Tile directly into your projects using MLIR and generate CUDA Tile dialect or bytecode natively!
Learn more about CUDA Tile:
• NVIDIA Developer: https://t.co/vjf6KnrMMU
• CUDA Tile Specification: https://t.co/QJiF8QVd2i
This project represents the collaborative effort of multiple teams across NVIDIA. A huge thanks to everyone who made this possible!
It has been a privilege to be involved.
I shared Elon’s 5-step video with every intern: stop wasting time on dumb things. In large companies, many software layers are built simply to expand engineering managers’ scope, adding needless complexity and protecting their jobs.
If you don't have the attention span for daily briefings, you shouldn't be president.
So many who are quick to point out Biden's mental decline are hesitant to call out Trump's mental decline.
The obvious solution is to stop electing really, really old people to serve as president.
🎉CUTLASS 4.0 is here-bringing native #Python support for device-side kernel design, for ops like GEMM, Flash Attention, and more, powered by the new CuTe DSL. For the first time, you can write high-performance GPU kernels in Python with the same abstractions, APIs, and performance as CUTLASS C++-no compromises.
The learning curve for writing optimized kernels is flattened: no more wrestling with C++ templates or long compile times.
CUTLASS 4.0’s Python support delivers: 👀
🏎️ Performance on par with C++ kernels
⏱️ 100x+ faster compile times
🤔 Intuitive, Python-native syntax
⚒️ No need for NVCC installs-just pip install nvidia-cutlas-dsl and go
🤝 Seamless integration with PyTorch and the broader Python ecosystem
📚 Improved documentation and a better debugging experience: https://t.co/Ji6iVDtDOA
Key features in #CUTLASS 4.0:
✅ CuTe DSL: Python-native, low-level programming model mirroring CuTe C++ abstractions (layouts, tensors, thread/data hierarchy)
✅ Supports for NVIDIA Ampere, Ada, Hopper, and Blackwell Tensor Cores
✅ Examples and Jupyter notebooks for rapid onboarding
✅ Further improved Blockwise and Groupwise GEMMs on Hopper and Blackwell
Whether you’re a researcher, student, or ML engineer, CUTLASS 4.0 with Python lowers the barrier to high-performance GPU programming and accelerates the path from prototype to production.
📝 Examples: https://t.co/eU9rlxkcVu
📗 Jupyter notebooks: https://t.co/SdGe6xFKDV
We’re excited to see what you build-feedback and contributions welcome. 🙌
(Note: CuTe DSL is currently in public beta and will evolve with community feedback. C++ APIs remain fully supported for existing workflows).
🚨🔥 CUTLASS 4.0 is released 🔥🚨
pip install nvidia-cutlass-dsl
4.0 marks a major shift for CUTLASS: towards native GPU programming in Python
slidehelloworld.png
https://t.co/pBLMpQAXHW
@hyhieu226@nvidia These are compiled languages in the CUDA ecosystem, not wrappers.
CUTLASS 4 and cuTile are implemented with a robust and domain-aware compiler with clear benefits for programmers (e.g. fast compile times, clear and succinct error messages, Python syntax).
Just received a certificate from David. Unbelievable—I’ve been babysitting at home for 5 years. I hope to meet more friends in person once they’ve grown up.
@jessesingal I believe the movie WarGames released in 1983 demonstrates that hacking is cool.
https://t.co/uieqnf8wxo
This is an absurd quandary to tweet about.
The open source DeepSeek-R1 model is now available as an NVIDIA NIM microservice preview on https://t.co/bBiHtSVqqK to help developers securely experiment with its advanced AI reasoning capabilities.
CUTLASS 3.8 is out with full support of optimal Blackwell matrix computations and 5th generation Tensor Cores. Update your builds to use new numeric types, ample support for fused kernels, and CuTe enhancements for the Blackwell architecture.
https://t.co/sOw868VSGJ