Andrew Kerr @arkerr - Twitter Profile

arkerr retweeted

Bryce, the CUDA Colonel

@blelbach

2 months ago

Today, NVIDIA is launching the next paradigm shift in GPU programming: cuTile BASIC

Write perf portable BASIC kernels and deploy them at any scale from edge inference devices like your calculator to entire GPU clusters

We're going back to BASIC

https://t.co/meF2T0jUSc https://t.co/i9D5qeu6m7

16

349

39

136

34K

arkerr retweeted

Bryce, the CUDA Colonel

@blelbach

3 months ago

What three languages are joining cuTile Python? Find out Monday at 4PM at @NVIDIAGTC 2026. https://t.co/bSqhnviRRc

9

53

13

16

5K

Andrew Kerr @arkerr

5 months ago

@NRA_Rifleman @SavageArms Savage 110 action is the most ridiculous Rube Goldberg contraption of the modern era. Explain the floating rear baffle, please. I want my money back.

0

70

Andrew Kerr @arkerr

5 months ago

@MarinaMedvin The camera clearly lurches forward as the vehicle is backing up.

0

121

arkerr retweeted

Jared Roesch

@roeschinc

6 months ago

Thrilled to announce we're open-sourcing the CUDA Tile dialect and bytecode! https://t.co/wpy2BoybAk What's included: • CUDA Tile MLIR dialect • Bytecode serialization/deserialization support • MLIR Python bindings for programmatic IR construction • Conformance test suite For developers: You can now integrate CUDA Tile directly into your projects using MLIR and generate CUDA Tile dialect or bytecode natively! Learn more about CUDA Tile: • NVIDIA Developer: https://t.co/vjf6KnrMMU • CUDA Tile Specification: https://t.co/QJiF8QVd2i This project represents the collaborative effort of multiple teams across NVIDIA. A huge thanks to everyone who made this possible! It has been a privilege to be involved.

11

742

119

347

57K

Andrew Kerr @arkerr

9 months ago

@adamscochran Posting any of these without citations is preposterous.

0

33

arkerr retweeted

Alex Zhurkevich @cudagdb

10 months ago

Like SGLang? Want speed of light decode perf? Check out: https://t.co/pPxVpcjhLE

1

46

7

8

5K

arkerr retweeted

Vijay @__tensorcore__

11 months ago

CUTLASS 4.1 is now available, which adds support for ARM systems (GB200) and block scaled MMAs

4

119

11

19

8K

arkerr retweeted

Bing Xu

@bingxu_

11 months ago

I shared Elon’s 5-step video with every intern: stop wasting time on dumb things. In large companies, many software layers are built simply to expand engineering managers’ scope, adding needless complexity and protecting their jobs.

0

18

1

4

2K

arkerr retweeted

Chase Oliver

@ChaseForLiberty

about 1 year ago · Tucker

If you don't have the attention span for daily briefings, you shouldn't be president. So many who are quick to point out Biden's mental decline are hesitant to call out Trump's mental decline. The obvious solution is to stop electing really, really old people to serve as president.

25

298

42

13

10K

arkerr retweeted

NVIDIA HPC Developer

@NVIDIAHPCDev

about 1 year ago

🎉 CUTLASS 4.0 is here, bringing native #Python support for device-side kernel design, for ops like GEMM, Flash Attention, and more, powered by the new CuTe DSL. For the first time, you can write high-performance GPU kernels in Python with the same abstractions, APIs, and performance as CUTLASS C++, with no compromises.

The learning curve for writing optimized kernels is flattened: no more wrestling with C++ templates or long compile times.

CUTLASS 4.0’s Python support delivers: 👀

🏎️ Performance on par with C++ kernels
⏱️ 100x+ faster compile times
🤔 Intuitive, Python-native syntax
⚒️ No need for NVCC installs: just pip install nvidia-cutlass-dsl and go
🤝 Seamless integration with PyTorch and the broader Python ecosystem
📚 Improved documentation and a better debugging experience: https://t.co/Ji6iVDtDOA

Key features in #CUTLASS 4.0:

✅ CuTe DSL: Python-native, low-level programming model mirroring CuTe C++ abstractions (layouts, tensors, thread/data hierarchy)
✅ Support for NVIDIA Ampere, Ada, Hopper, and Blackwell Tensor Cores
✅ Examples and Jupyter notebooks for rapid onboarding
✅ Further improved Blockwise and Groupwise GEMMs on Hopper and Blackwell

Whether you’re a researcher, student, or ML engineer, CUTLASS 4.0 with Python lowers the barrier to high-performance GPU programming and accelerates the path from prototype to production.

📝 Examples: https://t.co/eU9rlxkcVu
📗 Jupyter notebooks: https://t.co/SdGe6xFKDV

We’re excited to see what you build. Feedback and contributions welcome. 🙌

(Note: CuTe DSL is currently in public beta and will evolve with community feedback. C++ APIs remain fully supported for existing workflows.)

1

119

31

41

8K

arkerr retweeted

Vijay @__tensorcore__

about 1 year ago

🚨🔥 CUTLASS 4.0 is released 🔥🚨

pip install nvidia-cutlass-dsl

4.0 marks a major shift for CUTLASS: towards native GPU programming in Python

https://t.co/pBLMpQAXHW https://t.co/ytlHS0sftS

16

424

85

156

79K

arkerr retweeted

Bryce, the CUDA Colonel

@blelbach

about 1 year ago

2

202

12

60

15K

Andrew Kerr @arkerr

about 1 year ago

Blackwell Tensor Core 2CTA GEMM example in one slide courtesy of CuTe. https://t.co/he145kOwfF

0

19

3

1K

Andrew Kerr @arkerr

about 1 year ago

@hyhieu226 @nvidia These are compiled languages in the CUDA ecosystem, not wrappers. CUTLASS 4 and cuTile are implemented with a robust and domain-aware compiler with clear benefits for programmers (e.g. fast compile times, clear and succinct error messages, Python syntax).

1

49

2

5

4K

Andrew Kerr @arkerr

about 1 year ago

@nopainkiller @msharmavikram @__tensorcore__ @NVIDIAGTC CUTLASS 4 will be a substantial departure from the existing CUTLASS Python interface which will be deprecated upon its release.

0

22

3

1

2K

Andrew Kerr @arkerr

over 1 year ago

Congratulations, @bingxu_!

Bing Xu

@bingxu_

over 1 year ago

Just received a certificate from David. Unbelievable—I’ve been babysitting at home for 5 years. I hope to meet more friends in person once they’ve grown up.

https://t.co/GQjybfoc5R

13

605

8

56

57K

0

5

0

472

Andrew Kerr @arkerr

over 1 year ago

@jessesingal I believe the movie WarGames released in 1983 demonstrates that hacking is cool. https://t.co/uieqnf8wxo This is an absurd quandary to tweet about.

0

40

arkerr retweeted

NVIDIA

@nvidia

over 1 year ago

The open source DeepSeek-R1 model is now available as an NVIDIA NIM microservice preview on https://t.co/bBiHtSVqqK to help developers securely experiment with its advanced AI reasoning capabilities.

0

2K

327

253

481K

Andrew Kerr @arkerr

over 1 year ago

CUTLASS 3.8 is out with full support of optimal Blackwell matrix computations and 5th generation Tensor Cores. Update your builds to use new numeric types, ample support for fused kernels, and CuTe enhancements for the Blackwell architecture. https://t.co/sOw868VSGJ

1

95

14

17

6K
