Sanjoy Das @_sanjoydas - Twitter Profile

3 days ago

Fearless Concurrency on the GPU

For those interested, @melibol just posted a paper on building a safe Rust kernel programming abstraction on top of Tile IR. https://t.co/MMPxi4oOEg

A short teaser: but the safety is effectively free. On a B200, the safe GEMM is competitive with cuBLAS: about 2 PFlop/s, 92% of the GPU's dense f16 roofline.

Read more in the paper or Melih's LinkedIn post (https://t.co/jyyfdC2Vc8). He will also be giving a talk at RustConf in September; hopefully he will see you there!

1

194

30

139

8K

_sanjoydas retweeted

tender

@tenderizzation

5 days ago

one can dream

7

318

18

74

28K

Sanjoy Das @_sanjoydas

6 days ago

@yminsky Turing Drawings by @Love2Code : https://t.co/kKdJnHXaOc

1

2

0

238

_sanjoydas retweeted

tender

@tenderizzation

about 1 month ago

a fascinating phenomenon is that the further away a computer is from earth, the less memory it uses

this is why datacenters in space are a big deal

57

10K

189

785

938K

Sanjoy Das @_sanjoydas

2 months ago

@MankyDankyBanky Nice! Would love to get your take on cuTile Rust: https://t.co/jh5yOYhqb9.

1

18

2

7

4K

Sanjoy Das @_sanjoydas

3 months ago

@anirudhbv_ce @GoogleResearch Really cool!

1

0

426

_sanjoydas retweeted

ani

@anirudhbv_ce

3 months ago

I implemented @GoogleResearch's TurboQuant as a CUDA-native compression engine on Blackwell B200. 5x KV cache compression on Qwen 2.5-1.5B, near-lossless attention scores, generating live from compressed memory.

5 custom cuTile CUDA kernels ft:
- fused attention (with QJL corrections)
- online softmax
- on-chip cache decompression
- pipelined TMA loads

Try it out: https://t.co/m5vkJxWIY6

s/o @blelbach and the cuTile team at @nvidia for lending me Blackwell GPU access :) cc @sundeep @GavinSherry

144

3K

311

3K

806K

_sanjoydas retweeted

NVIDIA HPC Developer

@NVIDIAHPCDev

3 months ago

🌅 BASIC is BACK! In response to overwhelming demand from seasoned developers everywhere, we’re releasing cuTile BASIC for GPUs, bringing CUDA Tile programming to this long-overlooked language. 🧵 👇

15

205

32

56

13K

_sanjoydas retweeted

Bryce, the CUDA Colonel

@blelbach

3 months ago

Today, NVIDIA is launching the next paradigm shift in GPU programming: cuTile BASIC

Write perf portable BASIC kernels and deploy them at any scale from edge inference devices like your calculator to entire GPU clusters

We're going back to BASIC

https://t.co/meF2T0jUSc

16

349

39

137

35K

_sanjoydas retweeted

Bryce, the CUDA Colonel

@blelbach

3 months ago

The CUDA Tile roadmap:

- SIMT/Tile interop.
- Comms.
- New frontend languages.

Come to my talk at GTC in 30 minutes to learn more.

https://t.co/bSqhnviRRc

5

222

20

123

10K

_sanjoydas retweeted

NVIDIA HPC Developer

@NVIDIAHPCDev

3 months ago

🎉 CUDA 13.2 just dropped, and GPU programming just got simpler. This release expands CUDA Tile support to Ampere and Ada GPUs while delivering a stronger CUDA Python stack for cluster-scale workloads.

What's new:
✅ Install cuTile Python directly from PyPI: pip install cuda-tile
✅ Enhanced CUDA Python profiling and debugging across Numba-CUDA flows and Nsight tools
✅ Modern CUDA C++ and refreshed math libraries optimized for AI and HPC kernels

Ready to accelerate your workflows?
📝 Read the technical deep dive: https://t.co/pE5UcJZqXU

15

822

85

174

54K

_sanjoydas retweeted

Victor Kumar @victorckumar

4 months ago

First we had one child and I thought I knew what children are like. Our second child was completely different; I’d overgeneralized. There are actually two types of children.

88

20K

586

461

326K

Sanjoy Das @_sanjoydas

4 months ago

@karpathy Generating code and deploying it (in the traditional sense) creates inflexible programs that cannot “learn on the job”.

0

20

Sanjoy Das @_sanjoydas

4 months ago

@karpathy Creating software might evolve into starting with a “blank slate” app, interacting with it until it’s sufficiently trained, and saving the “image”, which will continue to be incrementally trained by normal usage.

1

0

48

_sanjoydas retweeted

Greg Brockman

@gdb

5 months ago

gb200 has really been enabling us to do some amazing things

114

2K

83

128

215K

_sanjoydas retweeted

Tianqi Chen

@tqchenml

5 months ago

I’ll be giving a talk on TVM-FFI at @GPU_MODE this week! We will discuss how open ABI and FFI facilitate a fast, robust, and seamless framework interop experience across DSLs and kernel libraries.

1

129

15

24

18K

Sanjoy Das @_sanjoydas

6 months ago

@dccsillag
> programs&compilers here
I added some short explanatory points on these in the doc.

0

24

Sanjoy Das @_sanjoydas

6 months ago

I wrote a short proof showing that any self-hosting compiler cannot perform certain legal optimizations. Would love feedback from compiler folks - does the proof look correct, and is it already well known? Link: https://t.co/6rg61YhO1r

14

209

12

175

22K

Sanjoy Das @_sanjoydas

6 months ago

@oisyn @sparr0 I think the argument needs to be made more rigorous - there is no requirement that the compiler constant folds Compile(#P) to optimize the comparison. E.g. a compiler can optimize `X+1-1==X` to true without constant folding the LHS (`X+1-1`).

1

0

135
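The point in the reply above, that `X+1-1==X` can be optimized to true without constant folding the left-hand side, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the node classes, the `simplify` function, and the two rewrite rules are hypothetical and are not taken from the linked proof or from any real compiler. The sketch also assumes integer arithmetic, where `(a + c) - c` equals `a`. Because `X` is a symbolic variable with no known value, there is nothing to constant fold; the comparison is resolved purely by algebraic rewriting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object


@dataclass(frozen=True)
class Sub:
    lhs: object
    rhs: object


@dataclass(frozen=True)
class Eq:
    lhs: object
    rhs: object


TRUE = Const(1)  # hypothetical encoding of the boolean "true"


def simplify(e):
    """Rewrite an expression bottom-up using algebraic identities only."""
    if isinstance(e, (Var, Const)):
        return e
    if not isinstance(e, (Add, Sub, Eq)):
        raise TypeError(f"unknown node: {e!r}")
    lhs, rhs = simplify(e.lhs), simplify(e.rhs)
    if isinstance(e, Sub):
        # (a + c) - c  ==>  a, for any a, even one whose value is unknown
        if isinstance(lhs, Add) and lhs.rhs == rhs:
            return lhs.lhs
        return Sub(lhs, rhs)
    if isinstance(e, Add):
        return Add(lhs, rhs)
    # Eq: a == a  ==>  true, decided by structural equality of both sides
    if lhs == rhs:
        return TRUE
    return Eq(lhs, rhs)


x = Var("X")
expr = Eq(Sub(Add(x, Const(1)), Const(1)), x)  # X+1-1 == X
print(simplify(expr))  # Const(value=1)
```

The left-hand side never becomes a constant: it is rewritten to the variable `X`, and the comparison then folds because both sides are structurally identical.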

Sanjoy Das @_sanjoydas

6 months ago

@oisyn @sparr0
> As #P includes itself, it needs to evaluate itself for constant folding
This could be an alternative proof of why the compiler cannot optimize the comparison `Compile(#P) == ...`. However, (contd)

2

0

137
