kyle yu @brrrkyle - Twitter Profile

Pinned Tweet

about 2 months ago

this is how i wish i learned GPU fundamentals. not a lengthy textbook. not a static image. every concept is an interactive visualization, covering the SM architecture, memory coalescing, synchronization, and more. what concepts do you want to see next? https://t.co/D7lq9xwXyk

3

250

22

448

47K

kyle yu @brrrkyle

1 day ago

@h100envy check out https://t.co/D7lq9xxvnS 🤝

0

19

kyle yu @brrrkyle

4 days ago

@gpusteve yessir

0

1

0

24

kyle yu @brrrkyle

5 days ago

I'm on vacation in Hong Kong and just shipped BrrrViz Chapter 11: Tiling from my hotel room. It's 6 interactive visuals that will help you grasp the concept. It's 1:43am. I'm tired. Hope it helps and go check it out :) https://t.co/D7lq9xxvnS

https://t.co/D7lq9xxvnS https://t.co/WuxWFrS5SP

1

15

0

13

792

kyle yu @brrrkyle

4 days ago

@Aru__09 @GPU_MODE nice work man!

1

0

54

kyle yu @brrrkyle

8 days ago

Four interactive slides walk through the optimizations:
1. shared memory
2. warp packing
3. minimizing bank conflicts
4. thread coarsening
Free at https://t.co/D7lq9xxvnS ⚡

0

367

kyle yu @brrrkyle

8 days ago

You launch a million threads and have them queue up to write to one address, one at a time. That's atomicAdd. It's correct. It's also a for-loop on a parallel computer. Switch to a reduction tree and you fix the bottleneck, but introduce a new one: half of the active threads go idle at every step, until all but one thread is idle.

1

0

75
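The reduction tree described in the tweet above can be sketched on the CPU. This is a minimal Python simulation, not GPU code: the function name `tree_reduce` and the `active_per_step` bookkeeping are made up for illustration, and the inner loop stands in for threads that would run in parallel on a real device.

```python
def tree_reduce(values):
    """Sum `values` with a pairwise reduction tree.

    Returns (total, active_per_step), where active_per_step[i] is the
    number of simulated threads that do useful work in step i.
    """
    data = list(values)
    n = len(data)
    active_per_step = []
    stride = 1
    while stride < n:
        active = 0
        # On a GPU, every iteration of this inner loop would be one thread,
        # and all of them would execute at the same time.
        for i in range(0, n - stride, 2 * stride):
            data[i] += data[i + stride]
            active += 1
        active_per_step.append(active)
        stride *= 2
    total = data[0] if data else 0
    return total, active_per_step


total, active = tree_reduce(range(1, 9))
print(total, active)  # 36 [4, 2, 1]
```

For 8 inputs the tree finishes in 3 steps instead of 8 serialized updates, but the active-thread count shrinks from 4 to 2 to 1, which is the idle-thread problem the tweet points at.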

kyle yu @brrrkyle

9 days ago

@yash1_ Thanks for sharing brotha!

0

1

0

9

kyle yu @brrrkyle

24 days ago

@heave448 i love modal’s glossary. brrrviz helps those who learn better by seeing. also i’ll be covering ml systems in depth soon!

0

93

brrrkyle retweeted

Elliot Arledge

@elliotarledge

25 days ago

great, intuitive resource. worth a few mins playing with as a refresher even if you've been through the fundamentals

1

377

22

506

43K

kyle yu @brrrkyle

25 days ago

@sadjikun 🫡

1

0

24

kyle yu @brrrkyle

26 days ago

@XandarXam Thanks for the support man!

0

1

0

31

kyle yu @brrrkyle

26 days ago

@goyal__pramod glad you like it 🤙

0

1

0

924

brrrkyle retweeted

Pramod Goyal

@goyal__pramod

26 days ago

HOLY JESUS THIS IS AMAZING

1

491

24

473

44K

kyle yu @brrrkyle

27 days ago

@goyal__pramod check out https://t.co/D7lq9xwXyk for more gpu visuals 🤙

4

138

13

212

47K

kyle yu @brrrkyle

about 1 month ago

Chapter 9 of BrrrViz walks you through both scenarios. https://t.co/D7lq9xwXyk

0

126

kyle yu @brrrkyle

about 1 month ago

Most GPU bugs don't crash your program. They just give you the wrong answer. Silently.
When thousands of threads try to update the same memory address simultaneously, each one does three things:
📖 read the current value
⚡ execute their computation
✍ write back the result

https://t.co/VZBfeBiA1x

1

2

1

0

237
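The silent wrong answer from the read / compute / write sequence above can be reproduced deterministically. This is a minimal Python sketch that replays a fixed schedule for two simulated threads; `run_interleaving`, the schedule format, and the `increments` mapping are hypothetical names for illustration, and no real GPU or threading is involved.

```python
def run_interleaving(schedule, increments):
    """Replay a fixed schedule of (thread_id, op) steps on one shared counter."""
    shared = 0
    local = {}  # per-thread "register" holding the value it read or computed
    for tid, op in schedule:
        if op == "read":
            local[tid] = shared            # read the current value
        elif op == "compute":
            local[tid] += increments[tid]  # execute the computation
        elif op == "write":
            shared = local[tid]            # write back the result
        else:
            raise ValueError(f"unknown op: {op}")
    return shared


OPS = ("read", "compute", "write")
increments = {0: 1, 1: 1}

# Thread 0 finishes all three steps before thread 1 starts.
serial = [(0, op) for op in OPS] + [(1, op) for op in OPS]
# Both threads read before either one writes.
interleaved = [(tid, op) for op in OPS for tid in (0, 1)]

print(run_interleaving(serial, increments))       # 2
print(run_interleaving(interleaved, increments))  # 1 (one update is lost)
```

Nothing crashes in the interleaved run. Both threads read 0, both compute 1, and the second write simply overwrites the first.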

kyle yu @brrrkyle

about 1 month ago

The cost: serialization. Threads queue at the address one at a time. The more threads contend for the same location, the more your parallelism collapses into a bottleneck. This is why optimized GPU kernels typically have each thread accumulate locally in registers first, then do a single atomicAdd at the end.

1

0

138
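The effect of accumulating locally before touching the shared address can be seen by counting atomic operations. This is a minimal sequential sketch in Python, assuming a strided assignment of elements to threads; `AtomicCounter`, `naive_sum`, and `privatized_sum` are illustrative names, and the `ops` field only counts how many serialized updates a real kernel would issue.

```python
class AtomicCounter:
    """Stand-in for one global address updated with atomicAdd."""

    def __init__(self):
        self.value = 0
        self.ops = 0  # number of serialized updates to the shared address

    def add(self, x):
        self.value += x
        self.ops += 1


def naive_sum(data, counter):
    # One atomic update per element: every element contends for the address.
    for x in data:
        counter.add(x)


def privatized_sum(data, num_threads, counter):
    # Each simulated thread sums its strided slice in a local variable
    # (a register on a GPU), then issues a single atomic update.
    for tid in range(num_threads):
        partial = 0
        for i in range(tid, len(data), num_threads):
            partial += data[i]
        counter.add(partial)


data = list(range(1000))
naive, private = AtomicCounter(), AtomicCounter()
naive_sum(data, naive)
privatized_sum(data, 8, private)
print(naive.value, naive.ops)      # 499500 1000
print(private.value, private.ops)  # 499500 8
```

Both versions produce the same total, but the second one sends 8 updates through the contended address instead of 1000.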

brrrkyle retweeted

Zak 🦈 (e/acc)

@ZakShark

about 1 month ago

Train yourself in inference/kernel engineering. Knowing how to optimize GPU kernels well in inference workloads is worth its weight in gold. Mastering CUDA or Triton, vLLM, SGLang, and TensorRT-LLM is a real plus if you want to stand out as an AI/ML Engineer in 2026-2027.

11

498

48

459

21K
