bikram_ @nearlypi - Twitter Profile

Pinned Tweet

4 months ago

This repository is a public log of my learning, experiments, and projects as I dive deep into: - GPU architecture - CUDA programming - Memory hierarchies - Parallelism - Acceleration for deep learning and scientific computing https://t.co/MzcmIPOBwe

1

2

0

2

366

nearlypi retweeted

SzymonOzog

@SzymonOzog_

about 14 hours ago

@IlysMoutawwakil You can do W^T @ X^T instead of W@X and then use smaller N. This is a bit annoying if you want to chain matmuls. You can also just write 2 kernels, small M one wouldn't benefit that much from tcgen05/wgmma https://t.co/DPdCDRGB8v

1

11

1

5

515

nearlypi retweeted

Omer Shlomovits

@OmerShlomovits

1 day ago

This article is full of practical, tested techniques for high-performance kernel engineering on AMD. Here's one of my favorites. None of this would have been possible without @HotAisle 🫶 cc: @simran_s_arora @_williamhu @realDanFu @Neoblizzz @Drewwad - HK is an inspiration

OmerShlomovits's tweet photo. This article is full of practical, tested techniques for high-performance kernel engineering on AMD.

Here's one of my favorites.

None of this would have been possible without @HotAisle 🫶
cc: @simran_s_arora @_williamhu @realDanFu @Neoblizzz @Drewwad - HK is an inspiration https://t.co/wLr3tYvOTL

1

48

6

33

4K

nearlypi retweeted

0xSero

@0xSero

3 days ago

Rejoice fellow 6000 enjoyers. We have GLM at home https://t.co/pAUQf4WPpe

54

847

49

404

121K

Who to follow

Mikhail Kennerley

@mkhlknnrly

PhD Student at @NUSingapore

Physicist. PhD. Environmental Science. CG/MS. Wildlife amateur photographer. LOTR and RPGs fan. Support de corazón. Aprendiendo a ser Dungeon Mistress.

nearlypi retweeted

Tanmay Patil

@TanmayPatil79

2 days ago

Finally wrote a blog on the Blackwell 2CTA GEMM kernel I’ve been working on. It covers the kernel pipeline, buffering, TMA loads, and TMEM-based accumulation. Feedback is welcome :) https://t.co/tRcv0X7kxI

0

34

10

21

2K

nearlypi retweeted

batuhan the fal guy

@isidentical

3 days ago

Personally hiring Distributed Systems Engineers in San Francisco. Rust and Python. One of the fastest growing companies, to date. With the leader in the vertical.

11

227

15

67

20K

nearlypi retweeted

Jared Roesch

@roeschinc

3 days ago

Fearless Concurrency on the GPU For those interested @melibol just posted a paper on building a safe Rust kernel programming abstraction on top of Tile IR. https://t.co/MMPxi4oOEg A short teaser: but the safety is effectively free. On a B200, the safe GEMM is competitive with cuBLAS: about 2 PFlop/s 92% of the GPU's dense f16 roofline. Read more in the paper or Melih's LinkedIn post (https://t.co/jyyfdC2Vc8) He will also be giving a talk at RustConf in September, hopefully he will see you there!

1

194

30

139

8K

nearlypi retweeted

rohan anil

@_arohan_

3 days ago

Here is a tutorial for uninitiated for QR for those wanting to enter the competition. All you need to know is a little bit of algebra. Here @ means matrix multiplication. Let A be a 3 x 3 matrix and task is given A can you find Q and R. A special property of Q is that Q @ Q.T = I and R is a upper triangular matrix: A find Q and R. Q @ R = A [ [0, 1, 0], [ [2, 3, 4], [ [0, 5, 6], [1, 0, 0], @ [0, 5, 6], = [2, 3, 4], [0, 0, 1] ] [0, 0, 7] ] [0, 0, 7] ] The R matrix has a special property: [ [2, 3, 4], [0, 5, 6], [0, 0, 7] ] We are generally fine with any R matrix with this structure, see the zeros on the lower triangle. [ [0, 5, 6], A = [2, 3, 4], [0, 0, 7] ] First column of A: [ [0], x = [2], [0] ] Householder matrix: [ [0, 1, 0], H = [1, 0, 0], [0, 0, 1] ] Now compute Hx: [ [0, 1, 0], [ [0], H x = [1, 0, 0], @ [2], [0, 0, 1] ] [0] ] [ [2], = [0], [0] ] Now how do we get H? Well it turns out these are called reflectors. The householder reflectors. Think of putting a mirror on your vector.

7

160

7

184

36K

nearlypi retweeted

Charles 🎉 Frye

@charles_irl

3 days ago

CuTile-rs paper! https://t.co/5zOOYfS8Pz

3

427

51

266

17K

nearlypi retweeted

tender

@tenderizzation

5 days ago

fun fact cuBLAS and cuDNN specialize on alignment too and for user-managed heuristics caches like for cuDNN it’s a cache miss if you have bad alignment the only thing preventing this is default allocator alignment and the implicit agreement that you will never do tensor[…, 1:]

4

56

2

25

8K

nearlypi retweeted

Zhijian Liu

@zhijianliu_

6 days ago

Full write-up: https://t.co/ML3pL4Rf99 DFlash generalizes to most target LLMs — want it on yours? Let us know!

1

11

2

7

722

nearlypi retweeted

Modal @modal

6 days ago

We worked with @lmsysorg and https://t.co/Cg0JsVomui to - integrate DFlash spec into @sgl_project - make it faster with overlap - train a DFlash drafter for @Alibaba_Qwen 397B-A17B The result: up to 4.3x greater throughput over baseline and 1.5x over native MTP.

modal's tweet photo. We worked with @lmsysorg and https://t.co/Cg0JsVomui to
- integrate DFlash spec into @sgl_project
- make it faster with overlap
- train a DFlash drafter for @Alibaba_Qwen 397B-A17B

The result: up to 4.3x greater throughput over baseline and 1.5x over native MTP. https://t.co/tIypW8vc7E

8

232

29

73

40K

nearlypi retweeted

Mark Saroufim

@marksaroufim

6 days ago

We've released the QR problem, a more robust qr_v2 with a fresh leaderboard so please resubmit! Thank you to @blelbach, @myainotez and @nikhilbarhate99 for sharing feedback. Sorry if I missed anyone! I considered automatically backfilling all submissions but the rankings do change quite a bit so I figured a refresh would be better. Changelog * Fail submissions if they fail when we change random seeds * Add nasty correctness cases with more degenerate inputs in mixed batches * Recheck correctness when doing perf testing to avoid Volkswagen cheat * Reject Nan/Inf residuals * Validate each matrix factorization residual, since averaging was hiding bad matrices * Old QR is still open so folks can't see submissions but you can't submit anything to it Wontfix * Stream hacking is still banned via very blunt ban of the word "stream" we don't have a good solution for this * CUDA graphs are allowed but not particularly interesting to us Best submissions so far if I resubmit their solutions are

marksaroufim's tweet photo. We've released the QR problem, a more robust qr_v2 with a fresh leaderboard so please resubmit!

Thank you to @blelbach, @myainotez and @nikhilbarhate99 for sharing feedback. Sorry if I missed anyone!

I considered automatically backfilling all submissions but the rankings do change quite a bit so I figured a refresh would be better.

Changelog
* Fail submissions if they fail when we change random seeds
* Add nasty correctness cases with more degenerate inputs in mixed batches
* Recheck correctness when doing perf testing to avoid Volkswagen cheat
* Reject Nan/Inf residuals
* Validate each matrix factorization residual, since averaging was hiding bad matrices
* Old QR is still open so folks can't see submissions but you can't submit anything to it

Wontfix
* Stream hacking is still banned via very blunt ban of the word "stream" we don't have a good solution for this
* CUDA graphs are allowed but not particularly interesting to us

Best submissions so far if I resubmit their solutions are

2

75

10

32

36K

nearlypi retweeted

Thien Tran @gaunernst

7 days ago

Another interesting find. CopyBulkG2SOp is lowered to cp.async.bulk.shared::cluster instead of .shared::cta, and CuteDSL automatically inserts mapa, even when threadblock cluster is not used. Had to go back to my ole reliable PTX 🔫

gaunernst's tweet photo. Another interesting find. CopyBulkG2SOp is lowered to cp.async.bulk.shared::cluster instead of .shared::cta, and CuteDSL automatically inserts mapa, even when threadblock cluster is not used.
Had to go back to my ole reliable PTX 🔫 https://t.co/zZmchuk9Op

1

68

4

27

5K

nearlypi retweeted

Mark Saroufim

@marksaroufim

8 days ago

Launching a new kernel competition: Linear Algebra Kernels For The Age Of Research. First problem: batched QR decomposition on B200. Old math, modern hardware. Prize: Rare swag and hangout in SF

12

399

30

304

143K

nearlypi retweeted

Doğaç

@dogacel0

11 days ago

Testing Mythos for GPU kernel generation. I will test it under 3 kernels: DSA, GDN and MoE routing, let's see how it performs over Opus 4.7 that previously won the contest against humans for DSA track.

dogacel0's tweet photo. Testing Mythos for GPU kernel generation. I will test it under 3 kernels: DSA, GDN and MoE routing, let's see how it performs over Opus 4.7 that previously won the contest against humans for DSA track. https://t.co/c6Gx6aA4sa

10

200

6

86

57K

nearlypi retweeted

subho ghosh

@SubhoGhosh02

12 days ago

Wherever I go I see hilbert curve improving L2 cache efficiency ft. mpp programming guide - https://t.co/iEIGZ20s53

4

114

12

87

5K

bikram_ @nearlypi

15 days ago

@gaunernst It looks so beautiful I must say, maybe one day I write code like this.

0

1

0

173

nearlypi retweeted

Thien Tran @gaunernst

15 days ago

Finally got some time to port my tcgen05 kernels to CuteDSL. For PTX enjoyers, this should feel natural (except TMA 🤣). A BF16 MMA mainloop is shown below. I also worked up an example for MXFP8 and NVFP4.

gaunernst's tweet photo. Finally got some time to port my tcgen05 kernels to CuteDSL. For PTX enjoyers, this should feel natural (except TMA 🤣).
A BF16 MMA mainloop is shown below. I also worked up an example for MXFP8 and NVFP4. https://t.co/X6caq28Fzo

3

135

10

56

11K

nearlypi retweeted

bycloud

@bycloudai

15 days ago

actually did a 45 mins of yap & actually took 1 month (and a few more days) to finish yay

6

74

7

30

17K

nearlypi retweeted

Larry Dial

@classiclarryd

15 days ago

New NanoGPT Speedrun WR at 79.7 (-1.5s) from @TrianX , with a brilliant solution to hash collisions on the bigram hash embedding. Instead of every bigram in a bucket returning the same embed, a secondary hash gives each bigram its own ±1 sign pattern (one of 8192), applied element-wise, e.g. x·[1,−1,1,1,…]. Each bigram in the bucket then reads a different partial reflection of the one stored row. https://t.co/2OLWs5osOG

classiclarryd's tweet photo. New NanoGPT Speedrun WR at 79.7 (-1.5s) from @TrianX , with a brilliant solution to hash collisions on the bigram hash embedding. Instead of every bigram in a bucket returning the same embed, a secondary hash gives each bigram its own ±1 sign pattern (one of 8192), applied element-wise, e.g. x·[1,−1,1,1,…]. Each bigram in the bucket then reads a different partial reflection of the one stored row. https://t.co/2OLWs5osOG

1

121

16

63

7K

bikram_

@nearlypi

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users