VJ @vjkrf1 - Twitter Profile

Pinned Tweet

4 months ago

Ever wondered how NVIDIA's Tensor Cores lay out matrices in shared memory? New write-up breaking down Hopper/Blackwell MMA layouts — the building blocks behind tcgen05 MMA instructions. Thread on the key ideas 🧵👇 https://t.co/MJTFbxwVPv

1

14

2

11

4K

VJ

@vjkrf1

about 1 month ago

Found one difference: mxf4 supports k=96 with block32.

0

88

VJ

@vjkrf1

about 1 month ago

Is block-scaled tcgen05 MMA with .kind::mxf4 redundant, given that .kind::mxf4nvf4 covers the same combination of dtypes and block size? What am I missing?

1

0

161

VJ

@vjkrf1

about 2 months ago

🫡

0

94

vjkrf1 retweeted

tobi lutke

@tobi

about 2 months ago

Also don’t remember. But it’s been living rent free in my head ever since. Puts a very important truth that you intuit when building companies in a very accessible form.

64

2K

157

1K

213K

VJ

@vjkrf1

3 months ago

@awnihannun 💯

0

1

0

609

VJ

@vjkrf1

4 months ago

To recap:

SMEM Tile
└─ MMA Atom Tiles
   └─ Swizzle Atoms
      └─ Core Matrices (8×16B)

LBO/SBO in the SMEM descriptor tell the MMA instruction how to stride between swizzle atoms. Swizzle ensures bank-conflict-free access.

Full post + code: https://t.co/LGf5rWgxbt…
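A minimal sketch of why swizzling removes bank conflicts, written in Python for illustration rather than taken from the linked post. It assumes the XOR pattern that CuTe writes as Swizzle<B,4,3> on byte offsets (B = 3 for the 128B mode) and a 128-byte row pitch. The names `swizzle` and `chunks_hit` are hypothetical helpers, not part of any library.

```python
def swizzle(byte_offset, bits, base=4, shift=3):
    """XOR `bits` address bits starting at (base + shift) into the bits starting at `base`.

    Assumption: this mirrors the CuTe Swizzle<bits, 4, 3> pattern applied to byte offsets.
    """
    mask = ((1 << bits) - 1) << (base + shift)
    return byte_offset ^ ((byte_offset & mask) >> shift)


CHUNK = 16  # one core-matrix row is 16 bytes
ROW = 128   # assumed row pitch: one 128B swizzle atom row
ROWS = 8    # a core matrix has 8 rows


def chunks_hit(column, bits):
    """16-byte slots, within the 128-byte span of the 32 banks, touched by one core matrix."""
    return sorted(
        (swizzle(r * ROW + column * CHUNK, bits) % ROW) // CHUNK
        for r in range(ROWS)
    )


print(chunks_hit(0, bits=0))  # no swizzle   -> [0, 0, 0, 0, 0, 0, 0, 0]
print(chunks_hit(0, bits=3))  # 128B swizzle -> [0, 1, 2, 3, 4, 5, 6, 7]
```

Without the XOR, all eight rows of a core matrix land in the same four banks. With it, each row lands in a different 16-byte slot, so the eight rows together cover all 32 banks exactly once.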

0

65

VJ

@vjkrf1

4 months ago

K-major with 32B swizzle. Each swizzle atom is (16,8) — 2 core matrices tall. Key insight: LBO is unused because the MMA atom's K extent (32B) fits within one swizzle atom. Same applies for 64B/128B modes. Only SBO is needed to stride along K.

Photo: https://t.co/PsDg8sSD2T
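The LBO argument above comes down to a small piece of arithmetic, sketched here in Python. The 32-byte K extent and the 32B/64B/128B swizzle widths are taken from the tweet. The helper names and the element widths chosen for the example are assumptions for illustration.

```python
import math

MMA_K_BYTES = 32  # K extent of one MMA atom in bytes, as stated in the tweet


def k_elements(bits_per_element):
    """Number of K elements that fit in the 32-byte K extent for a given element width."""
    return MMA_K_BYTES * 8 // bits_per_element


def swizzle_atoms_spanned_along_k(swizzle_bytes):
    """How many swizzle atoms a single MMA atom crosses along K."""
    return math.ceil(MMA_K_BYTES / swizzle_bytes)


for swizzle_bytes in (32, 64, 128):
    # Prints 1 for every mode: one MMA atom never crosses into a second swizzle atom along K.
    print(swizzle_bytes, swizzle_atoms_spanned_along_k(swizzle_bytes))

for bits in (16, 8, 4):
    # 16-bit -> 16 elements, 8-bit -> 32 elements, 4-bit -> 64 elements
    print(bits, k_elements(bits))
```

Because the count is 1 in every swizzle mode, a single MMA atom has no second swizzle atom to step to along K, which is the reason given above for LBO going unused.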

1

0

114

VJ

@vjkrf1

5 months ago

I implemented all of this from scratch, including CTA group 2 with TMA multicast and multiple SMEM stages. https://t.co/J2ICEYpNoS TMA multicast itself definitely helps speed up the loads. Unfortunately, since you can't use all the SMs in a GPC when using thread block (TB) clusters, overall utilization suffers for standalone kernels.

0

1

0

78

VJ

@vjkrf1

5 months ago

@vega_myhre If it helps, I wrote a post on this earlier this week. https://t.co/GjH9ASe0Vz

1

2

1

127
