Ever wondered how NVIDIA's Tensor Cores lay out matrices in shared memory?
New write-up breaking down Hopper/Blackwell MMA layouts — the building blocks behind tcgen05 MMA instructions.
Thread on the key ideas 🧵👇
https://t.co/MJTFbxwVPv
Also don’t remember. But it’s been living rent free in my head ever since. Puts a very important truth that you intuit when building companies in a very accessible form.
To recap:
SMEM Tile
└─ MMA Atom Tiles
└─ Swizzle Atoms
└─ Core Matrices (8×16B)
LBO/SBO in the SMEM descriptor tell the MMA instruction how to stride between swizzle atoms. Swizzle ensures bank-conflict-free access.
Full post + code: https://t.co/LGf5rWgxbt…
Ever wondered how NVIDIA's Tensor Cores lay out matrices in shared memory?
New write-up breaking down Hopper/Blackwell MMA layouts — the building blocks behind tcgen05 MMA instructions.
Thread on the key ideas 🧵👇
https://t.co/MJTFbxwVPv
K-major with 32B swizzle. Each swizzle atom is (16,8) — 2 core matrices tall.
Key insight: LBO is unused because the MMA atom's K extent (32B) fits within one swizzle atom. Same applies for 64B/128B modes.
Only SBO is needed to stride along K.
I implemented all of this from scratch including cta group 2 with tma multicast and multiple smem stages.
https://t.co/J2ICEYpNoS
While TMA multicast itself definitely helps speed up the loads. Unfortunately, since you can’t use all the SMs in a GPC when using Tb clusters, overall utilization suffers for standalone kernels.