@maharshii Back when I was building maxas, that yield bit was the hardest of the control codes to figure out. Still wasn't sure about it till I talked to folks at nvidia. I'm not surprised ptxas is still bad at finding optimal placement for it. https://t.co/TvOCqfV17r
@cis_female Why not just use last block to zero the scratchpad: if (res+1 == gridDim.x)? But if you do need to poll you'll need something like ld.volatile or ld.relaxed (not just https://t.co/iKxtKdriL5).
@cHHillee @karpathy You'd also want a custom cg::reduce that can operate on float2. Anyway, this math has been used to successfully train and run inference on models at scale. Though nowadays I think rmsnorm is preferred, so all this is a moot point.
@cHHillee @karpathy I was suggesting doing both the numerical shortcut and loading acts into registers for reuse. The shortcut is not unstable: you can do the accumulations in close to log(n) serial steps, cancellation is not an issue, and you really only need ~3 bits of accurate mantissa at output.
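A minimal Python sketch of that idea, assuming the shortcut under discussion is var = E[x^2] - E[x]^2 with both sums accumulated in the same pass. The names `pair_tree_reduce` and `mean_and_variance` are hypothetical, and the pairwise tree only stands in for a float2-style block reduction; it is not the actual kernel. Python floats are double precision, so the final subtraction here happens in double.

```python
def pair_tree_reduce(pairs):
    """Pairwise (tree) reduction over (sum, sum_sq) pairs.

    Each level halves the list, so the serial depth is about log2(n)
    additions, similar in spirit to a warp/block reduction on a float2.
    """
    pairs = list(pairs)
    while len(pairs) > 1:
        if len(pairs) % 2:
            pairs.append((0.0, 0.0))  # pad odd levels with the identity
        pairs = [
            (pairs[i][0] + pairs[i + 1][0], pairs[i][1] + pairs[i + 1][1])
            for i in range(0, len(pairs), 2)
        ]
    return pairs[0]


def mean_and_variance(x):
    """Single pass over x: accumulate sum(x) and sum(x*x) together."""
    n = len(x)
    s, s2 = pair_tree_reduce((v, v * v) for v in x)
    mean = s / n
    var = s2 / n - mean * mean  # the subtraction where cancellation could occur
    return mean, var
```

For example, `mean_and_variance([1.0, 2.0, 3.0, 4.0])` gives a mean of 2.5 and a variance of 1.25.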
@cHHillee @karpathy This is what I was getting at. Perhaps I wasn’t explicit enough. But yah, with the input packed in fp16x2 registers you can go as high as 32k on the channel dim. Though the backward pass loads two tensors and can only fit 16k.
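A quick back-of-the-envelope check of those two limits, as a Python sketch. The helper names are hypothetical, and the 1024-thread block is an assumption for illustration only; the thread does not state the launch configuration. With two fp16 values per 32-bit register, 32k channels of one tensor and 16k channels of two tensors need the same number of data registers.

```python
def data_regs_per_row(channels, tensors=1, elems_per_reg=2):
    """32-bit registers needed to hold one row, fp16x2 packed (2 elems/reg)."""
    return -(-channels * tensors // elems_per_reg)  # ceiling division


def data_regs_per_thread(channels, tensors=1, threads=1024, elems_per_reg=2):
    """Data registers per thread if one row is spread over `threads` threads.

    threads=1024 is an assumed block size, not something from the thread.
    """
    return -(-data_regs_per_row(channels, tensors, elems_per_reg) // threads)


forward = data_regs_per_row(32 * 1024, tensors=1)   # one input tensor
backward = data_regs_per_row(16 * 1024, tensors=2)  # two tensors loaded
```

Both cases come out to 16384 data registers per row, or 16 per thread under the assumed 1024-thread block, which is consistent with the backward pass fitting half the channel count.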
@rzidane360 @karpathy It just isn't empirically observed. To see it you'd have to have a variance collapse, and with the way these networks are initialized and how distributions don't stray too far in training I don't see that as likely. And even if it were happening, these networks are robust to prec loss.
@karpathy Don't forget registers.. you have 64k*4*108 = 28M which is more than shared. And it's the fastest local state to leverage (followed by shared then L2)
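The arithmetic above, spelled out as a small Python sketch. The 64k registers per SM, 4 bytes per register, and 108 SMs come from the tweet; the shared-memory figure of 164 KB per SM is an assumption added here for comparison (it is the A100 maximum as far as I know) and is not stated in the thread.

```python
REGS_PER_SM = 64 * 1024   # 32-bit registers per SM (from the tweet)
BYTES_PER_REG = 4
NUM_SMS = 108             # from the tweet

# Assumption for comparison only: max shared memory per SM in bytes.
ASSUMED_SHARED_PER_SM = 164 * 1024


def register_file_bytes(regs_per_sm=REGS_PER_SM, num_sms=NUM_SMS):
    """Total register file capacity across the chip, in bytes."""
    return regs_per_sm * BYTES_PER_REG * num_sms


def shared_bytes(shared_per_sm=ASSUMED_SHARED_PER_SM, num_sms=NUM_SMS):
    """Total shared memory across the chip under the assumed per-SM size."""
    return shared_per_sm * num_sms
```

`register_file_bytes()` is 28,311,552 bytes, the "28M" in the tweet, and it exceeds the shared total under the assumed per-SM size.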
@rzidane360 @karpathy In practice this is never observed with training distributions.. and if you're paranoid you can convert to double precision and do subtraction there (this has zero overhead in this bandwidth bound op)
@karpathy Speedup will depend on dims and if the original multiple passes were able to be served out of L2. For things like ln, ln_grad, softmax, softmax_grad where you have multiple passes over inputs with reductions in between always do the math to see what you can fit in local state.
@unixpickle For TitanX (or pre-tensorcore gpus) you can check out my old blog about the assembler and matmul I wrote. Some of it is no longer relevant but a fair amount still is. https://t.co/XK1BCh9Iuu
Great, I'll let you work on it a bit :) Without the normal remapping 4b=>fp16 is pretty trivial. Just mask and shift the bits to the fp16 denorm position and apply scale/bias (works for sym and asym schemes). 1.5 instructions per element.
and.b32 a0, b4x8, 0xf000f000;
and.b32 a1, b4x8, 0x0f000f00;
and.b32 a2, b4x8, 0x00f000f0;
and.b32 a3, b4x8, 0x000f000f;
shr.b32 a0, a0, 6;
shr.b32 a1, a1, 2;
shl.b32 a2, a2, 2;
shl.b32 a3, a3, 6;
fma.rn.f16x2 a0, a0, scale, bias;
fma.rn.f16x2 a1, a1, scale, bias;
fma.rn.f16x2 a2, a2, scale, bias;
fma.rn.f16x2 a3, a3, scale, bias;
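To see why the mask-and-shift works, here is a hedged Python emulation of the bit manipulation. It assumes the four masks select the nibbles 0xf000f000, 0x0f000f00, 0x00f000f0 and 0x000f000f, each paired with the shift that lands it in bits 6..9 of each 16-bit half. A 4-bit value n in that position, read as an fp16 denormal, equals n * 2^-18. The function names are hypothetical, and the final scale/bias is done in ordinary Python floats rather than with a real fp16 fma, so the 2^18 factor is applied explicitly here instead of being folded into an fp16 scale.

```python
import struct


def fp16_from_bits(bits):
    """Reinterpret a 16-bit pattern as an IEEE half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


# (mask, shift): positive shift is shr, negative is shl, mirroring the PTX.
STEPS = [(0xF000F000, 6), (0x0F000F00, 2), (0x00F000F0, -2), (0x000F000F, -6)]


def unpack_4bit_x8(word, scale=1.0, bias=0.0):
    """Emulate the and/shift/fma sequence on one 32-bit word of 8 nibbles.

    Returns four (low half, high half) pairs, one per a0..a3.
    """
    out = []
    for mask, shift in STEPS:
        v = word & mask
        v = v >> shift if shift > 0 else (v << -shift) & 0xFFFFFFFF
        lo = fp16_from_bits(v & 0xFFFF)  # nibble now sits in mantissa bits 6..9
        hi = fp16_from_bits(v >> 16)
        # Denormal value is n * 2**-18, so multiply by 2**18 to recover n.
        out.append((lo * 2**18 * scale + bias, hi * 2**18 * scale + bias))
    return out
```

With `word = 0x76543210` and the default scale and bias, the pairs come back as (3.0, 7.0), (2.0, 6.0), (1.0, 5.0), (0.0, 4.0). That is 12 instructions for 8 elements, matching the 1.5 instructions per element figure.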
@Tim_Dettmers That's pretty expensive.. keep in mind SFU ops are much lower throughput (maybe 4x?). I've been pondering an in register lookup table using prmt and/or lop3. Ideally you generate 2 fp16 outputs in pairs. An int8 mapping might be useful as well.
@Tim_Dettmers How many instructions per element do you think you can get it down to? That is can you keep it fast for large batches, and not just small batch / bandwidth bound?
@Tim_Dettmers Mostly yah, though the attention op cannot be statically transposed. QK is fine I think.. WV less so. Though I guess you can transpose the V output from the previous projection. Of course by now we'd rather have some fp4 support. Inline conversion for that will likely be possible.
@Tim_Dettmers From the ptx docs: "The transpose operation is only supported for the wgmma.mma_async variants with .f16/ .bf16 types on matrices accessed from shared memory using matrix descriptors." So getting fp8 transposed is likely going to be tricky and inefficient.