Scott Gray @scottgray76 - Twitter Profile

over 1 year ago

@maharshii Back when I was building maxas, that yield bit was the hardest of the control codes to figure out. Still wasn't sure about it till I talked to folks at nvidia. I'm not surprised ptxas is still bad at finding optimal placement for it. https://t.co/TvOCqfV17r

Scott Gray @scottgray76

about 2 years ago

@cis_female Why not just use last block to zero the scratchpad: if (res+1 == gridDim.x)? But if you do need to poll you'll need something like ld.volatile or ld.relaxed (not just https://t.co/iKxtKdriL5).
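
The "last block cleans up" idea can be sketched on the host. This is a minimal emulation, not the kernel under discussion: Python threads stand in for thread blocks, the hypothetical `AtomicCounter.fetch_add` stands in for `atomicAdd` (it returns the old value, like `res` in the tweet), and `grid_dim` stands in for `gridDim.x`. The per-block partial results are made up for illustration.

```python
import threading

class AtomicCounter:
    """Host-side stand-in for a global-memory counter updated with atomicAdd."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, amount):
        # Like atomicAdd: returns the value held before the addition.
        with self._lock:
            old = self._value
            self._value += amount
            return old

    def reset(self):
        with self._lock:
            self._value = 0

def run_grid(grid_dim):
    counter = AtomicCounter()
    scratchpad = [0] * grid_dim   # one partial result per block
    result = {}                   # written only by the last block
    last_blocks = []              # which block(s) saw res + 1 == grid_dim

    def block(block_idx):
        scratchpad[block_idx] = block_idx + 1   # this block's (made-up) partial
        # A real kernel would need a __threadfence() here so the partial is
        # visible to other blocks before the counter is bumped.
        res = counter.fetch_add(1)
        if res + 1 == grid_dim:                 # every other block has arrived
            result["total"] = sum(scratchpad)
            for i in range(grid_dim):           # leave clean state for the next launch
                scratchpad[i] = 0
            counter.reset()
            last_blocks.append(block_idx)

    threads = [threading.Thread(target=block, args=(i,)) for i in range(grid_dim)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["total"], scratchpad, last_blocks

if __name__ == "__main__":
    print(run_grid(8))
```

No block ever polls here: exactly one block observes `res + 1 == grid_dim`, and only that block touches the shared state afterwards.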

Scott Gray @scottgray76

about 2 years ago

@cHHillee @karpathy You'd also want a custom cg::reduce that can operate on float2. Anyway this math has been used to successfully train and inference models at scale. Though now-a-days I think rmsnorm is preferred and all this is a moot point.

Scott Gray @scottgray76

about 2 years ago

@cHHillee @karpathy I was suggesting doing both the numerical shortcut and loading acts into registers for reuse. The shortcut is not unstable given you can do accumulations in close to log(n) serial steps, cancelation is not an issue and you really only need ~3 bits of accurate mantissa at output.
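
A rough sketch of the shortcut described here, assuming it means computing the variance as E[x^2] - E[x]^2 from a single pass that reduces (sum, sum of squares) pairs in a tree. The helper names are hypothetical, float32 arithmetic is approximated by rounding each intermediate with `struct`, and the input is assumed non-empty.

```python
import struct

def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def tree_reduce_moments(values):
    """Pairwise-reduce (sum, sum of squares) pairs; return the pair and the step count."""
    pairs = [(f32(v), f32(f32(v) * f32(v))) for v in values]
    steps = 0
    while len(pairs) > 1:
        if len(pairs) % 2:                  # odd count: pad with the identity element
            pairs.append((0.0, 0.0))
        pairs = [(f32(a[0] + b[0]), f32(a[1] + b[1]))
                 for a, b in zip(pairs[0::2], pairs[1::2])]
        steps += 1                          # one serial step per tree level
    return pairs[0], steps

def mean_and_variance(values):
    (total, total_sq), steps = tree_reduce_moments(values)
    n = len(values)
    mean = f32(total / n)
    variance = f32(f32(total_sq / n) - f32(mean * mean))   # E[x^2] - E[x]^2
    return mean, variance, steps

if __name__ == "__main__":
    print(mean_and_variance([float(i) for i in range(1, 9)]))
```

Each partial sum only ever absorbs another partial of similar size, and the serial depth is about log2(n) rather than n, which is the property the tweet leans on.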

Scott Gray @scottgray76

about 2 years ago

@cHHillee @karpathy This is what I was getting at. Perhaps I wasn’t explicit enough. But yah, with the input packed in fp16x2 registers you can go as high as 32k on the channel dim. Though the backward pass loads two tensors and can only fit 16k.
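
The two limits quoted here are consistent with one another: a quick footprint calculation shows both come to the same amount of packed data per row. The helper below is hypothetical, and the tweet does not say how the budget itself was derived, so treat the 64 KiB figure as an inference from the numbers only.

```python
def packed_registers_needed(channels, tensors, bytes_per_element=2, register_bytes=4):
    """32-bit registers needed to hold `tensors` rows of `channels` packed elements."""
    total_bytes = channels * tensors * bytes_per_element
    return -(-total_bytes // register_bytes)   # ceiling division

# Forward pass: one fp16 tensor, 32k channels.
forward_regs = packed_registers_needed(32 * 1024, tensors=1)
# Backward pass: two fp16 tensors, 16k channels each.
backward_regs = packed_registers_needed(16 * 1024, tensors=2)

if __name__ == "__main__":
    print(forward_regs, backward_regs, forward_regs * 4 // 1024, "KiB")
```

Both cases need the same number of 32-bit registers for the row data, which is why doubling the number of tensors halves the channel limit.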

Scott Gray @scottgray76

about 2 years ago

@rzidane360 @karpathy It just isn't empirically observed. To see it you'd have to have a variance collapse, and with the way these networks are initialized and how distributions don't stray too far in training I don't see that as likely. And even if it were happening, these networks are robust to prec loss

Scott Gray @scottgray76

about 2 years ago

@karpathy Don't forget registers.. you have 64k*4*108 = 28M which is more than shared. And it's the fastest local state to leverage (followed by shared then L2)
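
The arithmetic in this tweet can be reproduced directly. The register figures (64k 32-bit registers per SM, 108 SMs) come from the tweet; the shared-memory figure of 164 KiB per SM is an assumption about an A100-class part and is not stated in the thread.

```python
KIB = 1024

def register_file_bytes(regs_per_sm=64 * KIB, bytes_per_reg=4, num_sms=108):
    """Total register file capacity across the chip, in bytes."""
    return regs_per_sm * bytes_per_reg * num_sms

def shared_memory_bytes(shared_per_sm=164 * KIB, num_sms=108):
    # Assumption: 164 KiB of configurable shared memory per SM.
    return shared_per_sm * num_sms

if __name__ == "__main__":
    regs = register_file_bytes()
    shared = shared_memory_bytes()
    print(f"registers: {regs / 1e6:.1f} MB, shared: {shared / 1e6:.1f} MB")
```

Under these assumptions the register file holds roughly 28 MB (decimal) against roughly 18 MB of shared memory, which matches the "more than shared" claim.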

Scott Gray @scottgray76

about 2 years ago

@rzidane360 @karpathy In practice this is never observed with training distributions.. and if you're paranoid you can convert to double precision and do subtraction there (this has zero overhead in this bandwidth bound op)
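
To show what the "paranoid" option buys, here is a deliberately contrived input, assuming the subtraction in question is E[x^2] - E[x]^2. The two values are chosen so the mean is far larger than the spread, which is the situation the tweet says does not arise in training. The helper names are hypothetical and float32 is approximated with `struct` rounding.

```python
import struct

def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def variance_from_sums(total, total_sq, n, subtract_in_double):
    """Finish a one-pass variance from float32 sums, in float32 or in double."""
    if subtract_in_double:
        mean = total / n                  # Python floats are IEEE doubles
        return total_sq / n - mean * mean
    mean = f32(total / n)
    return f32(f32(total_sq / n) - f32(mean * mean))

xs = [2047.0, 2050.0]                     # exact variance is 2.25
total = f32(sum(xs))                      # both sums are exact in float32 here
total_sq = f32(sum(f32(x * x) for x in xs))

var_f32 = variance_from_sums(total, total_sq, len(xs), subtract_in_double=False)
var_f64 = variance_from_sums(total, total_sq, len(xs), subtract_in_double=True)

if __name__ == "__main__":
    print(var_f32, var_f64)
```

Even with exact float32 sums, squaring the mean needs more than 24 bits here, so the float32 finish is off by a quarter while the double finish recovers 2.25. The conversion happens once per row, after the reduction, so it adds no memory traffic.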

Scott Gray @scottgray76

about 2 years ago

@karpathy Speedup will depend on dims and if the original multiple passes were able to be served out of L2. For things like ln, ln_grad, softmax, softmax_grad where you have multiple passes over inputs with reductions in between always do the math to see what you can fit in local state.

Scott Gray @scottgray76

over 2 years ago

OpenAI is nothing without its people

Scott Gray @scottgray76

over 2 years ago

❤️

Sam Altman

@sama

over 2 years ago

i love the openai team so much

Scott Gray @scottgray76

over 2 years ago

@unixpickle For TitanX (or pre-tensorcore gpus) you can check out my old blog about the assembler and matmul I wrote. Some of it is no longer relevant but there's still a fair amount that still is. https://t.co/XK1BCh9Iuu

Scott Gray @scottgray76

about 3 years ago

Great, I'll let you work on it a bit :) Without the normal remapping 4b=>fp16 is pretty trivial. Just mask and shift the bits to the fp16 denorm position and apply scale/bias (works for sym and asym schemes). 1.5 instructions per element.
and.b32 a0, b4x8, 0xf000f000;
and.b32 a1, b4x8, 0x0f000f00;
and.b32 a2, b4x8, 0x00f000f0;
and.b32 a3, b4x8, 0x000f000f;
shr.b32 a0, a0, 6;
shr.b32 a1, a1, 2;
shl.b32 a2, a2, 2;
shl.b32 a3, a3, 6;
fma.rn.f16x2 a0, a0, scale, bias;
fma.rn.f16x2 a1, a1, scale, bias;
fma.rn.f16x2 a2, a2, scale, bias;
fma.rn.f16x2 a3, a3, scale, bias;
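
The mask-and-shift trick can be checked bit for bit on the host. The sketch below is an emulation under stated assumptions: one distinct mask per nibble position (0xf000f000, 0x0f000f00, 0x00f000f0, 0x000f000f), unsigned 4-bit codes, and hypothetical parameters `step` and `zero_offset` for the quantization grid. Each code lands in mantissa bits 6..9 of an fp16 denormal, where it reads as v * 2**-18, so the fp16 `scale` operand has to carry a factor of 2**18.

```python
import struct

def half_from_bits(bits):
    """Reinterpret the low 16 bits of an integer as an IEEE fp16 value."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]

def round_to_half(x):
    """Round a Python float to the nearest fp16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fma_f16x2(reg, scale, bias):
    """Emulate fma.rn.f16x2 with the same scale and bias in both halves."""
    lo = half_from_bits(reg)
    hi = half_from_bits(reg >> 16)
    return round_to_half(lo * scale + bias), round_to_half(hi * scale + bias)

def unpack_b4x8(b4x8, step, zero_offset):
    """Decode eight unsigned 4-bit codes v into v * step + zero_offset."""
    scale = round_to_half(step * 2.0 ** 18)    # undo the 2**-18 denormal weight
    bias = round_to_half(zero_offset)
    a0 = (b4x8 & 0xF000F000) >> 6              # nibbles 3 and 7
    a1 = (b4x8 & 0x0F000F00) >> 2              # nibbles 2 and 6
    a2 = ((b4x8 & 0x00F000F0) << 2) & 0xFFFFFFFF   # nibbles 1 and 5
    a3 = ((b4x8 & 0x000F000F) << 6) & 0xFFFFFFFF   # nibbles 0 and 4
    return [fma_f16x2(a, scale, bias) for a in (a0, a1, a2, a3)]

if __name__ == "__main__":
    # Codes 0..7 packed low nibble first, on a grid of 0.125 starting at -1.0.
    print(unpack_b4x8(0x76543210, step=0.125, zero_offset=-1.0))
```

One consequence the emulation makes visible: `step * 2**18` must itself fit in fp16, so `step` has to stay below roughly 0.25 (Python raises `OverflowError` otherwise). The instruction count is 12 for 8 outputs, which is the 1.5 per element quoted in the tweet.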

Scott Gray @scottgray76

about 3 years ago

@Tim_Dettmers That's pretty expensive.. keep in mind SFU ops are much lower throughput (maybe 4x?). I've been pondering an in register lookup table using prmt and/or lop3. Ideally you generate 2 fp16 outputs in pairs. An int8 mapping might be useful as well.
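
One way an in-register lookup via prmt might look, for the int8 mapping mentioned at the end of the tweet. This is only an illustration of the idea, not a design from the thread: it emulates the default mode of `prmt.b32` on the host, uses a made-up 8-entry table held in two 32-bit registers, and looks up four 3-bit codes per instruction. The helper names are hypothetical.

```python
def prmt(a, b, selector):
    """Emulate prmt.b32 in default mode: pick 4 bytes out of the 8 in {b, a}."""
    src = (a & 0xFFFFFFFF) | ((b & 0xFFFFFFFF) << 32)   # bytes 0..3 from a, 4..7 from b
    out = 0
    for i in range(4):
        nibble = (selector >> (4 * i)) & 0xF
        byte = (src >> (8 * (nibble & 7))) & 0xFF
        if nibble & 8:                                  # msb set: replicate the sign bit
            byte = 0xFF if byte & 0x80 else 0x00
        out |= byte << (8 * i)
    return out

def pack_int8x4(values):
    """Pack four signed 8-bit integers into one 32-bit word, low byte first."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word

def unpack_int8x4(word):
    """Split a 32-bit word into four signed 8-bit integers, low byte first."""
    raw = [(word >> (8 * i)) & 0xFF for i in range(4)]
    return [r - 256 if r & 0x80 else r for r in raw]

def lookup4(table, codes):
    """Map four 3-bit codes (one per nibble of `codes`) through an 8-entry int8 table."""
    table_lo = pack_int8x4(table[:4])
    table_hi = pack_int8x4(table[4:])
    # Mask to 3 bits per nibble so no selector triggers sign replication.
    return unpack_int8x4(prmt(table_lo, table_hi, codes & 0x7777))

if __name__ == "__main__":
    example_table = [-8, -4, -2, -1, 1, 2, 4, 8]   # made-up nonuniform levels
    print(lookup4(example_table, 0x7350))
```

A full 4-bit code needs 16 entries, which does not fit in a single prmt; handling that, and producing fp16 pairs rather than int8, is left open here just as it is in the tweet.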

Scott Gray @scottgray76

about 3 years ago

@Tim_Dettmers How many instructions per element do you think you can get it down to? That is can you keep it fast for large batches, and not just small batch / bandwidth bound?

Scott Gray @scottgray76

about 3 years ago

@soumithchintala @NumFOCUS I'll preemptively match Soumith's match. I've gotten way more value than this out of those tools over the years. https://t.co/6BoPJKSi4e

Scott Gray @scottgray76

over 3 years ago

@Tim_Dettmers Mostly yah. though attention op cannot be statically transposed. QK is fine I think.. WV less so. Though I guess you can transpose the V output from the previous projection. Of course by now we'd rather have some fp4 support. Inline conversion for that will likely be possible

Scott Gray @scottgray76

over 3 years ago

@Tim_Dettmers From the ptx docs: "The transpose operation is only supported for the wgmma.mma_async variants with .f16/ .bf16 types on matrices accessed from shared memory using matrix descriptors." So getting fp8 transposed is likely going to be tricky and inefficient.
