Jack Cook @jackcookjack - Twitter Profile

Pinned Tweet

6 months ago

Training LLMs with NVFP4 is hard because FP4 has so few values that I can fit them all in this post: ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}. But what if I told you that reducing this range even further could actually unlock better training + quantization performance? Introducing Four Over Six, a new method for improving the accuracy of NVFP4 quantization with Adaptive Block Scaling. 🧵
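A minimal sketch of the idea, not the authors' implementation. The post names the FP4 values and says the method shrinks the range with adaptive block scaling. Everything else here is an assumption: blocks of 16 values, a full-precision per-block scale (real NVFP4 stores low-precision block scales), round-to-nearest, mean squared error as the selection criterion, and the hypothetical helper names. Each block is quantized twice, once with its largest magnitude mapped to 6 and once mapped to 4, and the lower-error result is kept.

```python
import numpy as np

# Non-negative FP4 magnitudes, as listed in the post.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_grid(x, grid):
    """Round each element of x to the nearest signed grid value."""
    idx = np.abs(np.abs(x)[..., None] - grid).argmin(axis=-1)
    return np.sign(x) * grid[idx]

def quantize_block(block, target_max):
    """Fake-quantize one block so its largest magnitude maps to target_max."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block)
    scale = amax / target_max  # assumption: scale kept in full precision
    return round_to_grid(block / scale, FP4_GRID) * scale

def adaptive_quantize_block(block):
    """Try target maxima 6 and 4; return (target_max, dequantized, mse) of the better one."""
    best = None
    for target_max in (6.0, 4.0):
        deq = quantize_block(block, target_max)
        err = float(np.mean((block - deq) ** 2))
        if best is None or err < best[2]:
            best = (target_max, deq, err)
    return best
```

Calling `adaptive_quantize_block(np.random.default_rng(0).standard_normal(16))` returns the chosen target maximum along with the dequantized block and its error, so you can count how often each option wins on your own data.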

Jack Cook @jackcookjack

23 days ago

@HanGuo97 @thinkymachines Anytime!

Jack Cook @jackcookjack

23 days ago

Now seems like a good time to share that I’ve recently joined @thinkymachines to work on pretraining! Very excited to work on the future of human-AI collaboration with this amazing team.

Thinking Machines

@thinkymachines

23 days ago

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. https://t.co/AFJZ5kH7Ku

jackcookjack retweeted

alex zhang

@a1zhang

2 months ago

4/6 --> 6/7... jokes aside jack put this out crazy fast, and it's a very clever idea that I hope gets standardized in new hardware :)

jackcookjack retweeted

Andrei Panferov @ICLR @black_samorez

2 months ago

Very cool paper! Glad to see our rotation-based unbiased gradient estimation scheme from Quartet II improve quality as well.

Jack Cook @jackcookjack

2 months ago

@CliffLattner https://t.co/3r3tJQJl50

Jack Cook @jackcookjack

2 months ago

Thanks for the question! For normally distributed inputs, NVINT4 can actually yield less error than NVFP4 (see Table 1 in our paper). This can result in INT4 being selected more often with IF4: for a random input drawn from N(0, 1), I'm seeing that INT4 is selected 63.8% of the time. We'll try to include some more analysis on selection statistics in an update to our paper soon!

Jack Cook @jackcookjack

2 months ago

NVFP4 allows models to be quantized to 4 bits without too much performance degradation, but can we push 4-bit performance even further?

Today, we're releasing a new class of low-precision block-scaled data types that natively adapt to your input data: for 4-bit quantization, IF4 (Int/Float 4) allows each scaled group of 16 values to be saved as FP4 or INT4 depending on which option offers less error. Selections are recorded using the scale factor’s sign bit, which is unused in NVFP4, allowing IF4 to offer better performance with no memory overhead!

Our data types provide better downstream accuracy in LLMs, they can be implemented efficiently in next-generation hardware accelerators, and they reveal some interesting insights about low-bit quantization! 🧵
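A rough simulation of the per-group selection described above, not the released code. Assumptions not taken from the post: INT4 is modeled as sign-magnitude with magnitudes 0 to 7, the scale is kept in full precision, mean squared error is the selection criterion, a negative scale marks an INT4 group, and all function names are hypothetical.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # FP4 magnitudes
INT4_GRID = np.arange(8, dtype=np.float64)  # assumed INT4 magnitudes 0..7

def encode(x, grid):
    """Map each value to a 4-bit code: 1 sign bit plus a 3-bit grid index."""
    idx = np.abs(np.abs(x)[:, None] - grid).argmin(axis=1).astype(np.uint8)
    neg = (x < 0).astype(np.uint8)
    return (neg << 3) | idx

def decode(codes, grid):
    """Inverse of encode for the given grid."""
    mag = grid[codes & 0x7]
    return np.where((codes >> 3) == 1, -mag, mag)

def if4_quantize_block(block):
    """Return (signed_scale, codes); the sign of the scale records the chosen format."""
    amax = float(np.abs(block).max())
    best = None
    for grid, sign in ((FP4_GRID, 1.0), (INT4_GRID, -1.0)):
        scale = amax / grid[-1] if amax > 0 else 1.0
        codes = encode(block / scale, grid)
        err = float(np.mean((block - decode(codes, grid) * scale) ** 2))
        if best is None or err < best[0]:
            best = (err, sign * scale, codes)
    return best[1], best[2]

def if4_dequantize_block(signed_scale, codes):
    """Pick the grid from the scale's sign, then rescale by its magnitude."""
    grid = INT4_GRID if signed_scale < 0 else FP4_GRID
    return decode(codes, grid) * abs(signed_scale)
```

Because the format choice rides on the scale's sign, the stored payload stays at sixteen 4-bit codes plus one scale per group, which is the "no memory overhead" point the post makes.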

jackcookjack retweeted

Charles 🎉 Frye

@charles_irl

2 months ago

doot doot

jackcookjack retweeted

Peyton Walters

@peywalt

2 months ago

floats aren't cool. you know what's cool? integers.

Jack Cook @jackcookjack

2 months ago

Check out our paper for more analysis, and our GitHub repo if you want to experiment with low-precision block-scaled quantization schemes yourself! We also have more stuff coming out soon, especially related to 4/6, so stay tuned! Code: https://t.co/OPtOdaiVsb Paper: https://t.co/Q9ptj72zNg

Jack Cook @jackcookjack

2 months ago

You can simulate IF data types today using higher-precision formats, but we also show that IF4 can be implemented efficiently in next-generation hardware accelerators! We design and evaluate an IF4 multiply-accumulate (MAC) unit and find that latency increases by just 4.7% compared to a baseline NVFP4 MAC unit.

jackcookjack retweeted

Dan Alistarh @DAlistarh

4 months ago

Happy to release Quartet II, a new method that pushes the frontier of 4-bit LLM training in NVFP4. Fully-quantized pre-training in NVFP4 can now match FP8/FP16 quality much more closely, while maintaining full hardware acceleration! [1/4]

Jack Cook @jackcookjack

5 months ago

oh, you want a kernel that'll be right about 93% of the time and have tons of really weird and unpredictable edge cases? yeah I'd recommend Triton

jackcookjack retweeted

Charles 🎉 Frye

@charles_irl

5 months ago

There was a flippening in the last few months: you can run your own LLM inference with rates and performance that match or beat LLM inference APIs. We wrote up the techniques to do so in a new guide, along with code samples. https://t.co/vpNGPZ0vgM

Jack Cook @jackcookjack

5 months ago

Here's a non-obvious problem with block-scaled quantized Attention: at the edge of your causal mask, later tokens can leak information to earlier ones through the scale factor computation. I wouldn't expect this leakage to matter very much since it affects scales, not values, but it turns out it does actually cause the loss to decrease a little too quickly! Very cool post by @tensorpro and team.

tensorpro

@tensorpro

5 months ago

We trained models with MXFP4-quantized attention, but it turns out this can break causal modeling. Our latest post explains why this happens and how to fix it. https://t.co/40D9VW0gBu
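A toy illustration of how a shared scale can carry information across a causal boundary. It does not reproduce the linked post's setup: the block layout, the full-precision absmax scale, the FP4 grid, and the helper names are all assumptions. One block holds values for four token positions, only the first two of which are visible. If the block is quantized before masking, the scale depends on the hidden positions, so changing a hidden value changes what the visible positions dequantize to.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize(block):
    """Absmax-scale one block onto the FP4 grid, then dequantize."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block)
    scale = amax / FP4_GRID[-1]  # one scale shared by the whole block
    scaled = block / scale
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

def visible_values(block, n_visible, mask_first):
    """Dequantized values at visible positions, optionally zeroing hidden ones first."""
    block = block.copy()
    if mask_first:
        block[n_visible:] = 0.0  # hidden positions can no longer affect the scale
    return fake_quantize(block)[:n_visible]

# Two inputs that differ only at a hidden (future) position.
a = np.array([0.7, 2.0, 0.3, 6.0])
b = np.array([0.7, 2.0, 0.3, 12.0])

leaky_a = visible_values(a, 2, mask_first=False)
leaky_b = visible_values(b, 2, mask_first=False)
safe_a = visible_values(a, 2, mask_first=True)
safe_b = visible_values(b, 2, mask_first=True)

print("quantize then mask:", leaky_a, leaky_b)
print("mask then quantize:", safe_a, safe_b)
```

Masking before computing the scale removes the dependence in this toy case; that is one possible remedy and not necessarily the fix the linked post proposes.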
