Training LLMs with NVFP4 is hard because FP4 has so few values that I can fit them all in this post: ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}. But what if I told you that reducing this range even further could actually unlock better training + quantization performance?
Introducing Four Over Six, a new method for improving the accuracy of NVFP4 quantization with Adaptive Block Scaling. 🧵
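To make the "reduce the range" idea concrete, here is a minimal sketch of what adaptive block scaling could look like. It assumes the method means quantizing each block twice, once with the block maximum mapped to 6 and once mapped to 4, and keeping whichever has lower error. That is a reading of the name, not a confirmed description of the method. The function names are hypothetical, and the sketch ignores the FP8 scale format and any per-tensor scaling that real NVFP4 uses.

```python
import numpy as np

# The eight non-negative FP4 magnitudes listed in the post.
FP4_GRID = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6])

def round_to_grid(x, grid):
    """Round each element of x to the nearest magnitude in grid, keeping its sign."""
    mags = np.abs(x)[..., None]
    idx = np.argmin(np.abs(mags - grid), axis=-1)
    return np.sign(x) * grid[idx]

def quantize_block(block, target_max):
    """Scale the block so its largest magnitude maps to target_max, then round to FP4."""
    amax = np.max(np.abs(block))
    if amax == 0:
        return np.zeros_like(block)
    scale = amax / target_max  # kept in full precision here (simplification)
    return round_to_grid(block / scale, FP4_GRID) * scale

def adaptive_quantize_block(block):
    """Assumed selection rule: try target maxima 6 and 4, keep the lower-MSE result."""
    q6 = quantize_block(block, 6.0)
    q4 = quantize_block(block, 4.0)
    err6 = np.mean((block - q6) ** 2)
    err4 = np.mean((block - q4) ** 2)
    return (q4, 4) if err4 < err6 else (q6, 6)
```

The intuition: with the maximum mapped to 6, a value near 5/6 of the block maximum has to round to 4 or 6. With the maximum mapped to 4, that same value lands near 3, which FP4 can represent much more closely.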
Now seems like a good time to share that I’ve recently joined @thinkymachines to work on pretraining! Very excited to work on the future of human-AI collaboration with this amazing team.
People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way.
We share our approach, early results, and a quick look at our model in action.
https://t.co/AFJZ5kH7Ku
Thanks for the question! For normally distributed inputs, NVINT4 can actually yield less error than NVFP4 (see Table 1 in our paper). This can result in INT4 being selected more often with IF4: for a random input drawn from N(0, 1), I'm seeing that INT4 is selected 63.8% of the time. We'll try to include some more analysis on selection statistics in an update to our paper soon!
NVFP4 allows models to be quantized to 4 bits without too much performance degradation, but can we push 4-bit performance even further?
Today, we're releasing a new class of low-precision block-scaled data types that natively adapt to your input data: for 4-bit quantization, IF4 (Int/Float 4) allows each scaled group of 16 values to be saved as FP4 or INT4 depending on which option offers less error. Selections are recorded using the scale factor’s sign bit, which is unused in NVFP4, allowing IF4 to offer better performance with no memory overhead!
Our data types provide better downstream accuracy in LLMs, they can be implemented efficiently in next-generation hardware accelerators, and they reveal some interesting insights about low-bit quantization! 🧵
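A rough simulation of the per-group selection described above, in higher precision. Several details are assumptions rather than anything stated in the post: the INT4 grid is taken to be symmetric integers in ±7, a negative scale is used to mean "INT4" (the actual polarity of the sign bit is not given), the scale is kept in full precision, and the function names are hypothetical.

```python
import numpy as np

FP4_GRID = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6])
INT4_GRID = np.arange(8, dtype=np.float64)  # assumed symmetric INT4 range: 0..7 magnitudes

def round_to_grid(x, grid):
    """Round each element of x to the nearest magnitude in grid, keeping its sign."""
    mags = np.abs(x)[..., None]
    idx = np.argmin(np.abs(mags - grid), axis=-1)
    return np.sign(x) * grid[idx]

def quantize_if4_group(group):
    """Quantize one group of 16 values as FP4 or INT4, whichever has lower MSE.

    Returns (q, signed_scale). The sign of the scale carries the format choice:
    positive = FP4, negative = INT4 (polarity is an assumption of this sketch).
    """
    assert group.shape == (16,)
    amax = np.max(np.abs(group))
    if amax == 0:
        return np.zeros_like(group), 1.0
    best = None
    for grid, sign in ((FP4_GRID, 1.0), (INT4_GRID, -1.0)):
        scale = amax / grid[-1]  # map the group maximum to the grid maximum
        q = round_to_grid(group / scale, grid)
        err = np.mean((group - q * scale) ** 2)
        if best is None or err < best[0]:
            best = (err, q, sign * scale)
    return best[1], best[2]

def dequantize_if4_group(q, signed_scale):
    """q already holds decoded grid values here, so only the magnitude of the scale is needed."""
    return q * abs(signed_scale)
```

In real hardware the sign bit would tell the decoder how to interpret each 4-bit code. In this simulation the codes are stored already decoded, so the sign only records which format won.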
Check out our paper for more analysis, and our GitHub repo if you want to experiment with low-precision block-scaled quantization schemes yourself! We also have more stuff coming out soon, especially related to 4/6, so stay tuned!
Code: https://t.co/OPtOdaiVsb
Paper: https://t.co/Q9ptj72zNg
You can simulate IF data types today using higher-precision formats, but we also show that IF4 can be implemented efficiently in next-generation hardware accelerators! We design and evaluate an IF4 multiply-accumulate (MAC) unit and find that latency increases by just 4.7% compared to a baseline NVFP4 MAC unit.
Happy to release Quartet II, a new method that pushes the frontier of 4-bit LLM training in NVFP4.
Fully-quantized pre-training in NVFP4 can now match FP8/FP16 quality much more closely, while maintaining full hardware acceleration!
[1/4]
There was a flippening in the last few months: you can run your own LLM inference with rates and performance that match or beat LLM inference APIs.
We wrote up the techniques to do so in a new guide, along with code samples.
https://t.co/vpNGPZ0vgM
Here's a non-obvious problem with block-scaled quantized Attention: at the edge of your causal mask, later tokens can leak information to earlier ones through the scale factor computation.
I wouldn't expect this leakage to matter very much since it affects scales, not values, but it turns out it does actually cause the loss to decrease a little too quickly! Very cool post by @tensorpro and team.
We trained models with MXFP4-quantized attention, but it turns out this can break causal modeling. Our latest post explains why this happens and how to fix it.
https://t.co/40D9VW0gBu
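A toy illustration of how a shared scale factor can leak information across a causal boundary. It assumes one scale block of 32 values spans tokens on both sides of the mask edge, and it uses a simplified power-of-two scale rule rather than the exact MX specification. The function names, and the "visible only" variant shown for contrast, are hypothetical and not necessarily the fix from the linked post.

```python
import numpy as np

FP4_GRID = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6])

def round_to_grid(x, grid):
    """Round each element of x to the nearest magnitude in grid, keeping its sign."""
    mags = np.abs(x)[..., None]
    idx = np.argmin(np.abs(mags - grid), axis=-1)
    return np.sign(x) * grid[idx]

def mx_quantize_block(block):
    """Quantize a block with one shared power-of-two scale (simplified MX-style rule)."""
    amax = np.max(np.abs(block))
    if amax == 0:
        return np.zeros_like(block)
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0))  # smallest power of two that avoids clipping
    return round_to_grid(block / scale, FP4_GRID) * scale

def leaky_prefix(block, t):
    """Scale is computed over the whole block, including positions after t."""
    return mx_quantize_block(block)[: t + 1]

def visible_only_prefix(block, t):
    """Scale is computed only from positions up to and including t."""
    return mx_quantize_block(block[: t + 1])

# Same 31 "past" values, two different "future" values at the last position.
past = np.linspace(-3.0, 3.0, 31)
block_small_future = np.concatenate([past, [0.0]])
block_large_future = np.concatenate([past, [50.0]])
```

Comparing `leaky_prefix(block_small_future, 30)` with `leaky_prefix(block_large_future, 30)` shows the dequantized past values changing when only the future token changes, because the large future value inflates the shared scale. With `visible_only_prefix` the two results are identical.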