Every time I reproduce a post like this, a decent AI like ChatGPT actually answers correctly?
This problem is frequently just a bias due to the way networks tokenize, and a fast service like Google's AI Overview does not do proper thinking. Humans have weird cognitive biases too if you prompt them quickly. Ask someone to quickly answer: If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?
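For reference, here is the arithmetic behind that puzzle as a minimal sketch (the function name `minutes_needed` and its parameters are made up for illustration). Each machine makes one widget in 5 minutes and the machines work in parallel, so the intuitive answer of 100 minutes is wrong:

```python
from fractions import Fraction

def minutes_needed(machines, widgets, ref_machines=5, ref_minutes=5, ref_widgets=5):
    # Rate of a single machine in widgets per minute, derived from the
    # reference case (5 machines, 5 minutes, 5 widgets -> 1/5 widget/min).
    rate_per_machine = Fraction(ref_widgets, ref_machines * ref_minutes)
    # All machines run in parallel, so the total rate scales with their count.
    return Fraction(widgets) / (machines * rate_per_machine)

print(minutes_needed(100, 100))  # 5
```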
@didier_lopes @Tim_Dettmers You’ll find that the ‘super weights’ are just very significant weights in the channels causing this large outlier behavior :)
@didier_lopes @Tim_Dettmers Basically, just clipping the large activations is very harmful. If you remove some of the larger weights in the corresponding channels, you similarly reduce the activations, which breaks the model.
This happens on any transformer with softmax attention - worse the longer you train
I wonder if the negative sink weight rejection is more of an optimization issue. In our original paper describing both sinks and gated attention: https://t.co/ZH8P6j10Jm, we also showed how clipping the softmax gets rid of sink behavior. It must be at least part of the explanation?
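A minimal sketch of what "clipping the softmax" can look like, assuming the form clip((zeta - gamma) * softmax(x) + gamma, 0, 1). The `gamma` and `zeta` values here are illustrative assumptions, not tuned settings. The intuition: plain softmax can never output an exact zero, so a head that wants to ignore its inputs has to park its attention mass somewhere; with the stretch-and-clip, small probabilities become exact zeros at finite logits.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of logits.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def clipped_softmax(scores, gamma=-0.03, zeta=1.0):
    # Stretch probabilities from [0, 1] to [gamma, zeta], then clip to [0, 1].
    # gamma < 0 lets small probabilities reach exactly 0 (values are assumptions).
    stretched = [(zeta - gamma) * p + gamma for p in softmax(scores)]
    return [min(1.0, max(0.0, p)) for p in stretched]

scores = [4.0, 0.0, 0.0, 0.0]
plain = softmax(scores)
clipped = clipped_softmax(scores)
print(plain)    # every entry is strictly positive
print(clipped)  # the three small entries are clipped to exactly 0.0
```

Note that the clipped outputs no longer have to sum to 1, which is the point: a head can attend to (almost) nothing without needing a sink token.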
Why repetition works so well is still an open question. There's a lot to uncover about training dynamics of SFT, and we hope this is a useful data point.
Joint work with co-authors @Sagar_Vaze @TiRune @y_m_asano
Paper: https://t.co/Jkk1jVPFj5
Code: https://t.co/PoaYWUZbsq
Today, @NVIDIA is launching the open Nemotron 3 model family, starting with Nano (30B-3A), which pushes the frontier of accuracy and inference efficiency with a novel hybrid SSM Mixture of Experts architecture. Super and Ultra are coming in the next few months.
Looking for cracked full-time Deep Learning researchers on Efficiency, Quantization and Sparsity. Join our world-class applied deep learning research team at NVIDIA. The team creates the Nemotron models, and we influence the hardware with our research. Shoot me a message! I'm at NeurIPS!
@MinChonChiSF @gu_xiangming @Alibaba_Qwen Known since 2023 btw - https://t.co/OTCvGynbrU <- our outlier paper already used gated attention to get rid of attention-sink behavior.