Tijmen Blankevoort

Verified account

@TiRune

Deep Learning Researcher Nvidia - Efficiency/Numerics

Amsterdam, The Netherlands

Joined May 2009

216 Following

681 Followers

512 Posts

Tijmen Blankevoort

8 days ago

Every time I reproduce a post like this, a decent AI like ChatGPT actually answers correctly? This problem is frequently just a bias due to the way networks tokenize, and a fast service like Google AI Overview does not do proper thinking. Humans have weird cognitive biases too if you prompt them quickly. Ask someone to quickly answer: If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?

0

0

0

0

197
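
For the widget question in the post above, the quick intuitive answer is usually 100 minutes, while working through the production rate gives 5. A minimal sketch of that arithmetic, assuming every machine works independently at the same constant rate (the function name `minutes_needed` and its parameters are illustrative, not from the post):

```python
from fractions import Fraction

def minutes_needed(machines, widgets, base_machines=5, base_minutes=5, base_widgets=5):
    """Time for `machines` machines to make `widgets` widgets.

    Assumes each machine works independently at a constant rate,
    calibrated from the puzzle statement (5 machines, 5 minutes, 5 widgets).
    """
    # Widgets produced per machine per minute: 5 / (5 * 5) = 1/5.
    rate = Fraction(base_widgets, base_machines * base_minutes)
    # Total time = widgets / (machines * rate per machine).
    return Fraction(widgets) / (machines * rate)

print(minutes_needed(100, 100))  # prints 5
```

Each machine needs 5 minutes per widget, so scaling machines and widgets together leaves the time unchanged.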

Tijmen Blankevoort

8 days ago

@norpadon @AleksandrosSob1 Tried doing stochastic rounding instead? We did this for the Nemotron releases for SSM quantization.

0

3

0

0

192
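
The stochastic rounding mentioned in the reply above rounds a value up or down at random, with the probability of rounding up equal to the fractional distance to the lower grid point, so the result is unbiased in expectation. A minimal generic sketch, not the Nemotron quantization recipe; the name `stochastic_round` and the `step` and `rng` parameters are assumptions made for illustration:

```python
import math
import random
from typing import Optional

def stochastic_round(x: float, step: float = 1.0,
                     rng: Optional[random.Random] = None) -> float:
    """Round x to a multiple of `step`, up or down at random.

    The probability of rounding up equals the fractional position of x
    between the two neighbouring grid points, so E[result] == x.
    """
    source = rng if rng is not None else random
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower  # in [0, 1)
    rounded = lower + (1 if source.random() < frac else 0)
    return rounded * step

# Illustrative usage: the average of many roundings approaches the input.
rng = random.Random(0)
samples = [stochastic_round(0.3, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Round-to-nearest would map 0.3 to 0 every time; here roughly 30% of the samples come out as 1, which preserves small values on average across many roundings.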

Tijmen Blankevoort

9 days ago

@Halex623 Good idea, that SpinQuant stuff

0

1

0

0

200

Tijmen Blankevoort

9 days ago

@Tim_Dettmers @suchenzang We can replicate TurboQuant; just works better without the JL stuff and using a Hadamard+Gaussian codebook 😂

0

5

0

1

491
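
The Hadamard step mentioned in the reply above refers to a standard trick: rotating a vector with an orthonormal Hadamard transform spreads a single large outlier evenly across all coordinates, which makes the values easier to quantize with a fixed codebook. A minimal sketch of the rotation only, assuming a power-of-two length; this is not TurboQuant's or the author's implementation, the Gaussian codebook step is omitted, and the name `fwht` is illustrative:

```python
import math

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine pairs of entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    # Scale by 1/sqrt(n) so the transform preserves the L2 norm.
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

# Illustrative usage: one outlier of magnitude 8 in a length-16 vector.
x = [8.0] + [0.0] * 15
y = fwht(x)
```

After the rotation every entry of `y` has magnitude 2 while the overall norm is unchanged, so the dynamic range a quantizer has to cover is four times smaller in this example.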

Tijmen Blankevoort

24 days ago

@CoreAutoAI It turns your deep learning network into a boosting ensemble. Don’t think it’s just an optimizer question.

0

0

0

0

194

Tijmen Blankevoort

about 2 months ago

@hayden_prairie I see we’re back to doing Neural ODEs again with a forward Euler rule.

1

3

0

0

294

Tijmen Blankevoort

about 2 months ago

@didier_lopes @Tim_Dettmers You’ll find that the ‘super weights’ are just very significant weights in the channels causing this large outlier behavior :)

0

1

0

0

26

Tijmen Blankevoort

about 2 months ago

@didier_lopes @Tim_Dettmers Basically, just clipping the large activations is very harmful. If you remove some of the larger weights in the corresponding channels, you similarly reduce the activations, which breaks the model. This happens in any transformer with softmax attention, and it gets worse the longer you train.

2

0

0

0

35

Tijmen Blankevoort

about 2 months ago

@ID_AA_Carmack Yup, it’s basically int8 with a lot of dynamic range. Fp16 will also look a lot better! That’s just fp32 without that much range.

0

0

0

0

242

Tijmen Blankevoort

2 months ago

I wonder if the negative sink weight rejection is more of an optimization issue. In our original paper describing both sinks and gated attention: https://t.co/ZH8P6j10Jm, we also showed how clipping the softmax gets rid of sink behavior. It must be at least part of the explanation?

0

0

0

0

36

Tijmen Blankevoort

2 months ago

@MrCatid @mgostIH NVFP4 is scalar, not vector quant! :o

1

0

0

0

28

Tijmen Blankevoort

2 months ago

@tsengalb99 Also pretty impressive that they miss citing SpinQuant and QuaRot, which both apply rotations specifically for the KV-cache 😂

0

1

0

0

110

TiRune retweeted

Dawid Kopiczko @dawkopi

4 months ago

Why repetition works so well is still an open question. There's a lot to uncover about training dynamics of SFT, and we hope this is a useful data point. Joint work with co-authors @Sagar_Vaze @TiRune @y_m_asano Paper: https://t.co/Jkk1jVPFj5 Code: https://t.co/PoaYWUZbsq

1

19

1

9

1K

TiRune retweeted

Bryan Catanzaro

6 months ago

Today, @NVIDIA is launching the open Nemotron 3 model family, starting with Nano (30B-3A), which pushes the frontier of accuracy and inference efficiency with a novel hybrid SSM Mixture of Experts architecture. Super and Ultra are coming in the next few months.

41

1K

221

399

506K

Tijmen Blankevoort

6 months ago

@chrisbarber @rronak_ @QuantumArjun @MichaelElabd @jonsidd @schwarzjn_ I’m hiring at Nvidia for efficiency, quantization, sparsity and working on the Nemotron models broadly.

0

1

0

0

261

Tijmen Blankevoort

6 months ago

@RomiLifshitz Thanks! Fixed!

1

1

0

0

95

Tijmen Blankevoort

6 months ago

Looking for cracked full-time Deep Learning researchers on Efficiency, Quantization and Sparsity. Join our world-class applied deep learning research team at Nvidia. Our team creates the Nemotron models, and we influence the hardware with our research. Shoot me a message! I'm at NeurIPS!

2

10

1

8

2K

Tijmen Blankevoort

6 months ago

@MinChonChiSF @gu_xiangming @Alibaba_Qwen Known since 2023 btw - https://t.co/OTCvGynbrU <- our outlier paper already used gated attention to get rid of attention-sink behavior.

1

0

0

0

53

Tijmen Blankevoort

6 months ago

@Alibaba_Qwen Congrats!

0

1

0

0

274
