David Grangier

@GrangierDavid

ML research with practical impact.

Joined December 2019

56 Following

443 Followers

36 Posts

David Grangier @GrangierDavid

6 months ago

#NeurIPS2025 Mixing different datasets to train your LLM? ✨ We can help you find the perfect blend! 📈 Few small-model experiments → scaling law fit → your optimal mixture. 🎯 Easy + efficient. Chat with us 💬 Poster #3414. Thu, Dec 4, 11am

Mustafa Shukor @MustafaShukor1

11 months ago

We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow 1/n 🧵

David Grangier @GrangierDavid

about 1 year ago

#ICLR #TrainLLMBetter Tomorrow, #soup of experts, a #hypernetwork conditioned on a simple description of the test distribution: adaptation without retraining (Modularity workshop Sunday). https://t.co/Cc72NyyJpI Still on today... CRISP Importance Sampling for LLM pretraining.

David Grangier @GrangierDavid

about 1 year ago

3/3 Mixture of experts on high latency networks with No Need to Talk https://t.co/sMPj55XdDp (Thu Apr 24, 3 pm). Joint work with @MatPagliardini, @NasFilippova, @PierreAblin, @olivia61368522, Skyler Seto, @angeloskath, Ronan Collobert

David Grangier @GrangierDavid

about 1 year ago

#ICLR #TrainBetterLM I am at ICLR, come to our posters for improved language model training! Recycle gradients for faster neural net training with AdEMAMix https://t.co/eR3r0TSRJH (Fri Apr 25, 10 am). 1/3

David Grangier @GrangierDavid

about 1 year ago

2/3 Importance sampling for better pretraining distribution with CRISP https://t.co/ShxRrGMkDB (Sat Apr 26, 10 am).

David Grangier @GrangierDavid

over 1 year ago

⦿ Efficient, scalable approach on LM and Q&A domains. ⦿ Single & multitask. ⦿ Pretraining & continued pretraining. ⦿ Ablations on data size, model size... https://t.co/k0EMaZiQfN 4/4

David Grangier @GrangierDavid

over 1 year ago

New paper! https://t.co/k0EMaZiQfN Clustered importance sampling to build specialist Language Models (LMs) 🤔 Build a specialist LM with very little specialist data 💡How? Generalist data + efficient, scalable importance sampling w/ @Olivia61368522+SkylerSeto+@PierreAblin 1/4

David Grangier @GrangierDavid

over 1 year ago

🚀Easy with clustered importance sampling: 1️⃣ cluster the generalist dataset, 2️⃣ resample the clusters w/ their prior from tiny specialist data, 3️⃣ Done! 🏁 3/4
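The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: step 1 (clustering) is assumed to have already happened, so each example arrives with a cluster id, and the function names `cluster_importance_weights` and `resample_generalist` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def cluster_importance_weights(generalist_clusters, specialist_clusters):
    """Per-cluster weight w[c] = P_specialist(c) / P_generalist(c).

    Both arguments are lists of cluster ids, one per example (assumed to
    come from the same clustering, e.g. k-means on text embeddings).
    """
    gen_counts = Counter(generalist_clusters)
    spec_counts = Counter(specialist_clusters)
    n_gen, n_spec = len(generalist_clusters), len(specialist_clusters)
    return {
        c: (spec_counts[c] / n_spec) / (gen_counts[c] / n_gen)
        for c in gen_counts
    }


def resample_generalist(examples, generalist_clusters, specialist_clusters,
                        k, seed=0):
    """Draw k generalist examples whose cluster mix follows the specialist data."""
    by_cluster = defaultdict(list)
    for example, c in zip(examples, generalist_clusters):
        by_cluster[c].append(example)

    # Specialist cluster histogram, restricted to clusters that actually
    # contain generalist examples (we can only sample what exists).
    spec_counts = Counter(c for c in specialist_clusters if c in by_cluster)
    clusters = list(spec_counts)
    weights = [spec_counts[c] for c in clusters]

    rng = random.Random(seed)
    drawn = rng.choices(clusters, weights=weights, k=k)
    return [rng.choice(by_cluster[c]) for c in drawn]
```

Sampling a cluster from the specialist histogram and then an example uniformly inside it is equivalent to weighting every generalist example by its cluster's importance weight, so only a cluster histogram has to be estimated from the tiny specialist set.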

David Grangier @GrangierDavid

over 1 year ago

AdEMAMix optimizer for JAX/PyTorch: change one line of code, train your model faster.

Pierre Ablin @PierreAblin

over 1 year ago

🎇Official pytorch/jax implementation of Ademamix🎇 https://t.co/fPQRioY9M0 Drop-in replacement for AdamW, much faster LLM pre-training! 🚀🚀🚀🚀

David Grangier @GrangierDavid

almost 2 years ago

@_arohan_ We do! See Appendix C.1.5, Figures 16 and 17.

David Grangier @GrangierDavid

almost 2 years ago

Faster, better model training by reusing old gradients (>10k steps ago) with negligible extra computation? Count me in. https://t.co/qswyspfkJl

Matteo Pagliardini @MatPagliardini

almost 2 years ago

Stop discarding your old gradients! Introducing AdEMAMix, a novel (first-order) optimizer capable of outperforming Adam. Let’s have a thread on momentum and the surprising relevance of very old gradients. A joint work with @GrangierDavid and @PierreAblin #ml #optimization 1/🧵
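To make the idea of reusing very old gradients concrete, here is a minimal scalar sketch of an Adam-style step that mixes a fast and a slow exponential moving average (EMA) of the gradient. It is an assumption-laden illustration, not the official implementation: the names `ademamix_like_step`, `m_fast`, `m_slow`, `alpha` and `beta3` and all default values are chosen here for illustration, and weight decay and any hyperparameter warmup schedules are omitted.

```python
import math


def init_state():
    """Optimizer state for a single scalar parameter."""
    return {"t": 0, "m_fast": 0.0, "m_slow": 0.0, "v": 0.0}


def ademamix_like_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                       beta3=0.9999, alpha=5.0, eps=1e-8):
    """One update mixing a fast and a slow gradient EMA (illustrative only)."""
    state["t"] += 1
    t = state["t"]

    # Fast EMA: the usual Adam momentum, dominated by recent gradients.
    state["m_fast"] = beta1 * state["m_fast"] + (1 - beta1) * grad
    # Slow EMA: with beta3 close to 1, gradients from thousands of steps
    # ago still carry noticeable weight.
    state["m_slow"] = beta3 * state["m_slow"] + (1 - beta3) * grad
    # Second-moment estimate, as in Adam.
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad

    # Bias-correct the fast EMA and the second moment; the slow EMA is
    # left uncorrected in this sketch.
    m_hat = state["m_fast"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    update = (m_hat + alpha * state["m_slow"]) / (math.sqrt(v_hat) + eps)
    return theta - lr * update
```

For example, minimizing f(x) = x² from x = 1.0 means calling `x = ademamix_like_step(x, 2 * x, state, lr=0.01)` in a loop. The only extra cost over Adam in this sketch is one additional EMA buffer per parameter.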

David Grangier @GrangierDavid

almost 2 years ago

2/2 PN is a high capacity network whose parameters can be linearly projected into a small network. This strategy enables both high capacity and efficient inference. See details at our poster on Friday morning and afternoon. https://t.co/wdtXz9n3yb https://t.co/q4v86N4Wjq

David Grangier @GrangierDavid

almost 2 years ago

At ICML? Learn about our efficient projected language models! Adding capacity to a traditional language model improves accuracy but increases inference cost. How to avoid this? We propose a novel architecture, projected networks (PN).

David Grangier @GrangierDavid

over 2 years ago

With Angelos Katharopoulos, Pierre Ablin, Awni Hannun.

David Grangier @GrangierDavid

over 2 years ago

New language model work! In practice, LMs often face a double constraint (i) small inference budget + (ii) little application-specific data: (i) means small specialized models for inference; (ii) means using auxiliary generic data e.g. for pretraining 1/2 https://t.co/E7MrinEcLq

David Grangier @GrangierDavid

over 2 years ago

2/2 Findings: when the application-specific training budget is large, importance sampling is great. Otherwise, asymmetric models (big at train, small at inference e.g. mixture of experts or hyper-networks) are attractive, better than the popular distillation strategy.

David Grangier @GrangierDavid

over 2 years ago

Our analysis proposes a simple test to check if our method applies to your problem. Chat with us at our poster at #neurips2023 DistShift workshop next week. Joint work with Pierre Ablin, Awni Hannun. (3/3)

David Grangier @GrangierDavid

over 2 years ago

Efficient bilevel algorithm for training data selection https://t.co/dGWDOin2BJ #bilevel #data_selection #DomainAdaptation #distshift #llm #NeurIPS2023 Online algorithm for filtering large (pre)training sets with maximal impact on the targeted task. (1/3)

David Grangier @GrangierDavid

over 2 years ago

Large models are often trained on massive web datasets and a bit of target-task data. In this setup, it is 👍 to spend more train effort on specific parts of the large set. Our online algorithm maintains an auxiliary cheap filter model when training the large model. (2/3)
