Jacob Springer @jacspringer - Twitter Profile

Pinned Tweet

about 1 year ago

Training with more data = better LLMs, right? 🚨 False! Scaling language models by adding more pre-training data can decrease your performance after post-training! Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇 1/9

17

809

173

637

165K

jacspringer retweeted

Jingchu Gai @Jingchug

6 days ago

1/6New paper! 🧵 We find that LLMs often commit to an answer BEFORE they finish reasoning — the rest of the CoT is just post-hoc rationalization. We call this "premature confidence." 📄 https://t.co/WfRJI1AtBL

7

104

19

75

11K

jacspringer retweeted

Gaurav Ghosal @gaurav_ghosal

15 days ago

Had a great time working on this project exploring how to proactively prevent forgetting of capabilities during subsequent training! All credit goes to @lawrencefeng17 for leading it so skillfully!

1

13

3

4

2K

jacspringer retweeted

Lawrence Feng @lawrencefeng17

15 days ago

1/ To retain post-training capabilities after further fine-tuning, mix that data into pretraining. The effect can be invisible until fine-tuning begins; early exposure may not help post-training performance, but it changes what persists. How a model learns a task matters.

6

86

24

56

27K

jacspringer retweeted

Aditi Raghunathan @AdtRaghunathan

22 days ago

It's one of the first lessons in ML: the model with the lowest train loss isn't the one that generalizes best. Pretraining made that easy to forget. You train for one epoch over trillions of tokens, there's no traditional overfitting, and pretrain loss starts to feel like the whole story. Our paper argues it isn't. The lowest-loss model isn't the best starting point for post-training. An old sharp-vs-flat lesson, back in a new regime.

2

143

7

120

21K

Jacob Springer @jacspringer

26 days ago

I also hope our work helps the open source model development community pre-train better models that are easier to fine-tune; would love to see some of this implemented in Marin @percyliang, OLMo @natolambert, or SmolLM @eliebakouch

0

4

0

1

134

Jacob Springer @jacspringer

26 days ago

Just released a new pretraining paper with some interesting takeaways:
- sharpness minimization is important but it doesn’t show its benefit until *after* you post-train
- increase your learning rate!! (this is free!)

read Ishaan’s thread but I’ll also add my 2 cents below 1/n

Ishaan Watts @IshaanWatts18

26 days ago

Spending billions to train the "best" base model? You might be optimizing the wrong thing! 🎯

We show that controlling sharpness during mid-training leads to over 35% less forgetting after fine-tuning / quantization... even when the base model itself gets worse.

🧵 Takeaways for pretraining:
- Use SAM (Sharpness-Aware-Minimization) in the final steps (~10%)
- Try much higher learning rates (yes, even ~10× larger)

1/9

31

618

91

440

590K

2

38

9

14

6K
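
For readers who have not seen SAM before, here is a minimal sketch of the first takeaway in the thread above: plain gradient steps for most of training, then sharpness-aware steps for roughly the last 10%. It is not the paper's implementation. The function names (`sam_step`, `sgd_step`, `uses_sam`, `train`), the toy quadratic loss, and the values of `lr` and `rho` are all assumptions chosen for illustration.

```python
import math


def sgd_step(w, grad_fn, lr):
    """Plain gradient descent step on a list of floats."""
    g = grad_fn(w)
    return [wi - lr * gi for wi, gi in zip(w, g)]


def sam_step(w, grad_fn, lr, rho):
    """One sharpness-aware step.

    1. Move to w + rho * g / ||g|| (the locally worst-case neighbour).
    2. Take the gradient there and apply it to the original weights.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point, nothing to do
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def uses_sam(step, total_steps, sam_fraction=0.1):
    """True for the final `sam_fraction` of steps (assumed ~10%)."""
    sam_steps = round(total_steps * sam_fraction)
    return step >= total_steps - sam_steps


def train(w, grad_fn, total_steps, lr=0.1, rho=0.05, sam_fraction=0.1):
    """Hypothetical schedule: SGD first, SAM for the last few steps."""
    for step in range(total_steps):
        if uses_sam(step, total_steps, sam_fraction):
            w = sam_step(w, grad_fn, lr, rho)
        else:
            w = sgd_step(w, grad_fn, lr)
    return w


def toy_grad(w):
    """Gradient of the toy loss x^2 + y^2 (an illustrative stand-in)."""
    return [2.0 * wi for wi in w]


if __name__ == "__main__":
    print(train([1.0, -1.0], toy_grad, total_steps=100))
```

In a real pretraining run the same idea would wrap an existing optimizer, and each SAM step costs two gradient evaluations, which is one reason to restrict it to the final portion of training.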

Jacob Springer @jacspringer

26 days ago

But I'm excited to see if we can do better. I would love to see a nanoGPT speedrun benchmark that evaluates models based on how well they can be post-trained. I suspect we'll learn that a lot of the optimization lessons we think we know end up being (at least subtly) wrong.

1

2

0

125

Jacob Springer @jacspringer

about 1 month ago

RT @IshaanWatts18: Obrigado Brazil! 🇧🇷 Had an incredible time at @iclr_conf talking about our work on pretraining optimization. I also had…

0

1

0

41

jacspringer retweeted

Konwoo Kim @konwookim

3 months ago

for data-constrained pre-training, synth data isn’t just benchmaxxing, it lowers loss on the real data distribution as we generate more tokens

for even better scaling, treat synth gens as forming one long 𝗺𝗲𝗴𝗮𝗱𝗼𝗰: 1.8x data efficiency with larger gains under more compute

8

369

59

272

101K

jacspringer retweeted

Christina Baek @_christinabaek

3 months ago

Models are typically specialized to new domains by finetuning on small, high-quality datasets. We find that repeating the same dataset 10–50× starting from pretraining leads to substantially better downstream performance, in some cases outperforming larger models. 🧵

19

612

80

520

94K

jacspringer retweeted

Vaibhav Adlakha @vaibhav_adlakha

3 months ago

Your LLM already knows the answer. Why is your embedding model still encoding the question? 🚨Introducing LLM2Vec-Gen: your frozen LLM generates the answer's embedding in a single forward pass — without ever generating the answer. Not only that, the frozen LLM can decode the embedding back into text. 🏆 SOTA self-supervised embeddings 🛡️ Free transfer of instruction-following, safety, and reasoning

5

193

37

121

50K

jacspringer retweeted

Suhas Kotha @kothasuhas

3 months ago

to improve fine-tuning data efficiency, replay generic pre-training data

not only does this reduce forgetting, it actually improves performance on the fine-tuning domain! especially when fine-tuning data is scarce in pre-training (w/ @percyliang)

15

495

64

376

73K
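
The replay idea in the tweet above can be sketched as a data-mixing step: while fine-tuning, a fraction of examples is drawn from generic pre-training data instead of the fine-tuning set. This is only an illustration under stated assumptions. The function name `mixed_stream`, the `replay_fraction` default of 0.25, and the per-example sampling scheme are hypothetical and are not taken from the paper.

```python
import random


def mixed_stream(finetune_data, pretrain_data, num_examples,
                 replay_fraction=0.25, seed=0):
    """Build a fine-tuning stream with replayed pre-training examples.

    Each example is drawn from `pretrain_data` with probability
    `replay_fraction` (an assumed value), otherwise from `finetune_data`.
    """
    rng = random.Random(seed)  # seeded for reproducible mixing
    stream = []
    for _ in range(num_examples):
        if rng.random() < replay_fraction:
            source = pretrain_data
        else:
            source = finetune_data
        stream.append(rng.choice(source))
    return stream


if __name__ == "__main__":
    finetune = ["ft_a", "ft_b"]
    pretrain = ["pt_a", "pt_b", "pt_c"]
    print(mixed_stream(finetune, pretrain, num_examples=8))
```

Sampling per example keeps the mix roughly constant throughout fine-tuning; a batch-level mix would work as well and is a matter of pipeline convenience.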

Jacob Springer @jacspringer

3 months ago

the rank of llm representations / weights has recently been such a hot topic, with multiple papers arguing that rank is a good predictor of performance

it turns out, our paper shows it's mainly hyperparameters that determine the rank!

read Atharva's thread ↓

Atharva Kulkarni @athrvkk

3 months ago

Is the geometry of language model weights really predictive of performance?🔍

Our new work challenges the popular hypothesis that low rank unembedding matrix hurts LLM performance; and the answer is more complicated than you'd think!

https://t.co/0YxcZmNpfb

(1/8)

1

34

6

26

3K

0

4

1

0

154

jacspringer retweeted

Ziqian Zhong @fjzzq2002

3 months ago

🔭 We’re releasing Hodoscope: an open-source tool for unsupervised behavior discovery. It lets you visually explore and compare agent behaviors at scale. It helped us discover a novel reward hacking vulnerability in Commit0 - with just a couple minutes of human effort.

28

1K

154

1K

74K

jacspringer retweeted

Fahim Tajwar @FahimTajwar10

4 months ago

Are we done with new RL algorithms? Turns out we might have been optimizing the wrong objective. Introducing MaxRL, a framework to bring maximum likelihood optimization to RL settings. Paper + code + project website: https://t.co/j9BCBF7K3R 🧵 1/n

14

798

160

726

208K

jacspringer retweeted

Yuda Song @yus167

4 months ago

RL on LLMs inefficiently uses one scalar per rollout. But users regularly give much richer feedback: "make it formal," "step 3 is wrong."

Can we train LLMs on this human-AI interaction?

We introduce RL from Text Feedback, with 1) Self-Distillation; 2) Feedback Modeling (1/n) 🧵

14

595

102

494

107K

jacspringer retweeted

Vaishnavh Nagarajan @_vaishnavh

5 months ago

1/ We found that deep sequence models memorize atomic facts "geometrically" -- not as an associative lookup table as often imagined. This opens up practical questions on reasoning/memory/discovery, and also poses a theoretical "memorization puzzle."

58

1K

244

1K

92K
