Martin Marek

9 days ago

New paper! "Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay"

9 days ago

How much does a language model forget when finetuned on new tasks? We show both model size and optimization matter and forgetting can be nearly eliminated with self-generated replay! https://t.co/Qs9A4n095s w/@mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov 1/8

1 day ago

We tried self-generating instruction data, but it didn't seem to work, at least in our brief testing. Llama-3.2-1B-Instruct can generate something that looks like a user prompt, but it never generated the end-of-turn token. I think the reason is that models are usually trained only on the responses, not on the prompts. I believe we also tried using chat datasets (not self-generated), but I don't remember how well it worked. Naively, I would expect chat datasets to be smaller and less diverse than pretraining datasets, but the format is also different, so there might be cases where instruction data is preferable to pretraining data. We haven't really explored this, but I think it could be interesting.

3 days ago

@willccbb Is OPSD as prone to numeric issues as RL?

5 days ago

@xidulu @norxornor Agreed https://t.co/sVpzkIBxfk

7 months ago

How should we scale Adam’s hparams with batch size? I had some spare TPUs available so I remastered Figure 4 from our paper on batch size at a higher resolution. Using a 30M language model, we find a constant β₂ half-life (10M tokens) to be optimum across batch sizes.
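A minimal sketch of what a constant β₂ half-life in tokens implies for β₂ at each batch size. It assumes "half-life" means the number of tokens after which a gradient's weight in Adam's second-moment average has decayed to one half, so β₂ raised to the number of steps in one half-life equals 0.5. The function name, the sequence length, and the batch sizes below are illustrative assumptions, not values from the paper.

```python
def beta2_for_half_life(half_life_tokens: float, batch_size: int, seq_len: int) -> float:
    """Return beta2 such that beta2 ** steps == 0.5 after half_life_tokens tokens."""
    tokens_per_step = batch_size * seq_len
    # steps per half-life = half_life_tokens / tokens_per_step, so
    # beta2 = 0.5 ** (1 / steps) = 0.5 ** (tokens_per_step / half_life_tokens)
    return 0.5 ** (tokens_per_step / half_life_tokens)


HALF_LIFE_TOKENS = 10_000_000  # the 10M-token half-life quoted in the post
SEQ_LEN = 1024                 # assumption: tokens per sequence

for batch_size in (1, 16, 256, 4096):  # assumption: example batch sizes
    beta2 = beta2_for_half_life(HALF_LIFE_TOKENS, batch_size, SEQ_LEN)
    print(f"batch_size={batch_size:5d}  beta2={beta2:.6f}")
```

Under this reading, smaller batches need β₂ closer to 1, because each step covers fewer tokens and the average has to span more steps to cover the same token horizon.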

7 days ago

@cuneytgurcan @andrewgwils @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov There's definitely a lot going on in Figure 1. Left plot is pretraining trajectories, starting from a fixed random initialization. Right plot is finetuning trajectories starting from two different checkpoints: one that's Chinchilla-pretrained and one that's overtrained.

8 days ago

@elonmusk Assuming this rewrite reaches a very impressive 50% MFU, does that mean the current training stack is only getting 5% MFU?

8 days ago

@garybasin @Pavel_Izmailov @AtakanTekparmak We pretrained small models on 10 to 17,000 tokens per parameter (TPP). For reference Qwen3-0.6B is 60,000 TPP whereas Qwen3-235B-A22B is only 153 TPP. While we haven’t tested models at this large scale, we would certainly expect a large difference in spare capacity.
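For reference, tokens per parameter is just training tokens divided by parameter count. The sketch below assumes a pretraining budget of roughly 36T tokens for both Qwen3 models; that figure is not stated in the post, but it reproduces the two ratios quoted there. The function name is illustrative.

```python
def tokens_per_parameter(train_tokens: float, n_params: float) -> float:
    """Training tokens divided by total parameter count."""
    return train_tokens / n_params


QWEN3_TRAIN_TOKENS = 36e12  # assumption: ~36T pretraining tokens

for name, n_params in (("Qwen3-0.6B", 0.6e9), ("Qwen3-235B-A22B", 235e9)):
    tpp = tokens_per_parameter(QWEN3_TRAIN_TOKENS, n_params)
    print(f"{name}: {tpp:,.0f} TPP")
```

Note that the 235B figure counts total parameters; dividing by the 22B active parameters instead would give a much larger ratio.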

8 days ago

@NicolasZucchet @andrewgwils @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Interesting! We actually also tested finetuning on @ZeyuanAllenZhu's synthetic biographies. But we only tested finetuning a pretrained Qwen on this data and looking at general capabilities (@karpathy's CORE eval). We never tested (synthetic → synthetic).

8 days ago

@Farfan__ @andrewgwils @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Figure 4 is a 205M model pretrained on 30B tokens; Figure 9 is Llama-3.2-1B. But yes, we generally study small models. Forgetting is most severe when the model is small and trained for a long time. We run sweeps over 100B-token finetuning jobs, which gets expensive.

8 days ago

@_Suresh2 @andrewgwils @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Have you tried sampling from the frozen base model, not the updated model?

mrtnm retweeted

Pavel Izmailov

@Pavel_Izmailov

9 days ago

New paper: https://t.co/LGbYhYytbt The main idea is that we can use an LLM to generate its own replay data to prevent forgetting, as long as we have spare capacity. Very overtrained models have to forget to learn new information.

9 days ago

@bspectacledGOAT @andrewgwils @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Exactly! But if you have access to the pretraining data or data that is very close in distribution, it might be more practical to use that data rather than sampling. (Sampling is somewhat expensive and can be tricky to implement)

9 days ago

@atu_tej @andrewgwils @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Seems to be the same basic idea! One difference is that this paper uses only NTP loss for the replay data(?) We find KL to work slightly better on self-generated data and much better on off-policy data. And we also relate the regularization strength to model capacity
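To make the NTP-versus-KL distinction concrete, here is a minimal pure-Python sketch of the two losses at a single token position. NTP loss only uses the sampled token as a hard target, while the KL loss matches the finetuned model's whole next-token distribution to the frozen base model's. The KL direction (base to finetuned), the function names, and the toy logits are assumptions for illustration; the post does not specify them.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def ntp_loss(student_logits, target_id):
    """Next-token-prediction loss: cross-entropy against one sampled token."""
    return -log_softmax(student_logits)[target_id]


def kl_loss(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary at one position.

    Assumption: teacher = frozen base model, student = model being finetuned.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


base_logits = [2.0, 1.0, 0.0, -1.0]      # toy frozen-model logits
finetuned_logits = [1.0, 1.5, 0.0, -1.0]  # toy finetuned-model logits

print("NTP loss:", ntp_loss(finetuned_logits, target_id=0))
print("KL loss: ", kl_loss(base_logits, finetuned_logits))
```

The KL loss is zero exactly when the two distributions agree, whereas the NTP loss stays positive even for a model that has not drifted at all.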

9 days ago

@ThinkDi92468945 @andrewgwils @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov We find that models with spare capacity are easier to finetune but we haven't tried expanding the capacity of a model. I think that would be an interesting experiment, although the resulting model might be impractical in terms of inference serving

9 days ago

@ThinkDi92468945 @andrewgwils @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov Right, that's why we included a Llama experiment https://t.co/9KddsBTuI5

9 days ago

We can even generate replay data from an instruction-tuned LLM. For example, when finetuning Llama-3.2-1B, we can prompt the model with a BOS token (without a chat template) and generate pretraining-like data. With a KL penalty, this data significantly reduces forgetting. 4/8

9 days ago

@nrol_ling @andrewgwils @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov There's no filtering of the rollouts whatsoever. We use vanilla T=1 sampling. The only thing we do control is the distribution of the first token (we don't use a generic BOS token).
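As a toy illustration of unfiltered T=1 sampling with a controlled first token, the sketch below draws the first token from an explicit distribution and then samples each next token directly from the model's probabilities, with no temperature scaling, top-k, top-p, or rejection step. The bigram table, the first-token distribution, and all names are hypothetical stand-ins; how the first-token distribution is actually chosen is not described in the post.

```python
import random

EOS = "<eos>"

# Hypothetical toy "model": next-token probabilities given the previous token.
BIGRAM = {
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 0.6, EOS: 0.4},
    "dog": {"sat": 0.2, EOS: 0.8},
    "sat": {EOS: 1.0},
}

# Hypothetical first-token distribution, set explicitly instead of
# conditioning on a generic BOS token.
FIRST_TOKEN_PROBS = {"the": 0.6, "a": 0.4}


def sample_from(probs, rng):
    """Draw one token in proportion to its probability (temperature 1)."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def sample_rollout(first_token_probs, bigram, max_len, rng):
    """Ancestral sampling: no filtering, truncation, or reranking."""
    seq = [sample_from(first_token_probs, rng)]
    while len(seq) < max_len and seq[-1] != EOS:
        seq.append(sample_from(bigram[seq[-1]], rng))
    return seq


rng = random.Random(0)
rollouts = [sample_rollout(FIRST_TOKEN_PROBS, BIGRAM, max_len=8, rng=rng) for _ in range(5)]
for rollout in rollouts:
    print(" ".join(rollout))
```

With a real language model the bigram lookup would be replaced by a forward pass, but the sampling loop keeps the same shape.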

9 days ago

@difficultyang Likely to be the case https://t.co/aGPlrgWIzP

9 days ago

When does forgetting still happen? When the model has no spare capacity. Small models trained to saturation cannot absorb new information without overwriting old information. 5/8

9 days ago

@fleetwood___ https://t.co/vHSJGki0T8