Dawid Kopiczko

@dawkopi

PhD in progress

Joined April 2020

486 Following

137 Followers

43 Posts

Pinned Tweet

Dawid Kopiczko @dawkopi

4 months ago

Why repetition works so well is still an open question. There's a lot to uncover about training dynamics of SFT, and we hope this is a useful data point. Joint work with co-authors @Sagar_Vaze @TiRune @y_m_asano Paper: https://t.co/Jkk1jVPFj5 Code: https://t.co/PoaYWUZbsq

dawkopi retweeted

Yuki @y_m_asano

about 1 month ago

🎉[openings] I’m hiring postdoctoral researchers to join our @FunAILab at UTN through the Alexander von Humboldt Research Fellowship (@AvHStiftung), via the Henriette Herz Scouting Programme.

As a Henriette Herz Scout, I can nominate outstanding international researchers for this fellowship route. I’m especially keen to hear from candidates working on multimodal learning, video and image pretraining, and post-training.

Fellows would be hosted in our lab at UTN and work closely with us on these topics.

Key requirements:
* finished your doctoral studies less than 4 years ago or will finish in the next 6 months
* did not live/work in Germany in the last 10 years
* applications from female, trans* and/or non-binary candidates are highly encouraged!

Interested? Please send a short note with your CV, PhD year, current affiliation, 2–3 key publications, and a few lines on how your work connects.

Please share! 🔀

Dawid Kopiczko @dawkopi

about 1 month ago

@elder_plinius https://t.co/nKJTladZA3

Dawid Kopiczko @dawkopi

about 1 month ago

@yacinelearning not a question but relevant -- you can directly ask gpt5.5 what it means by "goblin";
it's more or less an incentive or "sub-agent" https://t.co/4sDjUBI8Gd

Dawid Kopiczko @dawkopi

about 1 month ago

@yacinelearning that definition makes sense if you look at examples reported by others: https://t.co/71bRAb31YI https://t.co/rnofvjyYCF

Tara Viswanathan

@TaraViswanathan

about 1 month ago

@arb8020 !!!!! I was wondering why my claw suddenly became a goblin with codex 5.5 😭💀😂 https://t.co/AACWtNcgQl

Dawid Kopiczko @dawkopi

4 months ago

@DimitrisPapail added all 45 checkpoints of Olmo3-7B here: https://t.co/W2OnUGWMjz

Dawid Kopiczko @dawkopi

4 months ago

@DimitrisPapail (I'm uploading all ckpts to HF, so it will be possible to play with models trained with different epoch-to-sample ratios)

Dawid Kopiczko @dawkopi

4 months ago

@ChinmayKak actually something similar was observed by @AlexGDimakis when working on the OpenThoughts dataset; sampling multiple trajectories for the same prompt, instead of drawing more unique prompts, led to better results https://t.co/q85EmybXSI

@AlexGDimakis

6 months ago

The multiple answers mystery is the most surprising thing we stumbled on from OpenThoughts:
Sampling multiple answers for the same question is better than having more questions, each answered once.

To explain: Say you are creating a dataset of questions and answers to SFT a reasoning LLM.

You can take 1000 questions (e.g. from stackexchange) and answer them with deepseekR1.
Or you can take 500 questions (from the same distribution) and answer each question *twice* independently with deepseekR1.

Which one is a better dataset? Surprisingly, if you re-answer the same questions, it’s a better dataset for distillation (at the same size) and this was a robust finding from OpenThoughts across models and data sources.

We have no theoretical understanding why, and no way to predict how many times to repeat. Clearly it must stop at some point (taking one question and answering it 1000 times won’t make a good SFT dataset) but we don’t know how to predict this, beyond empirically trying.
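
The comparison described above (same dataset size, fewer questions, more answers per question) could be set up roughly like this. This is only a sketch: `build_sft_dataset` and `answer_fn` are hypothetical names, and `answer_fn` stands in for whatever teacher model produces the answers; none of this is taken from the OpenThoughts code.

```python
import random


def build_sft_dataset(questions, answer_fn, total_size, answers_per_question, seed=0):
    """Build (question, answer) pairs of a fixed total size.

    answer_fn(question, sample_index) is a hypothetical stand-in for
    sampling one answer from a teacher model.
    """
    if total_size % answers_per_question != 0:
        raise ValueError("total_size must be divisible by answers_per_question")
    rng = random.Random(seed)
    # Fewer unique questions are drawn when each one is answered several times.
    n_questions = total_size // answers_per_question
    chosen = rng.sample(questions, n_questions)
    return [
        (question, answer_fn(question, i))
        for question in chosen
        for i in range(answers_per_question)
    ]


if __name__ == "__main__":
    pool = [f"question {i}" for i in range(1000)]
    fake_teacher = lambda q, i: f"sampled answer #{i} for {q}"  # placeholder only
    once = build_sft_dataset(pool, fake_teacher, total_size=1000, answers_per_question=1)
    twice = build_sft_dataset(pool, fake_teacher, total_size=1000, answers_per_question=2)
    print(len(once), len({q for q, _ in once}))
    print(len(twice), len({q for q, _ in twice}))
```

Both datasets have the same number of rows; only the number of unique questions differs, which is the variable the tweet is about.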

Dawid Kopiczko @dawkopi

4 months ago

Common knowledge in ML: more unique training data → better generalization. Turns out this doesn't hold for long-CoT SFT.
Under a fixed update budget, repeating a small dataset multiple times beats training on more unique samples. And it's not even close. https://t.co/5GFlhoawwv
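
A minimal sketch of what a fixed update budget implies, assuming a constant batch size so that the number of updates is proportional to epochs times unique samples. The helper name `equal_budget_configs` and the budget of 51,200 samples are illustrative assumptions (the budget is chosen to line up with the "16 epochs on a random subset of 3.2K samples" and "one epoch on over 51K samples" settings mentioned in this thread), not values taken from the paper.

```python
def equal_budget_configs(total_samples_seen, epoch_options):
    """Return (epochs, unique_samples) pairs that all see the same number of samples.

    With a fixed batch size, equal samples seen means equal gradient updates.
    """
    configs = []
    for epochs in epoch_options:
        # Skip epoch counts that do not divide the budget evenly.
        if total_samples_seen % epochs == 0:
            configs.append((epochs, total_samples_seen // epochs))
    return configs


if __name__ == "__main__":
    BUDGET = 51_200  # assumed budget, for illustration only
    for epochs, n_unique in equal_budget_configs(BUDGET, [1, 2, 4, 8, 16, 32]):
        print(f"{epochs:>2} epochs x {n_unique:>6} unique samples")
```

Every row costs the same amount of training; the claim in the tweet is that the rows with more epochs and fewer unique samples end up generalizing better.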

Dawid Kopiczko @dawkopi

4 months ago

@ChinmayKak @Sagar_Vaze @TiRune @y_m_asano there's this work on data-constrained pretraining (https://t.co/roNjmBIgpo), where they show that multiple epochs can substitute for unique data *up to a few epochs*, while more epochs slow down convergence, at least as measured by val loss

Dawid Kopiczko @dawkopi

4 months ago

@ysu_ChatData total tokens seen during training is more or less the same within each update budget; as we report in the paper -- there are "standard" overfitting signs like train set memorization and rising val loss, but the model generalizes well nevertheless

Dawid Kopiczko @dawkopi

4 months ago

@yacinelearning as RAFT is basically SFT on filtered (on-policy) data, you might find this phenomenon interesting: https://t.co/d6fIEkb1Me

Dawid Kopiczko @dawkopi

4 months ago

(too late to edit, config matching mentioned results is: "16 epochs on a random subset of 3.2K* samples")

16 epochs on 400 samples yields:
83% -- AIME'24,
63% -- AIME'25,
66% -- GPQA.
https://t.co/YRZNoOIMQL

Dawid Kopiczko @dawkopi

4 months ago

When training Olmo-3 7B on Dolci SFT dataset, 16 epochs on a random subset of 400 samples leads to:
80% (pass@16) on AIME'24,
63% on AIME'25,
62% on GPQA;
while one epoch on over 51K samples yields:
47% -- AIME'24,
50% -- AIME'25,
24% -- GPQA.
https://t.co/eAURo1mynK

Dawid Kopiczko @dawkopi

4 months ago

So when does repetition stop helping?
Turns out token accuracy on training dataset is a pretty reliable signal for this. Once the model hits ~100% token accuracy on the training set, additional epochs don't bring further gains, which makes it a nice practical stopping criterion. https://t.co/N1LGasSj4w
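
One way the stopping signal above could be checked, as a rough sketch. It assumes you already have teacher-forced argmax predictions and target token ids for each training sequence; `token_accuracy`, `should_stop`, the `-100` ignore id for masked (e.g. prompt) positions, and the 0.99 threshold standing in for "~100%" are all illustrative assumptions, not the paper's code.

```python
def token_accuracy(predicted_ids, target_ids, ignore_id=-100):
    """Fraction of non-ignored target tokens that the model predicts exactly."""
    correct = 0
    counted = 0
    for pred_seq, tgt_seq in zip(predicted_ids, target_ids):
        for pred, tgt in zip(pred_seq, tgt_seq):
            if tgt == ignore_id:
                continue  # masked position, not part of the training loss
            counted += 1
            correct += int(pred == tgt)
    return correct / counted if counted else 0.0


def should_stop(accuracy, threshold=0.99):
    """Stop adding epochs once training-set token accuracy is near 100%."""
    return accuracy >= threshold


if __name__ == "__main__":
    preds = [[5, 7, 9], [1, 2, 3]]
    targets = [[-100, 7, 9], [1, 2, 4]]
    acc = token_accuracy(preds, targets)
    print(acc, should_stop(acc))
```

In practice this would be evaluated on the training subset after each epoch, and training would end at the first epoch where `should_stop` returns true.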

Dawid Kopiczko @dawkopi

4 months ago

@DimitrisPapail there were *signs* that the phenomenon exists tho -- many tech reports mention multiple epochs in the SFT stage, or e.g. this paper (https://t.co/MhYwJ1kUkR) training for 15 epochs on 800 samples; but they focus on data quality, and do not ablate epochs

Dawid Kopiczko @dawkopi

4 months ago

@DimitrisPapail yup, for 7-8B models and this dataset it seems optimal; but for example 4B model gets saturated around 4-8; it's either implicitly due to smaller model, or explicitly due to larger optimal learning rate (3e-5 vs 2e-5 for 7-8B models)

Dawid Kopiczko @dawkopi

4 months ago

@DimitrisPapail yeah, it looks like standard overfitting as val loss goes up, while train loss goes to 0 -- but the model generalizes well; we can train on 200 generic conversation samples which demonstrate reasoning patterns, and the model starts solving ~40% of AIME'25 problems
