Anton Schäfer @antonschafer - Twitter Profile

antonschafer retweeted

Anthropic

@AnthropicAI

4 months ago

A statement on the comments from Secretary of War Pete Hegseth. https://t.co/Gg7Zb09IMR

3K

42K

6K

5K

18M

antonschafer retweeted

Alex Warstadt @a_stadt

over 1 year ago

I'm excited to announce my new lab: UCSD's Learning Meaning and Natural Language Lab. a.k.a. LeM🍋N Lab! And 📢WE ARE RECRUITING📢 PhD students to join us in sunny San Diego in either Linguistics OR Data Science. Apply by Dec 4: https://t.co/gCYN8eMk4A More about the lab👇

12

442

76

143

39K

Anton Schäfer @antonschafer

about 2 years ago

Did you know that a constant learning rate followed by a cooldown works just as well as a cosine schedule? Also, it allows for training your LMs without fixing the number of steps beforehand, enabling cheaper scaling laws. Check out @haeggee’s new paper! https://t.co/LHwUAkCF0A

Alex Hägele @haeggee

about 2 years ago

Why exactly do we train LLMs with the cosine schedule, still?🤔 Maybe we do not actually have to -- and that would come with a lot of benefits :) 🧵Our paper on LR schedules, compute-optimality and more affordable scaling laws

2

121

24

91

39K

0

7

0

1

424
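To make the comparison in the tweet above concrete, here is a minimal sketch of the two schedules. The function names, the linear cooldown shape, and the omission of warmup are assumptions for illustration; the tweet does not specify them, and the linked paper may use a different cooldown form.

```python
import math


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr.

    total_steps must be fixed before training starts, because every
    learning rate along the way depends on it.
    """
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def constant_then_cooldown_lr(step, cooldown_start, cooldown_steps, peak_lr, min_lr=0.0):
    """Constant learning rate followed by a cooldown to min_lr.

    The cooldown is linear here (an assumed shape). Before cooldown_start
    the schedule does not depend on the total training length.
    """
    if step < cooldown_start:
        return peak_lr
    progress = min((step - cooldown_start) / cooldown_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    peak = 1e-3
    for step in (0, 400, 800, 900, 1000):
        print(step, cosine_lr(step, 1000, peak), constant_then_cooldown_lr(step, 800, 200, peak))
```

Under these assumptions, a run can be stopped at several lengths by branching a short cooldown off one constant-rate trunk, whereas the cosine schedule would need a separate run per length. That is one way to read the "cheaper scaling laws" remark.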

Anton Schäfer @antonschafer

about 2 years ago

@wendlerch @ravfogel @tpimentelms @ImanolSchlag Yes! Sharing models and code as soon as I get around to cleaning it up, probably next week

1

2

0

86

Anton Schäfer @antonschafer

about 2 years ago

LLMs can do amazing things these days—not only in their main language (English?), but also in other ones! Our paper identifies a surprising *potential* reason why: language imbalance! (see caveats in 🧵!) https://t.co/ToAb1L5HdO + @ravfogel T. Hofmann @tpimentelms @ImanolSchlag

5

122

25

58

22K

Anton Schäfer @antonschafer

about 2 years ago

@ManuelFaysse Although we couldn’t test at larger scales where semantic content might play a larger role. Interesting to see your comparisons with TinyLlama here! Nice work 🥐

0

27

Anton Schäfer @antonschafer

about 2 years ago

@ManuelFaysse Agree that 50/50 EN/FR still seems like a good choice in this context, especially for French knowledge! Note that in our experiments benefits of imbalance were less clear for real languages: https://t.co/JqpKpdnZlq

Anton Schäfer @antonschafer

about 2 years ago

When investigating real languages, we still see lower-resource languages benefit from the main language. Yet whether imbalance itself causes better generalization is less clear. Benefits diminish with longer training, and hidden states are not more aligned in the imbalanced setting.

1

6

1

0

539

1

0

49

Anton Schäfer @antonschafer

about 2 years ago

@wendlerch @ravfogel @tpimentelms @ImanolSchlag Yes: - For cloned languages (as in the plot), 90/10 is equivalent to 10/90 as the languages are equivalent. - For EN and FR, we experiment with imbalances in both directions (Fig 3). The results are generally very symmetric. We just focus on the EN>FR direction to avoid clutter.

1

4

0

188

Anton Schäfer @antonschafer

about 2 years ago

@annwitbrock @Dorialexander @alexjc These trends are not as clear for real languages https://t.co/JqpKpdnZlq Check out the paper for details! We also have results on the impact of vocabulary overlap on generalization. This might be interesting in the context of languages with different scripts

0

2

0

48

Anton Schäfer @antonschafer

about 2 years ago

Overall, our results suggest an interesting feature of LM training dynamics: in some settings, having a dominant main language can aid sharing of model components across languages. Yet, leveraging such benefits in real multilingual settings isn’t as straightforward as we’d like.

0

6

0

353

Anton Schäfer @antonschafer

about 2 years ago

Does this mean we can improve performance by e.g. injecting character-level information? It’s not that easy: We find that naturally occurring near duplicates may not be as similar as anticipated. This limits the potential for performance improvements. Check our paper for details!

0

5

0

395

Anton Schäfer @antonschafer

about 2 years ago

Did you know that most LLMs’ vocabularies contain around 40% near-duplicate entries? Check out our new work to learn more about how this may affect your model’s training efficiency! https://t.co/3k2vjLHuKq (details in thread) with T. Hofmann @ImanolSchlag @tpimentelms

2

81

11

38

13K

Anton Schäfer @antonschafer

about 2 years ago

We first investigate perfectly equivalent duplicates. While the model learns to align their representations, we find that duplication consistently leads to lower training efficiency. With a 40% duplicate rate (the typical LLM rate), we get a 10% decrease in data efficiency!

1

6

0

1

445
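As a rough illustration of what a near-duplicate rate could mean, the sketch below groups vocabulary entries that differ only in capitalization or a leading whitespace marker. This definition, the "▁" marker, the counting convention (every entry beyond the first in a group counts as a duplicate), and the toy vocabulary are all assumptions; the tweets do not say how the paper measures the 40% figure.

```python
from collections import defaultdict


def near_duplicate_rate(vocab, space_marker="▁"):
    """Return (rate, groups) for a list of token strings.

    Tokens are treated as near duplicates if they match after removing
    leading whitespace markers and lowercasing (an assumed definition).
    """
    if not vocab:
        return 0.0, {}
    groups = defaultdict(list)
    for token in vocab:
        key = token.lstrip(space_marker + " ").lower()
        groups[key].append(token)
    # Count every entry beyond the first in each group as redundant.
    redundant = sum(len(members) - 1 for members in groups.values())
    duplicated = {k: v for k, v in groups.items() if len(v) > 1}
    return redundant / len(vocab), duplicated


if __name__ == "__main__":
    # Hypothetical toy vocabulary, not taken from any real tokenizer.
    toy_vocab = ["now", "Now", "▁now", "▁Now", "cat", "dog", "▁dog", "tree", "sky", "sea"]
    rate, duplicated = near_duplicate_rate(toy_vocab)
    print(rate, duplicated)
```

On a real tokenizer the result depends heavily on the normalization chosen, so the number this produces should not be expected to match the paper's.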
