I'm excited to announce my new lab: UCSD's Learning Meaning and Natural Language Lab.
a.k.a. LeM🍋N Lab!
And 📢WE ARE RECRUITING📢 PhD students to join us in sunny San Diego in either Linguistics OR Data Science. Apply by Dec 4: https://t.co/gCYN8eMk4A
More about the lab👇
Did you know that a constant learning rate followed by a cooldown works just as well as a cosine schedule? Also, it allows for training your LMs without fixing the number of steps beforehand, enabling cheaper scaling laws. Check out @haeggee’s new paper! https://t.co/LHwUAkCF0A
Why exactly do we train LLMs with the cosine schedule, still?🤔 Maybe we do not actually have to -- and that would come with a lot of benefits :)
🧵Our paper on LR schedules, compute-optimality and more affordable scaling laws
Why exactly do we train LLMs with the cosine schedule, still?🤔 Maybe we do not actually have to -- and that would come with a lot of benefits :)
🧵Our paper on LR schedules, compute-optimality and more affordable scaling laws
LLMs can do amazing things these days—not only in their main language (English?), but also in other ones! Our paper identifies a surprising *potential* reason why: language imbalance! (see caveats in 🧵!)
https://t.co/ToAb1L5HdO
+ @ravfogel T. Hofmann @tpimentelms@ImanolSchlag
@ManuelFaysse Although we couldn’t test at larger scales where semantic content might play a larger role. Interesting to see your comparisons with TinyLlama here! Nice work 🥐
@ManuelFaysse Agree that 50/50 EN/FR still seems like a good choice in this context, especially for French knowledge! Note that in our experiments benefits of imbalance were less clear for real languages:
https://t.co/JqpKpdnZlq
When investigating real languages, we still see lower-resource languages benefit from the main language. Yet, if imbalance itself causes better generalization is less clear. Benefits diminish with longer training and hidden states are not more aligned in the imbalanced setting.
@wendlerch@ravfogel@tpimentelms@ImanolSchlag Yes:
- For cloned languages (as in the plot), 90/10 is equivalent to 10/90 as the languages are equivalent.
- For EN and FR, we experiment with imbalances in both directions (Fig 3). The results are generally very symmetric. We just focus on the EN>FR direction to avoid clutter.
@annwitbrock@Dorialexander@alexjc These trends are not as clear for real languages https://t.co/JqpKpdnZlq
Check out the paper for details! We also have results on the impact of vocabulary overlap on generalization. This might be interesting in the context of languages with different scripts
When investigating real languages, we still see lower-resource languages benefit from the main language. Yet, if imbalance itself causes better generalization is less clear. Benefits diminish with longer training and hidden states are not more aligned in the imbalanced setting.
Overall, our results suggest an interesting feature of LM training dynamics: in some settings, having a dominant main language can aid sharing of model components across languages. Yet, leveraging such benefits in real multilingual settings isn’t as straightforward as we’d like.
When investigating real languages, we still see lower-resource languages benefit from the main language. Yet, if imbalance itself causes better generalization is less clear. Benefits diminish with longer training and hidden states are not more aligned in the imbalanced setting.
Does this mean we can improve performance by e.g. injecting character-level information? It’s not that easy: We find that naturally occurring near duplicates may not be as similar as anticipated. This limits the potential for performance improvements. Check our paper for details!
Did you know that most LLM’s vocabularies contain around 40% near duplicate entries? Check out our new work to learn more about how this may affect your model’s training efficiency!
https://t.co/3k2vjLHuKq (details in thread)
with T. Hofmann @ImanolSchlag@tpimentelms
We first investigate perfectly equivalent duplicates. While the model learns to align their representations, we find that duplication consistently leads to lower training efficiency. With a 40% duplicate rate (the typical LLM rate), we get a 10% decrease in data efficiency!