Preprint 🧵! How compartmentalized are LLMs?
For data in different formats (English/Chinese, Wiki/Q&A), how much transfer occurs? We provide evidence that LLMs can struggle with this sort of transfer, with consequences like sample inefficiency and capacity competition.
The setup: train on text split 50/50 across disjoint token vocabs and compare against an "A-only" baseline:
1️⃣ The sample efficiency gap shows up and persists across scale up to 1B
2️⃣ Representations are near-totally orthogonal; each split uses capacity independently, with a higher overall val loss plateau
3️⃣ We show that unified, lower-loss solutions exist, but SGD doesn't find them from a generic init
4️⃣ Massive parallel data fails to bridge representations
This looks to us like a potential problem in LLMs, and it also gives us the "no-sharing" baseline we're looking for.
We use this to find a preliminary result for natural multilingual transfer. ➡️
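For anyone who wants to picture the 50/50 disjoint-vocab setup, here is a minimal sketch. It assumes (this is a guess, not taken from the preprint) that "disjoint token vocabs" can be simulated by shifting the token ids of half the documents by the vocab size, so split B never shares an id with split A. The function name `split_disjoint_vocab` and its parameters are hypothetical.

```python
import random


def split_disjoint_vocab(docs, vocab_size, seed=0):
    """Tag half the documents as split B and shift their token ids.

    Assumption (not from the preprint): split B reuses the same text,
    but every token id is offset by `vocab_size`, so ids in A fall in
    [0, vocab_size) and ids in B fall in [vocab_size, 2 * vocab_size).
    """
    rng = random.Random(seed)
    order = list(range(len(docs)))
    rng.shuffle(order)
    # Exactly half of the documents (rounded down) go to split B.
    b_indices = set(order[: len(docs) // 2])

    result = []
    for i, doc in enumerate(docs):
        if i in b_indices:
            result.append(("B", [tok + vocab_size for tok in doc]))
        else:
            result.append(("A", list(doc)))
    return result
```

Under this assumption, the embedding table needs 2 * vocab_size rows, and an "A-only" baseline would simply skip the remapping and train on split A alone.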
We build on existing work showing that frontier performance on all sorts of transfer is more inconsistent than we might hope, especially after learning from trillions of tokens:
https://t.co/mYBiTyVoWk @NitCal
https://t.co/Au95cAwhWX @omerNLP
https://t.co/AC6IahZYI4 @LChoshen
Wanna check how well a model can share knowledge between languages? Of course you do! 🤩
But can you do it without access to the model’s weights? Now you can with ECLeKTic 🤯
The big labs are betting RL will unlock superhuman coding. But their infrastructure is closed, and OSS tooling doesn't support true online RL—just iterative batch optimization.
We're releasing ARES to close that gap 🧵
Thanks to:
- @grantpitt0, who helped create the original idea, provided invaluable feedback, and helped me debug a few cursed numerical bugs.
- @fleetwood___ for help with Ratchet (and pushing me to write a blog post).
- @bgub_ for helpful feedback.
💜
Train a language model in your browser with WebGPU!
I built a playground for training sequence models (Transformers, LSTMs, GRUs, vanilla RNNs) completely in your browser on synthetic tasks like sorting and simple natural language datasets like TinyStories. You can fiddle with 50+ experiment knobs to build your own model, which can be as big as you have the VRAM to accommodate.
You don't have to install anything—all you need is a browser with WebGPU support.
Check it out! Link to repo + blog post + features and technical details in the reply. 🧵