At CVPR this week for a talk on neural geometry of large vision models. If you’re interested in interpretability or joining @GoodfireAI, come say hi. 🤠
My big takeaway from our new work: saturation is the underrated key to learning. Always think about what concepts are saturating, because that’s when you get to learn the next one.
@ericjmichaud_ Thanks! FWIW, the toy task and theory are super simple and arguably a special case of some of the prior works (as we noted in related work discussion). However, the fact that the predictions of this account generalize to large scale models was very exciting to me. 😁
Very excited to have this paper out! We show that, by having more parameters, larger models see reduced interference between updates. This lets them retain memories of rarely observed samples of a task, eventually allowing them to learn even the tail end of the distribution. (1/3)
We take for granted that larger models are better than smaller ones, but why is this so? Our new paper, led by Jing Huang and @EkdeepL, traces this to a data-induced competition for resources (neurons), using formal analysis, idealized tasks, and real pretraining.
@solacebellamy @ChrisGPotts FWIW, one could argue such an experiment has happened before. GPT-3 was 175B params and is easily outperformed by models of 7-8B scale, because these models were trained on more and better data.
@solacebellamy @ChrisGPotts Your inference is correct, but the suggestion requires us to know what to train on. We'll have to change the data mixture in a manner that allows us to get, from a standard training pipeline, the performance of a larger model. (1/2)
@jiaxinwen22 @ChrisGPotts Hmm, curious to hear more. For context, the above is Olmo pretraining data and the task is just comparing numbers. Comparison for ordered concepts can be expected to be present in general pretraining data, and if you can isolate such data, I expect you'll see curves like this.
@jiaxinwen22 @ChrisGPotts Yeah agree with this. We made an intentional choice early on in the project to not bother with shared task structures just yet, since we didn't know what the dynamics without shared structure looked like. However, we fully intend to follow up soon. :)
Check out @AndrewLampinen's post and the shoutout to our work: I really loved the emphasis on inability to learn, more than the ability to forget! Gotta ask the right questions. :)
Speaking of blog posts, our coauthor @AndrewLampinen just did a post that, among other things, relates our results to themes of continual learning and catastrophic interference: https://t.co/YRHEhRJs6d
We've done a lot of work in the past on understanding how scaling enables learning of new abilities (see https://t.co/CNC5eXLNdk), but this is the first time we attacked param scaling, finding that for a given training process, having more params offers a genuine advantage. (2/3)
The most popular way to interpret AI is missing the bigger picture.
Models think in curved shapes. But sparse autoencoders (SAEs) work with straight lines.
Can they still capture models’ curved neural geometry? Yes, but not how you might think! (1/7)