New preprint: "Stability and Generalization in Looped Transformers"
Looped transformers are having a moment. Part of their appeal is the theoretical possibility of generalizing to harder problems simply by running more loops. But in practice, that often fails. 🧵
One main problem with post-norm in fixed-depth transformers is that the matrix power of the Jacobian creates vanishing gradients in backprop. I've discussed this more in my own work, but it's interesting that looped models seem to need this same Jacobian power to remain stable.
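To make the Jacobian-power point concrete, here is a rough sketch. It is not taken from the paper: the symbols $f_\theta$, $h_k$, $J$, $K$, and $h^\ast$ are my own notation, and I assume a weight-shared loop that settles near a fixed point.

```latex
% Looped update with shared weights, run for K loops (notation assumed)
h_{k+1} = f_\theta(h_k), \qquad
\frac{\partial h_K}{\partial h_0} = \prod_{k=0}^{K-1} J_k, \qquad
J_k = \left.\frac{\partial f_\theta}{\partial h}\right|_{h = h_k}

% Near a fixed point h^* = f_\theta(h^*), every J_k is close to one matrix J,
% so the product behaves like a matrix power:
\frac{\partial h_K}{\partial h_0} \approx J^{K}, \qquad
\lim_{K \to \infty} \lVert J^{K} \rVert = 0 \iff \rho(J) < 1
```

The same condition $\rho(J) < 1$ that shrinks gradients through the loop is what makes the fixed point locally attracting, so under these assumptions stability and vanishing gradients are two views of the same Jacobian power.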
Introducing 🔁 Awesome-Loop-Models: a curated repo for keeping up with loop models!
Whether you are just entering the field or have been exploring loop models for a while, this repo is built to serve as an actively updated map for mechanism analysis, architecture and algorithm design, applications, and related directions.
🧵 [1/n]
@RidgerZhu Very cool write-up. In my own work I’ve found that smaller looped models w/ higher LR often mimic many of the problems of larger models w/ lower LR, which makes it easier to avoid unstable architectures before scaling
@DimaKrotov The idea of "reasoning in latent space" is what got me working on looped transformers in the first place. Really cool to see the energy framing, I think there are some clean relationships between looped transformers and energy minimization at basins.
Companies love to talk about how long reasoning times 'solve' intelligence. This paper shows that how you use the reasoning loop and create the right iteration architecture matters a lot.
@josephdviviano When I started working with looped TFs ~a year ago, I was constantly annoyed at how unpredictably they failed. Ended up writing theory on when this happens -- hopefully it saves future researchers those first few months.
https://t.co/x9xC9Zp4y9
@JFPuget One of the theoretical benefits of looped transformers in particular is their ability to run for **more** loops than in training to solve harder problems. Whether they do in all cases is... complex
https://t.co/x9xC9Zp4y9
Full paper: https://t.co/bkxrXA1kLN
I’ll be at ICLR in Rio next week presenting a different paper on tabular ML.
If you’re working on looped/recurrent models, test-time compute, or tabular ML, I’d love to chat in person.
I find:
- Without recall, looped models act like basin selectors rather than smooth input-dependent algorithms
- Recall helps preserve input dependence, but models are often still fragile
- Outer normalization broadens the parameter regions over which the models are stable
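A toy sketch of the first two findings, not the paper's model: I assume "recall" means re-injecting the input at every loop, and the scalar map, `GAIN`, and `run_loop` are hypothetical names chosen for illustration. Outer normalization is not modeled here.

```python
import math

GAIN = 1.5  # hypothetical loop gain; > 1 gives two attracting fixed points


def run_loop(x: float, recall: bool, n_loops: int = 100) -> float:
    """Iterate a scalar toy 'looped block': h <- tanh(GAIN * h [+ x])."""
    h = x  # the input sets the initial state in both variants
    for _ in range(n_loops):
        pre = GAIN * h
        if recall:
            pre += x  # assumed meaning of recall: re-inject the input each loop
        h = math.tanh(pre)
    return h


if __name__ == "__main__":
    for x in (0.1, 0.9, -0.1):
        print(x, run_loop(x, recall=False), run_loop(x, recall=True))
```

In this toy, the no-recall loop only keeps the sign of the input: 0.1 and 0.9 end at the same fixed point, and -0.1 ends at its mirror image. With recall, the final state still varies with the input's magnitude.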
@heyanuja @papertrailshq So funnily enough, I saw that post a few months ago and started making my own version -- never got far since I had other projects, but I looked into research databases and there are really cool existing open-source ones that would just require API calls/downloading, no scraping!