🚨New Preprint! Wondered how grid cells form multiple discrete modules? Interested in continuous attractors and modularity? With @FieteGroup, we discover + generalize a physical mechanism for forming modules from smoothly varying parameters in a dynamical system!👇(1/15)
couldn’t agree more
my bias since day one: deep learning is absurdly flexible to succeed, if math/physics don’t forbid it, and we get *opt & data* right.
it just works
Parallax/Muon is one example; models with dynamics such as feedback loops yet another happening rn
the broader implication is that there's abandoned architecture research from before Muon that failed because the empirical optimizers that worked in practice were, both literally and conceptually, stuck in element-wise local minima
For me, the coolest finding is that Muon optimizer is crucial for Parallax to move beyond Softmax Attention.
Lesson — don't evaluate new architectures solely under AdamW, you'll miss the good ones.
paper: https://t.co/yAqClXrJUz
code: https://t.co/D4pgIr1wM7
For the origin of Parallax, check out the LLA paper at ICLR 2026:
paper: https://t.co/85OzoOQlnF
code: https://t.co/eqMYZ0U6qO
@_arohan_@torchcompiled True (but figure 1 shows AdamW can’t be saved by good signal prop init)
Pieces of the puzzle have been there before for sure — https://t.co/YOu6CBTe6M
No Neocloud ever imagined they’d be renting out H100s today at higher prices than 3 years ago.
Even if you have money, frontier labs and Neolabs have already locked up most of the 2026 GPU supply.
There is basically infinite demand for artificial intelligence.
Our new paper was accepted at ICML!
1) Momentum isn’t just “SGD but faster”.
It affects sharpness (of orders of magnitude!)
2) The usual story says momentum lets you train in sharper regions.
That’s true for large batches only! The opposite is true for minibatches!
Orthogonal Finetuning (https://t.co/IlBYlgiaae; https://t.co/Mve4Pdptmv) has a unique advantage of preventing catastrophic forgetting. Inspired by this property, we find that merging models within the orthogonal group can effectively reduce model conflicts and preserve both pretraining and downstream knowledge. This is our OrthoMerge framework.
The idea behind OrthoMerge is extremely simple. For OFT-tuned models, we can first map the orthogonal adapters to Lie algebra with inverse Carley transform and then perform merging there. This guarantees the merged model differs from the pretrained model only up to an orthogonal transformation.
A better news is that OrthoMerge can also be applied to non-OFT-tuned models. By solving the orthogonal procrustes problem, we can have the projected component of the adapter onto the orthogonal group. OrthoMerge will then be applied there and the residual component can be merged using conventional merging methods. That said, OrthoMerge can be used together with existing model merging methods!
This is a great example of simple yet effective ideas. Great efforts by my PhD students Sihan Yang and Kexuan Shi. The project is already open-sourced and feel free to give it a try!
Project: https://t.co/Fzjrn0zpaW
Paper: https://t.co/QvFafN1UeY
Code: https://t.co/LjEzcLZ0De