1/4
New paper with @weijie444!
We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduct, get an end-to-end layerwise optimizer stack where every major matrix-valued parameter (embeddings, LM heads, SwiGLU MLPs, MoE routers) has its own principled update!
📝 https://t.co/M9z868by4q
💻 https://t.co/BCILN5V5mp
RIP Prof. Dimitri Bertsekas. His textbooks on optimization and RL have been the standard reference and inspiration for many years for optimization researchers like me.
Introducing Compositional Muon, an optimizer that extends Muon from individual matrices to composed transformer circuits.
Modern optimizers usually draw trust regions around individual parameters. But in attention, the loss often sees compositions like QK^T and OV. Updating each factor independently can therefore control the wrong object.
Compositional Muon closes this gap by deriving partner-whitened update rules. Each factor’s update is shaped by the spectral geometry of the matrix it is composed with, producing more stable composed updates and better effective learning-rate allocation across heads and layers.
For QK, this gives a head-local half-split rule. For OV, the circuit geometry selects a hybrid rule: V is optimized per head, while W_O is optimized as the single matrix that aggregates all heads back into the residual stream.
CM improves over Muon at 340M and 1B scale, transfers to the modded-nanoGPT optimization benchmark, and can be approximated cheaply as partner-rescaled Muon via the isotropic rule.
The broader point is optimizer-architecture co-design: better optimizers should not only ask how to update a parameter, but what composed circuit that parameter participates in. CM is one step toward optimizers that respect the functional structure the loss actually sees.
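A rough illustration of the "wrong object" point above (a sketch, not CM itself). The names `orthogonalize` and `composed_change_norm` are made up for this example, and SVD stands in for whatever orthogonalization a real Muon implementation uses. It shows that a unit-spectral-norm step on Q changes QK^T to first order by an amount set entirely by the partner K:

```python
import numpy as np

def orthogonalize(g):
    """Muon-style direction: set every singular value of g to 1 (SVD used here for clarity)."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

def composed_change_norm(dq, k):
    """Spectral norm of dq @ k.T, the first-order change in q @ k.T when only q moves."""
    return np.linalg.norm(dq @ k.T, ord=2)

rng = np.random.default_rng(0)
d_model, d_head = 16, 4          # hypothetical sizes for one attention head
k = rng.standard_normal((d_model, d_head))
dq = orthogonalize(rng.standard_normal((d_model, d_head)))

factor_norm = np.linalg.norm(dq, ord=2)         # trust region on the factor itself
small = composed_change_norm(dq, k)             # what the composed map QK^T sees
large = composed_change_norm(dq, 10.0 * k)      # same factor step, partner 10x larger
print(factor_norm, small, large)
```

The step on Q has the same size in both cases, yet the change in QK^T scales with K. That is the gap a partner-aware rule would have to account for.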
6/6 This also answers @CsabaSzepesvari's earlier question: is there an easy-to-explain benefit beyond "it just works"? Yes. When a symmetry is redundant, an unaware optimizer still drifts along it for nothing. Respecting it removes that drift, which shows up as controlled final-logit growth, not only lower loss: https://t.co/OiPpRsKzyw
1/6 🧵 New revision of our paper on symmetry-compatible optimizer design (w/ @weijie444). The update sharpens one idea: softmax shift-invariance in LM heads (and MoE routers) isn't a footnote. It's a quotient geometry the optimizer should respect. And we back it with a new logit-control study.
📄 https://t.co/8o7STnPLAs (v2 only had citation updates and some of the additional empirical results)
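A quick numerical check of the shift-invariance mentioned above (a minimal sketch, not the paper's code; `softmax` and `cross_entropy_grad` are helper names introduced here). Adding a constant to all logits leaves the probabilities unchanged, and the cross-entropy gradient has zero component along the all-ones direction. A sign-based elementwise step, used here only as a stand-in for a symmetry-unaware update, still moves along that direction:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize; does not change the output
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy_grad(logits, target):
    """Gradient of -log softmax(logits)[target] with respect to logits: p - onehot."""
    g = softmax(logits)
    g[target] -= 1.0
    return g

rng = np.random.default_rng(0)
logits = rng.standard_normal(8)
shifted = logits + 3.7                       # shift every logit by the same constant

same_probs = np.allclose(softmax(logits), softmax(shifted))
grad = cross_entropy_grad(logits, target=2)
grad_sum = grad.sum()                        # component along the all-ones direction
sign_step_sum = np.sign(grad).sum()          # elementwise sign step: nonzero drift
print(same_probs, grad_sum, sign_step_sum)
```

The gradient never asks for a uniform shift of the logits, but the sign step applies one anyway. That is one plausible way redundant drift could show up.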
@weijie444 5/6 Takeaway: take layerwise symmetry seriously and softmax shift-invariance becomes a design constraint, not a curiosity. The same quotient geometry applies to MoE routers (Section G.4.1), but the LM head logit story is the clearest illustration.
Personal update: I've joined OpenAI while on leave from Wharton. After a decade away, glad to be back in the Bay Area and train AI models here!
One more thing, I've been promoted to full professor, a decade-long journey made possible by many, especially my students.
Exactly. If you've got the driving direction wrong, no amount of tuning the speed, the gear-shift schedule, or bolting accessories onto the car (read: learning rate, LR schedules, momentum, training-objective tweaks, architectural patches) fixes the underlying problem. You'll still arrive at the wrong place, just with different symptoms each time. A lot of the vectorized-optimizer tuning era reads this way in retrospect.
Which is the thesis behind our symmetry-compatible optimizer work (https://t.co/M9z868by4q): the update rule should be equivariant under the symmetry group acting on each weight block, derived per block type. We work this out end-to-end as a layerwise optimizer stack covering embeddings, LM heads, SwiGLU MLPs, and MoE routers, i.e. the whole modern transformer/MoE, not a single block in isolation.
Your second point falls out of this. Once update direction respects symmetry, architecture design escapes the same flattening assumption, and the two can be co-designed.
@plugyawn Hmm maybe but it feels like classical vectorization-based optimizers are just fundamentally the wrong way of looking at deep learning. Not just in optimizer design but even in the design of architectures themselves…
Thanks so much for your interest!
Yes! If the update map does not preserve the equivariance, then the optimizer iterates depend on arbitrary coordinate choices: rotating the input space (orthogonal equivariance) or permuting the output space (permutation equivariance) can change the optimizer itself, leading to different training dynamics under equivalent reparameterizations. Having the dynamics depend on the choice of coordinates at each iterate is undesirable, likely leading to slower convergence and potentially unstable training, especially as the matrix dimensions grow.
Definitely, it leads to a speedup (see Muon vs Adam for an instance of bi-orthogonal equivariance, and embeddings and LM heads for LPRO equivariance in our paper).
There are also implications for initializations/datasets. For instance, for input embeddings and LM heads, if we re-index the vocab in the tokenizer and re-tokenize the data, we should still get the same model trained on the newly tokenized dataset, provided the optimizer updates preserve permutation equivariance in the vocab dimension. However, this is not the case if permutation equivariance isn't preserved. Likewise, for orthogonal equivariance, if we perform an orthonormal change of basis on the initializations, spectral optimizers should give the same model, but coordinate-wise adaptive gradient methods should not. That said, I could imagine reduced sensitivity to the choice of initializations and other hyperparameters for symmetry-compatible optimizers. We have a minimal base LR sweep for one of the pre-training experiments in the paper.
The current convergence theory only concerns the convergence of layerwise loss functions, so I think it is hard to comment on local optima avoidance or generalization properties at this point.
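The change-of-basis claim above can be checked on a toy gradient (a sketch under simplifying assumptions: `spectral_update` uses an exact SVD as a stand-in for a Muon-style step, and `sign_update` is a crude stand-in for a coordinate-wise adaptive method; both names are invented here). An update map U is equivariant under an input rotation R if U(G R) = U(G) R:

```python
import numpy as np

def spectral_update(g):
    """Polar factor of g: singular values replaced by 1."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

def sign_update(g):
    """Elementwise sign, a simple proxy for coordinate-wise adaptive steps."""
    return np.sign(g)

rng = np.random.default_rng(0)
g = rng.standard_normal((4, 6))                    # toy gradient of a 4x6 weight
r, _ = np.linalg.qr(rng.standard_normal((6, 6)))   # random orthogonal change of basis

# Equivariance test: update of the rotated gradient vs rotated update.
spectral_ok = np.allclose(spectral_update(g @ r), spectral_update(g) @ r)
sign_ok = np.allclose(sign_update(g @ r), sign_update(g) @ r)
print(spectral_ok, sign_ok)
```

In this toy setting the spectral step commutes with the rotation and the elementwise step does not, which is the sense in which training dynamics can depend on an arbitrary coordinate choice.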
Really cool paper and it actually made me realize something.
Addressing the permutation-invariance symmetry seems genuinely useful, and I'd imagine it gets even more complex for attention specifically, even in the early phases of training.
While MLPs have per-neuron permutation invariance, attention heads are a bit more complex in that:
- Permuting features within a given head gives an equivalent model
- BUT the same permutation must be done on Q K and V
- then permuting heads themselves is fine, similar to observations around MoE experts
One thing I'd want to understand better: I imagine permuting dimensions is pretty easily equivalent under descent? Arbitrary rotations of the basis, though, might have tangible effects on optimizers like Adam, which track elementwise dynamics.
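A small check of the within-head permutation point, restricted to the Q/K pair (a toy sketch; the shapes and the names `w_q`, `w_k` are assumptions for illustration). A shared permutation of the head dimension leaves QK^T unchanged, a mismatched one does not, and an elementwise sign step commutes with the permutation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4                        # hypothetical sizes for a single head
w_q = rng.standard_normal((d_model, d_head))
w_k = rng.standard_normal((d_model, d_head))
perm = rng.permutation(d_head)                # permutation of the head's features
other = np.roll(perm, 1)                      # a different permutation

qk = w_q @ w_k.T
qk_shared = w_q[:, perm] @ w_k[:, perm].T         # same permutation on Q and K
qk_mismatched = w_q[:, perm] @ w_k[:, other].T    # different permutations

shared_ok = np.allclose(qk, qk_shared)
mismatched_ok = np.allclose(qk, qk_mismatched)

# Elementwise updates commute with permutations: permute-then-sign == sign-then-permute.
g = rng.standard_normal((d_model, d_head))
sign_commutes = np.array_equal(np.sign(g[:, perm]), np.sign(g)[:, perm])
print(shared_ok, mismatched_ok, sign_commutes)
```

So permutations are the easy case for elementwise methods. General rotations of the basis are where they stop being equivariant.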