Tim Lau @timlautk - Twitter Profile

Pinned Tweet

20 days ago

1/4 New paper with @weijie444! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduct, get an end-to-end layerwise optimizer stack where every major matrix-valued parameter (embeddings, LM heads, SwiGLU MLPs, MoE routers) has its own principled update! 📝 https://t.co/M9z868by4q 💻 https://t.co/BCILN5V5mp

4

140

31

120

31K

Tim Lau

@timlautk

1 day ago

@michaelchchoi Those stores were closed many years ago though…

0

1

0

65

Tim Lau

@timlautk

2 days ago

RIP Prof. Dimitri Bertsekas. His textbooks on optimization and RL have been the standard reference and inspiration for many years for optimization researchers like me.

Photo: https://t.co/lVPWsWubA0

1

61

5

24

3K

Tim Lau

@timlautk

2 days ago

Another exciting piece of work on architecture--optimizer co-design.

Tilde

@tilderesearch

3 days ago

Introducing Compositional Muon, an optimizer that extends Muon from individual matrices to composed transformer circuits.

Modern optimizers usually draw trust regions around individual parameters. But in attention, the loss often sees compositions like QK^T and OV. Updating each factor independently can therefore control the wrong object.

Compositional Muon closes this gap by deriving partner-whitened update rules. Each factor’s update is shaped by the spectral geometry of the matrix it is composed with, producing more stable composed updates and better effective learning-rate allocation across heads and layers.

For QK, this gives a head-local half-split rule. For OV, the circuit geometry selects a hybrid rule: V is optimized per-head, while W_O is optimized as the single matrix that aggregates all heads back into the residual stream.

CM improves over Muon at 340M and 1B scale, transfers to the modded-nanoGPT optimization benchmark, and can be approximated cheaply as partner-rescaled Muon via the isotropic rule.

The broader point is optimizer-architecture co-design: better optimizers should not only ask how to update a parameter, but what composed circuit that parameter participates in. CM is one step toward optimizers that respect the functional structure the loss actually sees.
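To make the "wrong object" point concrete, here is a minimal NumPy sketch. It is not the Compositional Muon update rule (the thread does not spell that out); it only illustrates that a step of fixed spectral norm on one factor changes the composed map W_q W_k^T by an amount that scales with the partner factor. All names (`W_q`, `W_k`, `dW_q`, `composed_change`) and the toy dimensions are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4  # hypothetical toy sizes


def spectral_norm(a):
    # Largest singular value of a 2-D array.
    return np.linalg.norm(a, 2)


W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))

# A per-factor step on W_q, normalized to spectral norm 1,
# i.e. a trust region drawn around the single parameter.
dW_q = rng.standard_normal((d_model, d_head))
dW_q = dW_q / spectral_norm(dW_q)


def composed_change(partner_scale):
    # Size of the change in W_q W_k^T caused by the same step dW_q
    # when the partner factor W_k is rescaled.
    return spectral_norm(dW_q @ (partner_scale * W_k).T)


small = composed_change(0.1)
large = composed_change(10.0)
ratio = large / small  # the partner was scaled by 100x, and so is the change

# The composition itself does not see a rescaling shared between the factors.
c = 3.0
same_product = np.allclose((c * W_q) @ (W_k / c).T, W_q @ W_k.T)

print(ratio, same_product)
```

The step on `W_q` has the same size in both cases, yet its effect on the composed matrix differs by the partner's scale, which is the gap a composition-aware rule would need to account for.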

8

399

53

340

85K

0

4

0

428

Tim Lau

@timlautk

5 days ago

6/6 This also answers @CsabaSzepesvari's earlier question: is there an easy-to-explain benefit beyond "it just works"? Yes. When a symmetry is redundant, an unaware optimizer still drifts along it for nothing. Respecting it removes that drift, which shows up as controlled final-logit growth, not only lower loss: https://t.co/OiPpRsKzyw

0

3

1

0

123

Tim Lau

@timlautk

5 days ago

1/6 🧵 New revision of our paper on symmetry-compatible optimizer design (w/ @weijie444). The update sharpens one idea: softmax shift-invariance in LM heads (and MoE routers) isn't a footnote. It's a quotient geometry the optimizer should respect. And we back it with a new logit-control study. 📄 https://t.co/8o7STnPLAs (v2 only had citation updates and some of the additional empirical results)
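A small NumPy sketch of what softmax shift-invariance means in practice. It shows three things: adding a constant to every logit leaves the softmax output unchanged, the cross-entropy gradient with respect to the logits has no component along that redundant all-ones direction, and an elementwise sign step (used here only as a stand-in for a coordinate-wise update) does move along it. The centering at the end is an illustrative projection, not the construction from the paper; `vocab`, `target`, and the variable names are assumptions.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over a 1-D array of logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


rng = np.random.default_rng(0)
vocab = 8   # hypothetical toy vocabulary size
target = 3  # hypothetical target token index
logits = rng.standard_normal(vocab)

# 1) Shift-invariance: softmax(z + c * 1) == softmax(z).
p = softmax(logits)
shift_invariant = np.allclose(p, softmax(logits + 5.0))

# 2) Cross-entropy gradient w.r.t. logits is p - onehot(target);
#    its entries sum to zero, so it is orthogonal to the all-ones direction.
grad = p.copy()
grad[target] -= 1.0
grad_sum = grad.sum()

# 3) An elementwise sign step is +1 everywhere except the target (-1),
#    so it has a nonzero component along the redundant direction.
sign_step = np.sign(grad)
sign_sum = sign_step.sum()  # vocab - 2

# Removing the mean projects the step back onto the zero-sum subspace.
centered_step = sign_step - sign_step.mean()
centered_sum = centered_step.sum()

print(shift_invariant, grad_sum, sign_sum, centered_sum)
```

The nonzero `sign_sum` is movement that cannot change any predicted probability, which is one way to picture drift along a redundant symmetry.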

1

19

1

8

2K

Tim Lau

@timlautk

5 days ago

@weijie444 5/6 Takeaway: take layerwise symmetry seriously and softmax shift-invariance becomes a design constraint, not a curiosity. The same quotient geometry applies to MoE routers (Section G.4.1), but the LM head logit story is the clearest illustration.

1

3

1

0

171

Tim Lau

@timlautk

6 days ago

@QuanquanGu All the best! Exciting journey ahead!

1

5

0

2K

Tim Lau

@timlautk

7 days ago

@zhaoran_wang Thanks so much!

0

48

Tim Lau

@timlautk

9 days ago

@weijie444 Congrats on both!! 🎉 My privilege to be part of the journey these past two years!

0

3

0

3K

timlautk retweeted

Weijie Su

@weijie444

9 days ago

Personal update: I've joined OpenAI while on leave from Wharton. After a decade away, glad to be back in the Bay Area and train AI models here! One more thing, I've been promoted to full professor, a decade-long journey made possible by many, especially my students.

Photo: https://t.co/lSbCABlwlY

96

2K

100

224

628K

Tim Lau

@timlautk

12 days ago

Exactly. If you've got the driving direction wrong, no amount of tuning the speed, the gear-shift schedule, or bolting accessories onto the car (read: learning rate, LR schedules, momentum, training-objective tweaks, architectural patches) fixes the underlying problem. You'll still arrive at the wrong place, just with different symptoms each time. A lot of the vectorized-optimizer tuning era reads this way in retrospect. Which is the thesis behind our symmetry-compatible optimizer work (https://t.co/M9z868by4q): the update rule should be equivariant under the symmetry group acting on each weight block, derived per block type. We work this out end-to-end as a layerwise optimizer stack covering embeddings, LM heads, SwiGLU MLPs, and MoE routers, i.e. the whole modern transformer/MoE, not a single block in isolation. Your second point falls out of this. Once update direction respects symmetry, architecture design escapes the same flattening assumption, and the two can be co-designed.

Tarun Kathuria

@_TarunKathuria

12 days ago

@plugyawn Hmm maybe but it feels like classical vectorization based optimizers are just fundamentally the wrong way of looking at deep learning. Not just in optimizer design but even in the design of architectures themselves…

1

3

0

2

3K

1

14

2

12

3K

timlautk retweeted

The Shaw Prize @ShawPrize

12 days ago

The #ShawPrize in #MathematicalSciences 2026 is awarded in equal shares to Emmanuel Candès and Camillo De Lellis @Stanford @the_IAS #Shawlaureates2026

Photo: https://t.co/H1yxh5HXqX

1

44

12

3

33K

Tim Lau

@timlautk

17 days ago

@maximelabonne What if you also need another optimizer for the embedding but not Adam? See https://t.co/PdFMOT0Wqg

0

5

1

0

814

Tim Lau

@timlautk

18 days ago

Some additional takes on the "obvious benefits" of symmetry-compatible optimizers

Tim Lau

@timlautk

18 days ago

Thanks so much for your interest! Yes! If the update map does not preserve the equivariance, then the optimizer iterates depend on arbitrary coordinate choices: rotating the input space (orthogonal equivariance) or permuting the output space (permutation equivariance) can change the optimizer itself, leading to different training dynamics under equivalent reparameterizations. Changing the choice of coordinates at each iterate is undesirable, likely leading to slower convergence and potentially unstable training, especially when the dimensions of the matrices grow. It definitely leads to speedup (see Muon vs Adam for an instance of bi-orthogonal equivariance, and embeddings and LM heads for LPRO equivariance in our paper).

There are also implications for initializations/datasets. For instance, for input embeddings and LM heads, if we perform a vocab re-indexing in the tokenizer and re-tokenize the data, we should still get the same model trained on the newly tokenized dataset if the optimizer updates preserve the permutation equivariance in the vocab dimension. However, this is not the case if permutation equivariance isn't preserved. Likewise, for orthogonal equivariance, if we perform an orthonormal change of basis for the initializations, spectral optimizers should give the same model, but coordinate-wise adaptive gradient methods should not.

That being said, I could imagine reduced sensitivity to the choice of initializations and other hyperparameters for symmetry-compatible optimizers. We have a minimal base LR sweep for one of the pre-training experiments in the paper. The current convergence theory only concerns the convergence of layerwise loss functions, so I think it is hard to comment on local optima avoidance or generalization properties at this point.
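The equivariance claims above can be checked numerically on a toy gradient matrix. The sketch below compares two update maps under a change of coordinates G -> L G R^T: an orthogonalized (polar-factor) update computed by SVD, used here as a stand-in for a spectral optimizer, and an elementwise sign update, used as a stand-in for a coordinate-wise method. Neither is the exact update from the paper or from Muon/Adam; the function names and matrix sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def polar_update(g):
    # Orthogonalized gradient U V^T from the reduced SVD g = U S V^T.
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


def sign_update(g):
    # Elementwise update: depends on the coordinate system.
    return np.sign(g)


def random_orthogonal(n):
    # Orthogonal matrix from the QR factorization of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


def random_permutation(n):
    # Permutation matrix: rows of the identity in shuffled order.
    return np.eye(n)[rng.permutation(n)]


def is_equivariant(update, g, left, right):
    # Equivariance means update(L g R^T) == L update(g) R^T.
    return np.allclose(update(left @ g @ right.T),
                       left @ update(g) @ right.T)


g = rng.standard_normal((6, 4))  # toy gradient, full column rank
rot_l, rot_r = random_orthogonal(6), random_orthogonal(4)
perm_l, perm_r = random_permutation(6), random_permutation(4)

results = {
    "polar under rotation": is_equivariant(polar_update, g, rot_l, rot_r),
    "sign under rotation": is_equivariant(sign_update, g, rot_l, rot_r),
    "polar under permutation": is_equivariant(polar_update, g, perm_l, perm_r),
    "sign under permutation": is_equivariant(sign_update, g, perm_l, perm_r),
}
for name, ok in results.items():
    print(f"{name}: {ok}")
```

Both maps commute with permutations, which matches the vocab re-indexing example, but only the polar-factor update commutes with a general orthonormal change of basis.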

0

6

0

1

2K

0

7

1

5

2K

timlautk retweeted

Ethan

@torchcompiled

19 days ago

Really cool paper and it actually made me realize something.

Addressing the permutation invariance symmetry seems properly useful, and this feels like it drives even more complexity for attention specifically I’d imagine even the early phases of training.

While MLPs have per-neuron permutation invariance, attention heads are a bit more complex in that
- Permuting features within a given head is equivalent
- BUT the same permutation must be done on Q K and V
- then permuting heads themselves is fine, similar to observations around MoE experts

One thing I’d want to understand better is I imagine permutation of dimensions is pretty easily equivalent in descent? Though arbitrary rotations of basis might have tangible effects on how optimizers like Adam which consider elementwise dynamics work
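The within-head permutation symmetry described in the list above can be checked on a toy single-head attention layer. In this sketch, which assumes a plain softmax attention head with a separate output projection `W_o`, the query and key columns share one permutation of the head dimension and the value columns share one with the rows of `W_o`; all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, d_head = 5, 8, 4  # hypothetical toy sizes

x = rng.standard_normal((seq, d_model))
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))
W_v = rng.standard_normal((d_model, d_head))
W_o = rng.standard_normal((d_head, d_model))


def softmax_rows(s):
    # Row-wise numerically stable softmax.
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)


def head(x, W_q, W_k, W_v, W_o):
    # Single attention head followed by its output projection.
    scores = (x @ W_q) @ (x @ W_k).T / np.sqrt(W_q.shape[1])
    return softmax_rows(scores) @ (x @ W_v) @ W_o


# Fixed, non-identity permutations of the head dimension.
perm_qk = np.roll(np.arange(d_head), 1)
perm_vo = np.arange(d_head)[::-1]

out = head(x, W_q, W_k, W_v, W_o)

# Paired permutations leave the head's function unchanged.
out_paired = head(x, W_q[:, perm_qk], W_k[:, perm_qk],
                  W_v[:, perm_vo], W_o[perm_vo, :])
paired_ok = np.allclose(out, out_paired)

# Permuting only one factor of a pair changes the function.
out_unpaired = head(x, W_q[:, perm_qk], W_k, W_v, W_o)
unpaired_ok = np.allclose(out, out_unpaired)

print(paired_ok, unpaired_ok)
```

The symmetry acts on pairs of matrices rather than on any single matrix, which is why a per-matrix view of the parameters does not capture it directly.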

3

37

7

32

5K
