mikail @Gradientdinner - Twitter Profile

Pinned Tweet

over 1 year ago

This paper got into @Nature!!! 🚀🚀🚀 Look at @SarthakChandra’s thread for a summary https://t.co/uOrPM3Stid

almost 3 years ago

🚨New Preprint! Wondered how grid cells form multiple discrete modules? Interested in continuous attractors and modularity? With @FieteGroup, we discover + generalize a physical mechanism for forming modules from smoothly varying parameters in a dynamical system!👇(1/15)

3

169

57

81

29K

4

107

14

25

13K

mikail @Gradientdinner

about 9 hours ago

@stochasticchasm 💀

0

1

0

13

mikail @Gradientdinner

4 days ago

@yauchungyiu Isn’t this like SNOO @vinaysrao

2

3

0

246

Gradientdinner retweeted

Zhengyang Geng

@ZhengyangGeng

5 days ago

couldn’t agree more. my bias since day one: deep learning is absurdly flexible to succeed, if math/physics don’t forbid it, and we get *opt & data* right. it just works. Parallax/Muon is one example; models with dynamics such as feedback loops are yet another, happening rn

2

74

4

51

12K

Gradientdinner retweeted

kalomaze

@kalomaze

5 days ago

the broader implication is that there's abandoned architecture research from before Muon that failed because the empirical optimizers that worked in practice were, both literally and conceptually, stuck in element-wise local minima

9

324

26

152

40K

Gradientdinner retweeted

Yifei Zuo

@YifeiZuoX

6 days ago

For me, the coolest finding is that the Muon optimizer is crucial for Parallax to move beyond Softmax Attention.

Lesson: don't evaluate new architectures solely under AdamW, you'll miss the good ones.

paper: https://t.co/yAqClXrJUz
code: https://t.co/D4pgIr1wM7

For the origin of Parallax, check out the LLA paper at ICLR 2026:
paper: https://t.co/85OzoOQlnF
code: https://t.co/eqMYZ0U6qO

6

350

45

272

76K

mikail @Gradientdinner

18 days ago

@CoreAutoAI A hack to get good init for posttraining? 😆

0

3

0

241

mikail @Gradientdinner

22 days ago

@stochasticchasm Post-NeurIPS arXiving? Coinciding with the @kellerjordan0 optimizer track gaining momentum?

1

0

144

mikail @Gradientdinner

23 days ago

@tianylin @mikechrzano

0

2

0

221

mikail @Gradientdinner

24 days ago

@_arohan_ @torchcompiled True (but figure 1 shows AdamW can’t be saved by good signal prop init) Pieces of the puzzle have been there before for sure — https://t.co/YOu6CBTe6M

0

1

0

58

Gradientdinner retweeted

Thinking Machines

@thinkymachines

24 days ago

With the model's simultaneous speech capability, Horace has gotten a lot easier to work with recently.

45

1K

61

245

273K

mikail @Gradientdinner

24 days ago

@_arohan_ @torchcompiled How many such architectural choices have been made because everyone AdamW’ed everything by default

1

0

1

64

mikail @Gradientdinner

24 days ago

@_arohan_ @torchcompiled 👀

1

0

174

mikail @Gradientdinner

24 days ago

@torchcompiled 404

1

0

51

mikail @Gradientdinner

27 days ago

@_arohan_ This is from @evaninwords

1

2

0

101

mikail @Gradientdinner

27 days ago

@_arohan_ Randomly re-initialize row/column of weight corresponding to dead neuron 🧠

1

2

0

108
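
The dead-neuron reply above can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not taken from the thread: a two-layer ReLU network stored as plain lists, a unit counted as "dead" when it outputs zero on every input in a batch, Gaussian re-draws with a hypothetical `scale`, and a bias reset to zero. The names `dead_neurons` and `reinit_dead` are hypothetical.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def hidden_activations(W1, b1, batch):
    """ReLU activations of the hidden layer for every input in the batch."""
    return [[relu(sum(w * x for w, x in zip(row, inp)) + b)
             for row, b in zip(W1, b1)]
            for inp in batch]

def dead_neurons(W1, b1, batch):
    """Indices of hidden units that output zero on every input (assumed criterion)."""
    acts = hidden_activations(W1, b1, batch)
    return [j for j in range(len(W1)) if all(a[j] == 0.0 for a in acts)]

def reinit_dead(W1, b1, W2, batch, rng, scale=0.5):
    """Redraw the incoming row of W1 and the outgoing column of W2 for dead units."""
    dead = dead_neurons(W1, b1, batch)
    for j in dead:
        W1[j] = [rng.gauss(0.0, scale) for _ in W1[j]]  # incoming weights (row j)
        b1[j] = 0.0                                     # assumption: reset the bias
        for out_row in W2:
            out_row[j] = rng.gauss(0.0, scale)          # outgoing weights (column j)
    return dead

rng = random.Random(0)
W1 = [[0.5, -0.2], [-1.0, -1.0]]   # hidden x input; unit 1 has all-negative weights
b1 = [0.1, -5.0]                   # and a large negative bias, so it never fires
W2 = [[0.3, 0.7]]                  # output x hidden
batch = [[1.0, 2.0], [0.5, 0.1], [2.0, 0.3]]   # non-negative inputs

revived = reinit_dead(W1, b1, W2, batch, rng)
print("re-initialized units:", revived)
```

Only the row and column belonging to the dead unit are touched, so the weights of live units are left as they were. Whether a re-drawn unit actually fires afterwards depends on the random draw.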

mikail @Gradientdinner

about 1 month ago

@darshil Alfonso

0

1

0

117

Gradientdinner retweeted

Yuchen Jin

@Yuchenj_UW

about 1 month ago

No Neocloud ever imagined they’d be renting out H100s today at higher prices than 3 years ago. Even if you have money, frontier labs and Neolabs have already locked up most of the 2026 GPU supply. There is basically infinite demand for artificial intelligence.

32

427

16

70

54K

Gradientdinner retweeted

Pierfrancesco Beneventano

@PierBeneventano

about 1 month ago

Our new paper was accepted at ICML!

1) Momentum isn’t just “SGD but faster”.
It affects sharpness (by orders of magnitude!)

2) The usual story says momentum lets you train in sharper regions.
That’s true for large batches only! The opposite is true for minibatches! https://t.co/CUZhust4cM

3

112

14

76

8K

mikail @Gradientdinner

about 1 month ago

@Besteuler @fujikanaeda

0

1

0

205

Gradientdinner retweeted

Weiyang Liu

@Besteuler

4 months ago

Orthogonal Finetuning (https://t.co/IlBYlgiaae; https://t.co/Mve4Pdptmv) has a unique advantage of preventing catastrophic forgetting. Inspired by this property, we find that merging models within the orthogonal group can effectively reduce model conflicts and preserve both pretraining and downstream knowledge. This is our OrthoMerge framework.

The idea behind OrthoMerge is extremely simple. For OFT-tuned models, we can first map the orthogonal adapters to the Lie algebra with the inverse Cayley transform and then perform merging there. This guarantees the merged model differs from the pretrained model only up to an orthogonal transformation.

Even better news is that OrthoMerge can also be applied to non-OFT-tuned models. By solving the orthogonal Procrustes problem, we obtain the projection of the adapter onto the orthogonal group. OrthoMerge is then applied there, and the residual component can be merged using conventional merging methods. In other words, OrthoMerge can be used together with existing model merging methods!

This is a great example of simple yet effective ideas. Great efforts by my PhD students Sihan Yang and Kexuan Shi. The project is already open-sourced, so feel free to give it a try!

Project: https://t.co/Fzjrn0zpaW
Paper: https://t.co/QvFafN1UeY
Code: https://t.co/LjEzcLZ0De

7

383

57

351

50K
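
The OFT case described in the OrthoMerge retweet above (map orthogonal adapters to the Lie algebra with the inverse Cayley transform, merge there, map back) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the project's code: the adapters are 2x2 rotations, merging is a plain average, and the helper names (`cayley`, `inverse_cayley`, `merge_orthogonal`) are hypothetical. The inverse Cayley transform is only defined when the adapter has no eigenvalue equal to -1.

```python
import math

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def add(X, Y, sign=1.0):
    """Elementwise X + sign * Y."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def solve(M, B):
    """Return M^{-1} B by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("singular matrix")
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def cayley(A):
    """Skew-symmetric A -> orthogonal Q = (I - A)^{-1} (I + A)."""
    I = identity(len(A))
    return solve(add(I, A, -1.0), add(I, A))

def inverse_cayley(Q):
    """Orthogonal Q without eigenvalue -1 -> skew-symmetric A = (Q + I)^{-1} (Q - I)."""
    I = identity(len(Q))
    return solve(add(Q, I), add(Q, I, -1.0))

def merge_orthogonal(Qs):
    """Average the adapters in the Lie algebra, then map back to the group."""
    n = len(Qs[0])
    total = [[0.0] * n for _ in range(n)]
    for Q in Qs:
        total = add(total, inverse_cayley(Q))
    mean = [[v / len(Qs) for v in row] for row in total]
    return cayley(mean)

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def orthogonality_error(Q):
    """Largest entry of |Q^T Q - I|."""
    QtQ = matmul(transpose(Q), Q)
    I = identity(len(Q))
    return max(abs(a - b) for ra, rb in zip(QtQ, I) for a, b in zip(ra, rb))

Q1, Q2 = rotation(0.3), rotation(1.1)   # two toy "adapters"
merged = merge_orthogonal([Q1, Q2])
naive = [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(Q1, Q2)]

print("Lie-algebra merge error:", orthogonality_error(merged))
print("naive average error:   ", orthogonality_error(naive))
```

An average of skew-symmetric matrices is still skew-symmetric, so the merged matrix stays orthogonal up to rounding. Averaging the rotation matrices directly shrinks them and leaves the orthogonal group.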
