mikail @gradientdinner - Twitter Profile

Pinned Tweet

over 1 year ago

This paper got into @Nature!!! 🚀🚀🚀 Look at @SarthakChandra’s thread for a summary https://t.co/uOrPM3Stid

almost 3 years ago

🚨New Preprint! Wondered how grid cells form multiple discrete modules? Interested in continuous attractors and modularity? With @FieteGroup, we discover + generalize a physical mechanism for forming modules from smoothly varying parameters in a dynamical system!👇(1/15)

3

169

57

81

29K

4

107

14

25

13K

mikail @Gradientdinner

about 2 hours ago

@SolidlySheafy Do this for linear attention too!

1

3

0

41

mikail @Gradientdinner

1 day ago

@stochasticchasm 💀

0

1

0

13

mikail @Gradientdinner

5 days ago

@yauchungyiu Isn’t this like SNOO @vinaysrao

2

3

0

249

Gradientdinner retweeted

Zhengyang Geng

@ZhengyangGeng

5 days ago

couldn’t agree more my bias since day one: deep learning is absurdly flexible to succeed, if math/physics don’t forbid it, and we get *opt & data* right. it just works Parallax/Muon is one example; models with dynamics such as feedback loops yet another happening rn

2

74

4

51

12K

Gradientdinner retweeted

kalomaze

@kalomaze

6 days ago

the broader implication is that there's abandoned architecture research from before Muon that failed because the empirical optimizers that worked in practice were, both literally and conceptually, stuck in element-wise local minima

9

325

26

152

40K

Gradientdinner retweeted

Yifei Zuo

@YifeiZuoX

7 days ago

For me, the coolest finding is that Muon optimizer is crucial for Parallax to move beyond Softmax Attention. Lesson — don't evaluate new architectures solely under AdamW, you'll miss the good ones. paper: https://t.co/yAqClXrJUz code: https://t.co/D4pgIr1wM7 For the origin of Parallax, check out the LLA paper at ICLR 2026: paper: https://t.co/85OzoOQlnF code: https://t.co/eqMYZ0U6qO

6

351

45

273

77K

mikail @Gradientdinner

19 days ago

@CoreAutoAI A hack to get good init for posttraining? 😆

0

3

0

241

mikail @Gradientdinner

23 days ago

@stochasticchasm Post-neurips arxiving? Coinciding with @kellerjordan0 optimizer track getting momentum?

1

0

144

mikail @Gradientdinner

23 days ago

@tianylin @mikechrzano

0

2

0

221

mikail @Gradientdinner

25 days ago

@_arohan_ @torchcompiled True (but figure 1 shows AdamW can’t be saved by good signal prop init) Pieces of the puzzle have been there before for sure — https://t.co/YOu6CBTe6M

0

1

0

58

Gradientdinner retweeted

Thinking Machines

@thinkymachines

25 days ago

With the model's simultaneous speech capability, Horace has gotten a lot easier to work with recently.

45

1K

61

245

273K

mikail @Gradientdinner

25 days ago

@_arohan_ @torchcompiled How many such architectural choices have been made because everyone AdamW’ed everything by default

1

0

1

64

mikail @Gradientdinner

25 days ago

@_arohan_ @torchcompiled 👀

1

0

174

mikail @Gradientdinner

25 days ago

@torchcompiled 404

1

0

51

mikail @Gradientdinner

28 days ago

@_arohan_ This is from @evaninwords

1

2

0

101

mikail @Gradientdinner

28 days ago

@_arohan_ Randomly re-initialize row/column of weight corresponding to dead neuron 🧠

1

2

0

108

mikail @Gradientdinner

about 1 month ago

@darshil Alfonso

0

1

0

117

Gradientdinner retweeted

Yuchen Jin

@Yuchenj_UW

about 1 month ago

No Neocloud ever imagined they’d be renting out H100s today at higher prices than 3 years ago. Even if you have money, frontier labs and Neolabs have already locked up most of the 2026 GPU supply. There is basically infinite demand for artificial intelligence.

32

428

16

70

54K

Gradientdinner retweeted

Pierfrancesco Beneventano

@PierBeneventano

about 1 month ago

Our new paper was accepted at ICML! 1) Momentum isn’t just “SGD but faster”. It affects sharpness (of orders of magnitude!) 2) The usual story says momentum lets you train in sharper regions. That’s true for large batches only! The opposite is true for minibatches! https://t.co/CUZhust4cM

3

112

14

76

8K

mikail @Gradientdinner

about 1 month ago

@Besteuler @fujikanaeda

0

1

0

205
