ryan mathieu @gapDEEPry - Twitter Profile

Pinned Tweet

3 months ago

Why does Muon beat Adam for training quantized networks? It comes down to what each optimizer treats as "distance" in weight space. Adam treats a weight matrix as a flat vector of numbers. Muon treats it as a linear map and measures change by how much the input-output mapping moved. The gradient G has SVD G = U Σ V^T. Muon's update is just U V^T: keep the directions, throw away the magnitudes.
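
A minimal NumPy sketch of the update described in the tweet above, assuming an exact SVD for clarity. The name `muon_direction` is hypothetical and not taken from any library. Practical Muon implementations are usually described as applying this step to a momentum buffer and approximating U V^T with a Newton-Schulz iteration rather than a full SVD, so treat this as an illustration of the idea, not of any particular codebase.

```python
import numpy as np


def muon_direction(grad: np.ndarray) -> np.ndarray:
    """Return U @ V^T from the thin SVD of `grad`.

    The singular vectors (directions) are kept and the singular
    values (magnitudes) are discarded, as the tweet describes.
    """
    # Thin SVD: grad = U @ diag(S) @ Vt, with U (m x k), Vt (k x n), k = min(m, n).
    u, _singular_values, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((4, 3))  # stand-in for a weight-matrix gradient

    O = muon_direction(G)

    # Every singular value of the update should be close to 1, so the step
    # changes the linear map by the same amount along each singular direction.
    print(np.linalg.svd(O, compute_uv=False))

    # For comparison, the raw gradient's singular values are generally uneven.
    print(np.linalg.svd(G, compute_uv=False))
```

Seen this way, the update has the same left and right singular vectors as G, but its spectrum is flattened to ones, which is one way to read "measures change by how much the input-output mapping moved".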

6

151

26

120

27K

ryan mathieu

@gapDEEPry

1 day ago

@eliebakouch Been waiting on this one

0

379

ryan mathieu

@gapDEEPry

4 days ago

@thsottiaux @ah20im I always get the worst timing with these resets lol

0

1

0

482

ryan mathieu

@gapDEEPry

16 days ago

@CiaraACade I get it but I don’t get it 😅

0

1

0

480

yoyo mind your business

gapDEEPry retweeted

about 1 month ago

Please check out Gradus, a micro-learning app! Gradus turns PhD-level information into digestible content. It imposes a structure on LLM outputs, yielding a friendly curriculum and democratizing education in the same way ChatGPT has. Check it out below! @sama Please notice me 😭 https://t.co/scRLpQ9B0m

3

24

4

2

1K

gapDEEPry retweeted

uuuvn @uuuvn_

about 1 month ago

@__tinygrad__ Llama.cpp is a very weak baseline. LMStudio I assume is mlx? It's better but I get 220 tok/s on Qwen3.5-0.8B-MLX-8bit with https://t.co/s6ovysiKqn on m1 max. The branch and command from that pr gets 145 on the same machine but the output is garbage

3

14

3

8

3K

gapDEEPry retweeted

elie

@eliebakouch

about 1 year ago

why the fuck does every optimization researcher on X have a cat/dog in their profile picture?

12

59

1

8

10K

ryan mathieu

@gapDEEPry

about 1 month ago

@eliebakouch My whole team and I have cat profiles internally as well... it must mean something

0

1

0

13

ryan mathieu

@gapDEEPry

about 1 month ago

@eliebakouch @TheZachMueller Hoping for a release this week, I need a new model to max in Codex

0

1

0

280

gapDEEPry retweeted

eugene @eugenebokhan

about 2 months ago

1/ Recently I've been obsessed with the idea of splitting the matmul computation on separate hardware units due to the tiling nature of the operation. My idea was simple: we have GPU, MXU, ANE, NEON, why are we utilizing only the first one?

1

14

2

3

3K