Leechy @LLCMLR - Twitter Profile

LLCMLR retweeted

about 2 months ago

Marin is using quantile balancing from @Jianlin_S (who developed RoPE, which was also a good idea) to train our current 1e23 FLOPs MoE. The idea is elegant: assigning tokens to experts by solving a linear program. No hyperparameters to tune. Yields stable training.

4

331

34

240

84K

Leechy @LLCMLR

6 months ago

Vince Zampella redefined the modern FPS benchmark quality under unfavorable circumstances repeatedly, and is unquestionably a true legend in the entire gaming industry. Rest in peace.

Geoff Keighley

@geoffkeighley

6 months ago

I cannot believe I am writing this. Vince Zampella, a titan of the video game industry, the co-creator of Call of Duty and co-founder of Respawn Entertainment, not to mention a dear friend, died in a car crash yesterday in Los Angeles.

1K

58K

5K

2K

5M

0

257

LLCMLR retweeted

Junda Chen @Junda_Chen_

6 months ago

Finally releasing our work CAD - a disaggregated approach to accelerate long context LLM training!

0

14

1

1K

Leechy @LLCMLR

7 months ago

Every big tech deserves its awakening like this. So dope.

Lucas Beyer (bl16)

@giffmana

7 months ago

Sergey going back into founder mode has had significant positive impact on GDM GenAI projects, because he can (and did) short-circuit all the BigCo bullshit researchers kept bumping into. Now the funny part which I didn't know, is that this was triggered by an OpenAI guy, Dan!

29

2K

58

375

326K

0

116

Leechy @LLCMLR

7 months ago

@youjiaxuan I think that, despite being far from perfect, the way Huggingface tags are used by automation tools and resource maintainers presents a good near-term hack. Objective metrics like model size, dataset size, and topic tags (e.g. pretraining vs RL) can help bucket the expectations better.

0

2

1

0

229

LLCMLR retweeted

Elon Musk

@elonmusk

7 months ago

Diffusion will obviously work on any bitstream. With text, since humans read from first word to last, there is just the question of whether the delay to first sentence for diffusion is worth it. That said, the vast majority of AI workload will be video understanding and generation, so good chance diffusion is the biggest winner overall. Also means that the ratio of compute to memory bandwidth will increase.

128

2K

181

564

583K

LLCMLR retweeted

Rosinality @rosinality

7 months ago

The reverse KL → mode seeking, forward KL → mode covering principle does not necessarily work in LLM RL. Actually, it is possible for both KL regularizations to not promote low support samples.

3

119

16

99

7K

Leechy @LLCMLR

8 months ago

@DimitrisPapail @JustinLin610 @SchmidhuberAI That OG world model paper from David Ha was so cool, coming before foundation models existed. The follow-ups like PlaNet, Dreamer, and SimPLe had a good run.

0

1

0

118

Leechy @LLCMLR

8 months ago

@dejavucoder If the students are claiming that those classes are useless, those students are either too good or too bad. Either way they are in the wrong place. For the less equipped, these and the NYU ones are super insightful if you pay close attention to their message and think deeply.

0

1

0

123

LLCMLR retweeted

Simo Ryu

@cloneofsimo

8 months ago

I'm convinced that people with all the extremely bull take on RL after sutton's interview have never touched data directly, and simply don't know the importance of data. "Data is so very important" and "RL is the only way" is secretly mutually exclusive statement. Remember how many IMO winners google hired to label data to get IMO gold?

24

267

11

82

43K

LLCMLR retweeted

John Schulman

@johnschulman2

8 months ago

Really happy to see people reproducing the result that LoRA rank=1 closely matches full fine-tuning on many RL fine-tuning problems. Here are a couple nice ones: https://t.co/x7hcgNL3Bd https://t.co/5JyKuKd9wS

13

942

86

518

127K

Leechy @LLCMLR

9 months ago

@eliebakouch Orthogonality plays so many important roles in past optimization centered works. Coordinate descent, mean field variational bayes, vanilla gibbs sampling all had similar vibes like this, each motivated slightly differently ofc.

0

20

Leechy @LLCMLR

9 months ago

Iirc many ppl said this was one of the Qwen Coder API’s biggest issues

Aran Komatsuzaki

@arankomatsuzaki

9 months ago

Unfortunate reality: most open-source LLM servers (e.g. Together) don’t offer cache-hit discounts, while closed providers like OpenAI do. DeepSeek does discount, but most third-party servers don't. Closed models can end up much cheaper than open ones :(

24

237

12

58

33K

0

2

0

198

Leechy @LLCMLR

9 months ago

@kalomaze Feels like a better version of forward-forward algorithm in a good way 😁

0

40

LLCMLR retweeted

Frank Nielsen @FrnkNlsn

9 months ago

The variance of any unbiased estimator is greater than or equal to the inverse of the Fisher information, with equality holding iff the parametric distribution is an exponential family. Independently discovered by Fréchet, Darmois, Cramér and Rao https://t.co/PlTXV0z0vL

2

125

13

55

5K

LLCMLR retweeted

Frank Nielsen @FrnkNlsn

10 months ago

Jeffreys centroid minimizes the average symmetrized Kullback-Leibler divergence (SKLD) of a population. Centroid not in closed form... We propose Jeffreys-Fisher-Rao center as a proxy of Jeffreys centroid = Fisher-Rao midpoint of sided KL centroids https://t.co/zHQgrhWbu0

3

103

17

49

5K

LLCMLR retweeted

Shubhendu Trivedi @_onionesque

10 months ago

Looking at the thread. The common frame to look at the more general phenomenon involves an eigenproblem of the form Oƒ = λƒ, where the operator O encodes either: a symmetry (translations, rotations, general group transformations), or a statistic (e.g. covariance, correlation),

1

176

24

171

34K

Leechy @LLCMLR

10 months ago

Wow this is fascinating

leloy!

@leloykun

10 months ago

I've finally solved steepest descent on Finsler-structured (matrix) manifolds more generally. This generalizes work by me, @jxbz, and @Jianlin_S on Muon, Orthogonal Muon, & Stiefel Muon. --- The general solution turned out to be much simpler than I thought. And it should generalize to any combination of (underlying manifold, Finsler norm) and any number of extra constraints on the updates so long as the feasible set for each constraint is convex. --- I now consider this class of problems as sufficiently solved (by my definition of 'solved') and thus I'm moving on to other things I'm interested in.

11

491

51

391

82K

0

1

0

95

LLCMLR retweeted

Frank Nielsen @FrnkNlsn

10 months ago

Some differential-geometric concepts associated to an affine connection ∇ and a metric tensor g Figure from https://t.co/dTYvu2lRIu

2

193

23

121

11K

Leechy @LLCMLR

10 months ago

I kept doing this at every single company I stayed at and it really takes the right kind of audience to truly appreciate its firepower at just right moments. An incredible skill I wish most people would have, surprisingly quite rare in the SV bubble.

a16z @a16z

10 months ago

.@pmarca: "The person who writes down the thing has tremendous power." In most companies, almost no one does it. If you can turn chaos into a coherent plan on paper, people will follow your lead, whether you have the title or not.

109

4K

383

3K

594K

0

89
