Mohammed Sabry @m___sabry - Twitter Profile

about 1 month ago

There is no pre-training, post-training, or test-time training. There are only priors, updates, constraints, and compute budgets. There is only TRAINING. For the last several years, we have shipped the org chart into fundamental optimization science.

22

538

34

125

68K

Mohammed Sabry @M___Sabry

about 2 months ago

This feels like a new tension for the field: which capabilities should stay internal to the model, given that capacity is expensive, and which should be delegated to external modules? The older version of this tension was about inductive bias, and it helped lead us toward Transformers with less built-in structural bias. I want to add two points:
* We were developing this thread of work as my PhD topic under the terms external modularity (modules outside the model that collaborate with the model) and internal modularity (circuits responsible for built-in capabilities).
* Most people think of external modularity mostly in terms of tools (programmed modules), but it can involve neural modules too. A simple example is PEFT modules. Hence one of my first works was developing a module card so they could be included in such discussions: PEFT-Ref, https://t.co/rpn2aLxXIq

Turing Post

@TheTuringPost

about 2 months ago

https://t.co/Uz6vFz0bVN

0

9

1

3

1K

0

117

Mohammed Sabry @M___Sabry

2 months ago

I started this around two months ago, and I’d love to collaborate with the community on pushing it further. My broader view is that, if we want more robust synthetic data pipelines, we should rely more on explicit rubrics and correctness-by-construction checks. That is the motivation behind FORTGEN: a coverage-driven synthetic benchmark for Fortran modernization with executable semantic oracles. GitHub: https://t.co/Uddc6q61aL Slides: https://t.co/i3pWpjcAOQ

0

128

Mohammed Sabry @M___Sabry

3 months ago

@AliesTaha @gaoj0017 that depends on whether the methods are inherently hardware-specific. If TurboQuant is designed to be GPU-native then that's part of its value...and if the paper doesn't attribute the gains to the algo alone then there's no problem with this comparison either.

0

1

0

1K

Mohammed Sabry @M___Sabry

3 months ago

imo, it doesn’t seem like a big problem: the paper shows that larger D helps and makes the bottleneck less harmful. Also, LLMs still train very well in practice, which may suggest that D is already large enough for language data (maybe because language is very low rank, so the effect isn’t as destructive as the norm numbers make it sound). But first I would like to see proxies for gradient information loss other than the norm evaluated...the norm does not necessarily correspond to useful signal.

0

1

0

963

Mohammed Sabry @M___Sabry

3 months ago

@elhllos Pretraining is only one part; most of the GPUs/TPUs (compute centers) of frontier labs are for inference workloads.

0

1

0

1K

M___Sabry retweeted

Andrew Gordon Wilson

@andrewgwils

3 months ago

In an era where many frontier labs are converging towards conservative incremental approaches, it's heartening to see resources being directed into ambitious efforts and fresh ideas!

0

141

6

9

13K

M___Sabry retweeted

Wonder of Science

@wonderofscience

3 months ago

How giraffes drink.

33

1K

116

74

125K

Mohammed Sabry @M___Sabry

6 months ago

The problem isn’t treating scaling curves like physics; it’s declaring regime changes without confidence intervals on the fit...

0

79

M___Sabry retweeted

Dinghuai Zhang 张鼎怀

@zdhnarsil

6 months ago

Continual learning has always been important, but people have not found effective ways to do it. Tweaking loss objectives brought tons of ML papers a few years back but cannot really achieve infinitely long context. Maybe we need more architecture innovation, like efficient attention and learnable memory.

7

161

12

78

20K

M___Sabry retweeted

signüll

@signulll

6 months ago

most ppl don’t really think. they rearrange cached thoughts until they feel smart. true thinking is rare because it’s metabolically expensive (just like thinking models are computationally expensive).

222

7K

585

2K

347K

M___Sabry retweeted

Andrew Gordon Wilson

@andrewgwils

6 months ago

Continual learning as a discipline seems to have catastrophic forgetting that it has been focused on catastrophic forgetting for a decade with virtually no progress. Time for some radically new ideas in that area.

25

361

23

103

29K

M___Sabry retweeted

Damek

@damekdavis

6 months ago

New paper studies when spectral gradient methods (e.g., Muon) help in deep learning: 1. We identify a pervasive form of ill-conditioning in DL: post-activations matrices are low-stable rank. 2. We then explain why spectral methods can perform well despite this. Long thread

https://t.co/xEcpPvr32n

11

336

67

319

99K

M___Sabry retweeted

Yann LeCun

@ylecun

7 months ago

@giffmana Something something cake something something cherry.

12

257

2

28

30K

Mohammed Sabry @M___Sabry

7 months ago

Regardless of what direction the field takes, cutting-edge deep learning research will continue to require huge compute. The work is still fundamentally empirical: every promising idea demands careful ablations, stress-tests, and re-runs — even if experience, intuition, and taste help prune away many of the unnecessary experiments.

0

50

Mohammed Sabry @M___Sabry

7 months ago

Yes... big labs love scaling laws because they offer predictability, which makes strategic and financial planning much easier. Once an org builds entire structures around the “scaling stack” (pre-training teams, post-training teams, eval teams, infra teams), it becomes harder to pivot toward radically different paradigms unless there’s a significant external shock.

Dwarkesh Patel

@dwarkesh_sp

7 months ago

The @ilyasut episode
0:00:00 – Explaining model jaggedness
0:09:39 – Emotions and value functions
0:18:49 – What are we scaling?
0:25:13 – Why humans generalize better than models
0:35:45 – Straight-shotting superintelligence
0:46:47 – SSI’s model will learn from deployment
0:55:07 – Alignment
1:18:13 – “We are squarely an age of research company”
1:29:23 – Self-play and multi-agent
1:32:42 – Research taste
Look up Dwarkesh Podcast on YouTube, Apple Podcasts, or Spotify. Enjoy!

403

9K

1K

8K

4M

1

0

85

M___Sabry retweeted

bidhan

@bidhan

7 months ago

anyone who is convinced that scaling era is over and want to sell me their gpus, pls dm

0

9

1

0

690

Mohammed Sabry @M___Sabry

7 months ago

"just improve information gain per FLOP"

Jack Morris

@jxmnop

7 months ago

some hypotheses for what “better pretraining” could mean
- integration with other training stages: i’m guessing they’re finally at a point where post-training perf (eg SWE-Bench) can be used as signal for pretraining eng decisions
- filtering: scaling approaches like influence functions for getting rid of datapoints that don’t help eval perf
- synthetic data: using rephrasing to upsample certain useful documents and make them more amenable to reasoning
- mixing: more principled & scalable approaches for determining mixing coefficients
- new data: purchasing and scanning more books, transcribing YouTube, buying private token collections like news articles
- smart packing: there are various ways to group documents into batches that work better, especially for long-context stuff
- systems: more data, more flops

14

308

18

201

69K

0

75

Mohammed Sabry @M___Sabry

7 months ago

Rage-baiting activated.... 🫢

0

66

Mohammed Sabry @M___Sabry

7 months ago

And it should be climbed during the training of a giant NN

François Chollet

@fchollet

7 months ago

The ladder of intelligence is the ladder of abstraction.
L1: Memorizing answers (no generalization)
L2: Interpolative retrieval of answers, pattern matching, memorizing answer-generating rules (local generalization)
L3: Synthesizing causal rules on the fly (strong generalization)
L4: Discovering general principles, metacognition (extreme generalization)
To achieve compounding AI you need to reach L4.

112

3K

357

2K

199K

1

0

96
