David Hall @dlwh - Twitter Profile

dlwh retweeted

@wen_kaiyue

about 5 hours ago

Quoting @dlwh: we are at risk of losing the reputation of spiky loss runs! This run incorporates some stability techniques from my past projects: Hyperball, Gated Norm, and Gated Attention. Excited to see the next run from Marin! https://t.co/LJ0jSyOG2O

4

72

6

34

6K

dlwh retweeted

Larry Dial

@classiclarryd

about 6 hours ago

Building momentum at Marin! Upgrading from Dense -> 129B parameter MoEs -> architecture improvements -> optimizer improvements gives our pretraining recipe an estimated 6x cumulative learning speedup, accounting for MFU. Includes community contributions. https://t.co/5dPB9uBiSp

1

95

12

44

15K

David Hall @dlwh

7 days ago

@WilliamBarrHeld @ZhengxuanZenWu @jiaxinwen22 Might be fun to run the whole suite through to see what the isoflop shapes are

1

0

64

dlwh retweeted

Percy Liang

@percyliang

10 days ago

Not only do we want to train a good model, we want to know it'll be good before we even start training. About a month ago, the Marin team launched a 129B (16B active) 1e23 FLOPs MoE run and preregistered a loss of 2.252. The run finished this past week and landed at 2.234. https://t.co/OptaVa7jIO
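
As a quick check on the figures quoted in this post, the gap between the preregistered and final loss can be computed directly. This is a minimal sketch: the choice to express the error relative to the observed loss is an assumption (the post does not say how the team scores forecasts), and the function name is hypothetical.

```python
# Figures quoted in the post above.
predicted_loss = 2.252  # preregistered forecast
observed_loss = 2.234   # loss the run landed at


def relative_error(predicted: float, observed: float) -> float:
    """Signed forecast error as a fraction of the observed value.

    Normalizing by the observed loss is an assumption made for illustration.
    """
    return (predicted - observed) / observed


abs_gap = predicted_loss - observed_loss
rel_gap = relative_error(predicted_loss, observed_loss)

print(f"absolute gap: {abs_gap:.3f}")
print(f"relative gap: {rel_gap:.2%}")
```

Under that convention the forecast was high by about 0.018 in loss, or a little under 1% of the final value.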

25

613

67

269

61K

dlwh retweeted

Percy Liang

@percyliang

21 days ago

For the next Marin model, we are putting together a new data mix. Currently we have 18T tokens, but could use more. So if you are sitting on some secret stash of high quality tokens, please let us know! Pre-training, mid-training, SFT data all welcome. https://t.co/49DBdzvYXE

16

262

20

71

23K

dlwh retweeted

Kevin Li

@kevin_x_li

21 days ago

Introducing SWE-ZERO-12M-trajectories: the largest agentic trace dataset in the open, 5.7x larger than the previous largest. 112B tokens · 12M trajectories · 122K PRs · 3K repos · 16 languages https://t.co/aVqCc4J5tr

19

523

67

399

79K

David Hall @dlwh

23 days ago

Also, Will is underselling the blog post. The interactive figures are excellent: they make the scaling intuition concrete, including what transfers and what breaks. Worth reading https://t.co/Dh5EajzrDr

0

185

David Hall @dlwh

23 days ago

Marin’s Delphi scaling suite is out! With the right scaling recipe, small runs predicted a 1e23 FLOP run within 0.2%, extrapolating 300× past the largest run in the fit.

Will Held @WilliamBarrHeld

23 days ago

To train better open models, we need predictable scaling. Delphi is Marin’s first step: we pretrained many small models with one recipe, then extrapolated 300× to predict a 25B-param / 600B-token run with just 0.2% error. Getting there took some work 🧵
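
The numbers in this thread can be sanity-checked with the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. That rule is an assumption brought in here for illustration, not something the posts state, and reading "300×" as a ratio of training compute is likewise an assumption. The function name is hypothetical.

```python
def approx_train_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token.

    This is an approximation for dense models; it ignores attention cost
    and any hardware utilization effects.
    """
    return 6.0 * params * tokens


# Figures quoted in the thread.
params = 25e9    # ~25B parameters
tokens = 600e9   # ~600B tokens
extrapolation_factor = 300

target_flops = approx_train_flops(params, tokens)

# If 300x refers to compute, this is the implied size of the largest run
# used to fit the scaling law.
largest_fit_flops = target_flops / extrapolation_factor

print(f"target run:      {target_flops:.1e} FLOPs")
print(f"largest fit run: {largest_fit_flops:.1e} FLOPs")
```

The estimate comes to 9e22 FLOPs, consistent with the "1e23" figure quoted for the run, and would put the largest run in the fit at roughly 3e20 FLOPs under the stated assumptions.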

14

456

78

326

137K

1

24

3

6

3K

David Hall @dlwh

23 days ago

Delphi changes how we evaluate new ideas: start small, sweep the param/token tradeoff, scale the key hypers, compare against forecasts, repeat.

1

0

213

David Hall @dlwh

28 days ago

a year later, and codex/claude are in fact the only reliable way to get nccl working afaict

Simo Ryu

@cloneofsimo

about 1 year ago

> be jensen
> ships cuda and nccl
> doesnt work, takes AGI to reinstall cuda driver
> "wish you pain and suffering"

1

37

2

4

7K

1

19

1

2

4K

David Hall @dlwh

about 1 month ago

Erfan continuing to do amazing things. Pipeline parallelism is historically pretty ugly in JAX but he made it so nice

Erfanzar

@eraznafre

about 1 month ago

Releasing SpectraX, a JAX-native neural-network library built around true MPMD pipeline parallelism. Each physical rank compiles and runs its own XLA program: no shared shard_map HLO, no SPMD-same-shape constraint. Heterogeneous stages (e.g., embed → blocks → head), nine pipeline schedules (GPipe, 1F1B, ZeroBubble, Interleaved, DualPipeV, …), and a unified https://t.co/vYOljO1K4k()/spx.jit() entry point that dispatches to SPMD or MPMD from the same training script. https://t.co/GWPCsQVUwI

6

160

18

126

37K

0

30

3

12

8K

dlwh retweeted

Tim Dettmers

@Tim_Dettmers

about 2 months ago

So cool to see that open-source, with open experimentation (and with the help of someone posting blog posts about their personal research), can yield a very robust method for MoE balancing. This method seems more elegant than all other methods I have seen. Open source is Awesome!

3

80

9

32

19K

David Hall @dlwh

about 2 months ago

Super cool work from Larry for Marin's MoE work! Quantile Balancing seemed to basically Just Work.

Larry Dial

@classiclarryd

about 2 months ago

Researchers' brilliant ideas often get lost in the sea of endless SOTA claims on weak baselines. At Marin we battle-test ideas in an open arena, where anyone's idea can be promoted to the next hero run. One that recently rose up was @Jianlin_S MoE Quantile Balancing, used in our last 1e22 and ongoing 130B run. Animated visuals of how QB performed are available in the OpenAthena blog. https://t.co/BDSsonuNH7

9

240

30

144

80K

0

13

0

2

2K

David Hall @dlwh

2 months ago

Lots more to do! MoEs are next, and we think the curve bending and loss spikes at the higher scales may be related (even though the lower scale runs were spike free)

0

4

0

265

David Hall @dlwh

2 months ago

Marin Delphi is finished! Will et al produced a stable scaling formula that yielded results almost exactly in line with predictions, at 1% the flop budget (and we suspect we can do somewhat less now)

Will Held @WilliamBarrHeld

2 months ago

How far do Marin's scaling laws extrapolate? At least 100x, apparently! Despite spooky spikes, our 1e23 Delphi finished on forecast. The compute-optimal ladder costs ~1e21 FLOPs to train. Good scaling science lets you “run” this (not tiny) experiment at 1/100th the cost. https://t.co/nRJma4sunw

3

148

19

67

53K

1

15

0

7

2K

David Hall @dlwh

2 months ago

@PrismML congrats!

2

5

1

0

982

David Hall @dlwh

2 months ago

@AhmedSQRD @cjmaddison @marikgoldstein I’ve given up. Codex and Claude don’t care. Beautiful idea for a different world.

1

2

0

48

David Hall @dlwh

2 months ago

@MatharyCharles this one has qk-norm and adamh (hyperball constraint on weight norms). I think the LR is just too hot (and some data issues). no z-loss I think

1

0

69

David Hall @dlwh

2 months ago

Spikes again, but this time we can't intervene for Science. Despite that, seems to be ~on track!

Will Held @WilliamBarrHeld

2 months ago

Our 1e23 "Delphi" (~25B param model trained for ~600B tokens) run for Marin has entered its learning rate decay phase. Lots of spikes at this scale, very scary! Despite that, the run is looking on track to be close to our pre-registered scaling laws predictions. Stay tuned... https://t.co/59abhVctu9

6

117

11

67

47K

2

13

0

2

2K

David Hall @dlwh

2 months ago

@WilliamBarrHeld has pointed out that we started hitting spikes with the 1e23 at around the same loss value as where 1e22 saw some spikes. Probably the WSD LR stays too hot too long.
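
For readers unfamiliar with the term, WSD is a warmup-stable-decay learning rate schedule: the rate ramps up, holds at its peak for most of training, then decays at the end. The sketch below is illustrative only. The linear warmup, linear decay to zero, and every number in it are assumptions, not Marin's actual configuration, and the function name is hypothetical.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_steps: int, decay_steps: int) -> float:
    """Warmup-stable-decay schedule with linear warmup and linear decay.

    The shapes of the warmup and decay phases are assumptions chosen for
    simplicity; other decay shapes are also used in practice.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: hold the peak rate.
        return peak_lr
    # Decay: fall linearly from the peak rate to zero at total_steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps


# Hypothetical settings: 1000 steps, 100 warmup, 200 decay.
schedule = [wsd_lr(s, 1000, 3e-4, 100, 200) for s in range(1001)]
steps_at_peak = sum(1 for lr in schedule if lr == 3e-4)
print(f"steps at peak LR: {steps_at_peak}")
```

In this picture, "stays too hot too long" corresponds to a long stable phase at the peak rate. Raising `decay_steps` or lowering `peak_lr` are two ways to reduce that exposure.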

0

1

0

178
