Joe Davison @joeddav - Twitter Profile

joeddav retweeted

1 day ago

Recently met @srush_nlp and he started giving me an impromptu lecture on how targeted on-policy self-distillation works. I asked him if I could record it on my iPhone. The basic idea is this: if the model made a mistake at some point in the rollout (for example, calling a tool that doesn't exist), we want to discourage this specific error, but we don't want to just learn from the final reward, because it's a very noisy signal spread out over the whole trajectory. So we have another model read this trajectory and figure out where the error was made. It simply inserts some hint tokens into the trajectory right above where the mistake was made. Now, with these injected hint tokens, have the model run a forward pass. You don't have to regenerate a new rollout, aka no new decode required. The hint causes the model to assign lower probabilities to the error tokens. You then train the original model to match these new probabilities, teaching it to downweight that specific mistake.
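
The last step of the tweet above (train the original model to match the hinted probabilities) can be sketched roughly as follows. This is a toy illustration only, not the method from the lecture: the helper names (`softmax`, `kl_divergence`, `distill_loss`), the choice of KL divergence as the matching loss, and the made-up logits are all assumptions. It takes per-token logits from a forward pass with the hint inserted (treated as the teacher) and without it (the student), and computes a loss that would pull the student toward the hinted distribution.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(hinted_logits, original_logits):
    """Mean per-position KL(teacher || student) over the rollout tokens.

    hinted_logits:   logits from the forward pass WITH hint tokens (teacher)
    original_logits: logits from the forward pass WITHOUT the hint (student)
    Both are lists of per-position logit vectors over the same rollout tokens.
    """
    losses = []
    for t_logits, s_logits in zip(hinted_logits, original_logits):
        teacher = softmax(t_logits)
        student = softmax(s_logits)
        losses.append(kl_divergence(teacher, student))
    return sum(losses) / len(losses)

# Hypothetical one-position rollout over a 3-token vocabulary.
# Token 0 stands in for the "error" token (e.g. a nonexistent tool name).
original_logits = [[2.0, 0.5, 0.1]]  # the model favors the error token
hinted_logits = [[0.2, 1.5, 0.1]]    # with the hint, the error token is downweighted

loss = distill_loss(hinted_logits, original_logits)
```

In a real setup the student logits would come from the trainable model and the loss would be backpropagated through them, while the hinted pass would be treated as a fixed target. Those details are assumptions here as well.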

36

2K

157

3K

331K

joeddav retweeted

Fei-Fei Li

@drfeifei

1 day ago

https://t.co/Kt50ttQRMJ

132

4K

787

5K

695K

Joe Davison

@joeddav

2 days ago

@julien_c people actually use screen? 👀

0

292

Joe Davison

@joeddav

2 days ago

@AradhyeAgarwal Very cool, congrats!

1

2

0

841

joeddav retweeted

Niels Rogge @NielsRogge

3 days ago

What is mid-training? The stage between pre-training and post-training. A base model is continued on a smaller, curated data mixture chosen to strengthen capabilities that the original pre-training run undercovered, such as multilinguality, domain knowledge, or long-context extension. It usually keeps a pre-training-like objective, but uses higher-quality or more targeted data so later instruction tuning, preference tuning, or RL can shape behavior on top of stronger capabilities. Learn more here: https://t.co/WhpYkyGlv8

6

446

55

441

32K

Joe Davison

@joeddav

3 days ago

@soldni Dang what a flex

0

1

0

303

Joe Davison

@joeddav

3 days ago

@natolambert @allen_ai Rough day for Ai2 I’m sure, but excited to see your next steps!

0

410

joeddav retweeted

will brown

@willccbb

6 days ago

@martin_casado there are two large, capable, and well-resourced entities with clear strategic interests in ensuring open models keep up: China and Nvidia. preventing distillation and capturing market share are in tension. it'll be hard to distill GPT-7-BioChem, easy to distill Default Claude.

11

311

10

50

14K

Joe Davison

@joeddav

5 days ago

@willccbb Muddles the narrative if you’re trying to prove your recipe scales tho

0

3

0

611

Joe Davison

@joeddav

6 days ago

@francoisfleuret

0

1

0

340

Joe Davison

@joeddav

6 days ago

@Miles_Brundage Four months actually seems like… a decent compromise between open and proprietary? Big labs get to keep saying they have the best models at any given time, but four months is a small enough gap that it’s still very reasonable to keep investing in open weights as well

0

1

0

318

Joe Davison

@joeddav

6 days ago

@Miles_Brundage You mean this?

0

61

0

7K

Joe Davison

@joeddav

7 days ago

@lateinteraction The real cv is the friends we make along the way

0

1

0

426

Joe Davison

@joeddav

8 days ago

@willccbb @Jason Yay @arcee_ai

0

3

0

465

joeddav retweeted

will depue

@willdepue

8 days ago

wow. seems like a big deal. happy birthday gpt-3

9

370

16

41

44K

Joe Davison

@joeddav

8 days ago

@yacineMTB i would never use a tensor framework bro, cuda is all you need bro, do you even write kernels bro

0

1K

Joe Davison

@joeddav

8 days ago

yeah no it’s good work to be sure, but I want to think soberly about the difference between optimizing a specific training pipeline vs. an order-of-magnitude performance leap in general-purpose tooling. the former is good engineering work (that I assume all the big labs are doing to some extent), the latter would be a breakthrough… but I don’t think this is the latter

1

0

27

Joe Davison

@joeddav

8 days ago

The implied win here is “our C framework is 10x faster than JAX.” Surely the actual win is good kernel work optimized for their specific cluster configuration? Perf folks: what would JAX/XLA fundamentally fail to express here?

Elon Musk

@elonmusk

8 days ago

SpaceX has almost finished writing V1.0 of an in-house AI training stack in C that exact-maps to 220k GB300s with 800G NICs, making heavy use of pipeline parallelism and getting as close to bare metal as possible. The potential speed improvement vs JAX for large training runs is over an order of magnitude.

7K

98K

11K

7K

30M

1

0

354

Joe Davison

@joeddav

8 days ago

@maharshii Is it really a general framework tho? or just a specific training pipeline with good kernels optimized for a specific cluster configuration?

0

1

0

888

joeddav retweeted