David Dohan @dmdohan - Twitter Profile

Pinned Tweet

almost 4 years ago

Happy to release our work on Language Model Cascades. Read on to learn how we can unify existing methods for interacting models (scratchpad/chain of thought, verifiers, tool-use, …) in the language of probabilistic programming. paper: https://t.co/olaE8mATYB

6

676

98

281

0

dmdohan retweeted

OpenAI

@OpenAI

11 months ago

We achieved gold medal-level performance 🥇on the 2025 International Mathematical Olympiad with a general-purpose reasoning LLM! Our model solved world-class math problems—at the level of top human contestants. A major milestone for AI and mathematics.

214

4K

430

420

673K

dmdohan retweeted

Nat McAleese

@__nmca__

11 months ago

I feel this may be helpful to some of you today:

13

712

62

138

89K

David Dohan

@dmdohan

11 months ago

Fun to watch prediction markets update on the news

1

14

0

1

2K

David Dohan

@dmdohan

11 months ago

OpenAI achieved gold medal on 2025 International Math Olympiad (solving 5 of 6 problems)! Thinks for hours and writes proofs in natural language. We've come a long way from LLMs solving 50% of the MATH dataset in 2022. Congrats @alexwei_ on spearheading a major milestone!

Alexander Wei

@alexwei_

11 months ago

1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).

397

7K

1K

2K

6M

1

126

0

8

7K

David Dohan

@dmdohan

about 1 year ago

How to code a side project in 2025:
1. May 31 - Write project spec
2. Procrastinate 6 months
3. Dec 31 - ask favorite AI to implement it

2

64

3

5

4K

dmdohan retweeted

Noam Brown

@polynoamial

over 1 year ago

Scaling pretraining and scaling thinking are two different dimensions of improvement. They are complementary, not in competition.

49

1K

81

124

129K

dmdohan retweeted

Noam Brown

@polynoamial

over 1 year ago

This is on the scale of the Apollo Program and Manhattan Project when measured as a fraction of GDP. This kind of investment only happens when the science is carefully vetted and people believe it will succeed and be completely transformative. I agree it’s the right time.

246

7K

669

1K

918K

David Dohan

@dmdohan

over 1 year ago

@recurrented https://t.co/diHsY5Bf6Y

David Dohan

@dmdohan

almost 5 years ago

@mollyfmielke There's evidence for it: "In all cases, with exception of S9, they report having owned 1-of-3 toys widely sold by Fisher-Price between 1972 and 1989" Anecdotally, friend traces some # colors to license plate on family car. https://t.co/OiyyXKhdkJ study: https://t.co/aPjthgDW8R

1

11

0

3

0

5

0

202

David Dohan

@dmdohan

over 1 year ago

@SteveMoraco same no more joking on the internet allowed

1

0

229

dmdohan retweeted

roon

@tszzl

over 1 year ago

🚨SCANDAL 🚨 OpenAI trained on the train set for the Millennium Puzzles

83

2K

27

90

143K

dmdohan retweeted

Steven Heidel

@stevenheidel

over 1 year ago

these new captchas are getting way too difficult

28

2K

95

78

77K

dmdohan retweeted

roon

@tszzl

over 1 year ago

o3 has literally made 0% progress on the Millennium eval it’s ai winter now

56

2K

28

148

192K

David Dohan

@dmdohan

over 1 year ago

@cHHillee @polynoamial @tamaybes Gotta look for the NP problems of P vs NP: easy to check, hard to do. Not sure what these look like in math outside formal theorem proving

0

5

0

316

David Dohan

@dmdohan

over 1 year ago

@polynoamial @tamaybes iiuc one of the constraints with FrontierMath is that the results are easy to check. Unless we do it with formal theorem proving, I’m not sure how to do that for unsolved problems. Though maybe one tier should be unsolved hard to check ones too

5

15

0

1

3K

dmdohan retweeted

Liam Fedus

@LiamFedus

over 1 year ago

I have yet to find a well-defined task that cannot be optimized by these models. Eval improvements like ARC-AGI showcase this dynamic

6

113

8

20

32K

David Dohan

@dmdohan

over 1 year ago

still a ways to go on FrontierMath!

Nat McAleese

@__nmca__

over 1 year ago

Lots of folks are posting quotes from Gowers/Tao about the hardest split of FrontierMath, but our 25% score is on the full set (which is also extremely hard, with old sota 2%, but not as hard as those quotes imply).

8

497

31

92

175K

1

14

0

3

3K

David Dohan

@dmdohan

over 1 year ago

@_xjdr @btc4me2 tbc it's a joke - literally meant it had been 16 hours since previous post & the o1->o3 jump is 32->87% https://t.co/dvMlgd5shm

David Dohan

@dmdohan

over 1 year ago

At this rate, how long til ARC-AGI is “solved”? For context:
- gpt-4o @ 5%
- Sonnet3.5 @ 14%
- o1-preview @ 18%
- o1 @ 32%
- best scaffolded solution @ 54%

27

404

12

90

235K

2

18

0

1

2K

David Dohan

@dmdohan

over 1 year ago

Caveat on the Tao quote: that refers to the hardest "research" split of the dataset, while the 25% is across the entire dataset. https://t.co/hYZU5bDeZo

Jaime Sevilla

@Jsevillamol

over 1 year ago

@GarrisonLovely To clear a possible misunderstanding: the quotes refer to questions in the highest tier of difficulty of FrontierMath. Not every question in the benchmark is as difficult as the ones Tao and Gowers reviewed.

3

89

2

8

14K

2

46

2

6

7K

David Dohan

@dmdohan

over 1 year ago

imo the improvements on FrontierMath are even more impressive than ARC-AGI. Jump from 2% to 25%. Terence Tao said the dataset should "resist AIs for several years at least" and "These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…"

Nat McAleese

@__nmca__

over 1 year ago

Well, on FrontierMath 2024-11-26 o3 improves the state of the art from 2% to 25% accuracy. These are absurdly hard strongly held out math questions. And on ARC, the semi-private test set and public validation set scores are 87.5% (private) and 91.5% (public). (7/n)

6

349

22

38

223K

20

878

72

216

153K

David Dohan

@dmdohan

over 1 year ago

FrontierMath details: https://t.co/FAh1ZnYJO9

1

35

0

4

8K

David Dohan

@dmdohan

over 1 year ago

@charles_irl https://t.co/oz90I13vWf

1

2

0

1

670

David Dohan

@dmdohan

over 1 year ago

We are used to the cadence of big model releases: GPT2->3->4 took two years each time
We’re in a different world now
o1 was announced months ago, now already on next generation
Expect faster improvement going forward: o1 is like gpt2 if we could jump to gpt4 ~immediately

12

189

22

36

48K
