JS Denain

5 months ago

I rarely use this account. If you want to read more of my thoughts you can follow me at @datagenproc!

0

119

js_denain retweeted

6 months ago

Conventional wisdom says that the US can’t build power but China can, so China’s going to “win the AGI race by default”. We think this is wrong. The US likely can build enough power to support AI scaling through 2030 — as long as they’re willing to spend a lot. A thread: https://t.co/rjgd7f0wP7

2

182

42

92

56K

js_denain retweeted

9 months ago

Should AI regulations be based on training compute? As training pipelines become more complex, they could undermine compute-based AI policies. In a new piece with Google DeepMind’s AI Policy Perspectives team, we explain why. 🧵 https://t.co/xIbHSXT053

8

63

11

26

8K

11 months ago

Very excited about this analysis by @GregHBurnham!

11 months ago

xAI commissioned us to analyze Grok 4’s math capabilities. Our findings:
+ It’s good at involved computations, improving at proofs (from a low base), and useful for literature search.
- It favors low-level grinds and leans on background knowledge.
Read on for examples! https://t.co/8KKf0Sypw3

21

478

39

114

190K

0

3

1

0

640

js_denain retweeted

11 months ago

How fast has society been adopting AI? Back in 2022, ChatGPT arguably became the fastest-growing consumer app ever, hitting 100M users in just 2 months. But the field of AI has transformed since then, and it’s time to take a new look at the numbers. 🧵 https://t.co/EbsDqW1Is7

2

251

50

84

28K

js_denain retweeted

11 months ago

We are still hiring for an Engineering Lead on our Benchmarking team! We need a software engineer with outstanding technical expertise (no AI experience necessary) who's excited about leading evaluations on frontier AI models. https://t.co/3b4YHZ00d3

1

13

1

2

3K

js_denain retweeted

11 months ago

Introducing FrontierMath Tier 4: a benchmark of extremely challenging research-level math problems, designed to test the limits of AI’s reasoning capabilities. https://t.co/ImY6hFDLGQ

17

561

62

114

82K

js_denain retweeted

11 months ago

Running SWE-bench evals is very slow and difficult. To solve this, we created a registry of optimized Docker images that let us run SWE-bench Verified in just one hour on a single 32-core machine. Today, we are open-sourcing these images— anyone can `docker pull` them. https://t.co/DIZJmIHpfb

3

202

13

50

11K

Tom Adamczewski @tmkadamcz

11 months ago

@tmkadamcz @EpochAIResearch Tagging @github

0

56

js_denain retweeted

11 months ago

The GitHub API doesn't seem to support changing the visibility of an image on the Container Registry. This is a huge problem for me as I have 4,219 images I need to make public for an @EpochAIResearch project :( Anyone at GitHub who could help with this?

2

5

2

0

886

js_denain retweeted

12 months ago

SWE-bench Verified is one of the main benchmarks to assess AI coding skills. But what does it actually measure? We found that it's one of the best tests of AI coding, but limited by its focus on simple bug fixes in familiar repositories. Here’s a summary of our article 🧵 https://t.co/mwXYuK7pZN

6

406

31

199

419K

js_denain retweeted

about 1 year ago

Three years and 100+ projects in, our mission is the same: give everyone clear, trusted insight into where AI is headed. Our new post unpacks the principles behind every research choice—why we take some ideas on and pass on others. https://t.co/2bkq9rkN1s

0

44

7

4

3K

about 1 year ago

@tomekkorbak My logs are kind of confusing to read though

0

2

0

48

about 1 year ago

@tomekkorbak Nice! Here are a few transcripts I generated yesterday (no scoring), showing similar things. https://t.co/0Xu2MX5tZp

2

6

0

233

about 1 year ago

@TeksEdge For more information: https://t.co/VHOi2Be0rk https://t.co/mXkDBfYhms

0

38

about 1 year ago

@TeksEdge Hi! The problems are all of the same form: you're given an image with ramps and buckets like this one, you have to predict which bucket the ball will fall into. https://t.co/XmuFbGkjJF

2

1

0

63

js_denain retweeted

about 1 year ago

We’re hiring an Engineering Lead to help guide our Benchmarking team! Provide independent evaluations of today’s and tomorrow’s AI models, leading to better research, policy, and decision-making. The role is fully remote, and applications are rolling. https://t.co/J30TJACBbz

15

171

25

30

346K

about 1 year ago

@scaling01
- OpenCompass
- HHEM
- Galileo Agent
- XLANG Computer Agent Arena
I haven't looked into them in detail yet, but will H/T you if we add some of them to the hub 🙂

0

1

0

45