Alexander Barry

Verified account

@AlexBarry4

Statistician working on understanding AI Capabilities | Epoch AI Substack:

Joined January 2011

26 Following

208 Followers

68 Posts

Pinned Tweet

Alexander Barry

about 2 months ago

Interesting to work on this report with Epoch. We found that AI progress speeds have been accelerating since ~mid 2024 (on 3/4 of the metrics we considered). Treating reasoning models as a trendbreak made the best predictions, but not enough data to be very confident.

@EpochAIResearch

about 2 months ago

Have AI capabilities accelerated? On 3 out of the 4 AI capability metrics we investigated, we found strong evidence of acceleration, around when reasoning models emerged.

https://t.co/YHnsG7UbmP

Alexander Barry

1 day ago

Yeah, the predictions seem pretty crazy here. I expect people were just unaware of the current values, rather than strongly predicting a slowdown. The questions did specify that they were asking directly about METR time horizon, but didn't give the current values or trends.

Alexander Barry

5 days ago

@htihle Very cool! 90% must be getting pretty close to the theoretical max score (iirc about 96%?)

Alexander Barry

7 days ago

My full post: https://t.co/SxQeeN1QY4

Alexander Barry

7 days ago

Continuing with tradition I used Opus 4.8's AECI values to predict its METR time horizon: estimated 50% time horizon of 20.0 hours, 80% time horizon of 2.8 hours. See more details (including why METR's early Mythos Preview results have been misinterpreted) in my post below

https://t.co/u9WlcIb4aL

Alexander Barry

10 days ago

Chance to (indirectly) tell me what to do by telling us what Epoch outputs you find most valuable

@EpochAIResearch

10 days ago

Help us produce the most useful work on AI by taking our 5-minute survey: https://t.co/W2tLu3e4WW (You can sign up at the end to join our compensated user research panel.)

AlexBarry4 retweeted

Elizabeth Barnes

14 days ago

(1) We are likely on track to develop AI systems capable of causing human extinction/permanent disempowerment, quite possibly within the next few years

Alexander Barry

16 days ago

@jrosseruk Haha hopefully even more beautiful thanks to Epoch's great graphic design team

Alexander Barry

16 days ago

@KerryLVaughan @EpochAIResearch Thanks! Hope you are doing well!

Alexander Barry

16 days ago

Excited to announce I am joining @EpochAIResearch as a senior researcher. My remit will include managing the Epoch Capabilities Index, as well as other projects to understand progress trends. If you have any ideas for improvements/extensions to the ECI please reach out!

Alexander Barry

16 days ago

@NunoSempere Great, then yes the ECI is an attempt to create exactly what you describe!

Alexander Barry

16 days ago

I hadn't thought about reproducibility before; it is an interesting idea. Tracking cost is something I'm very interested in looking into in the future (although it isn't totally obvious how to combine it with the current ECI approach), since it is going to become increasingly clear that spending 1000x more will almost always boost benchmark scores by some amount.

Alexander Barry

16 days ago

Was fun to work on this as a first application of the domain-specific ECI. I think this (and other) approaches should expand our ability to understand LLM abilities in a more fine-grained way.

@EpochAIResearch

21 days ago

Claude is typically better at software engineering and worse at math than frontier competitors. Aggregating benchmarks to create our domain-specific ECI, we find the Claude family has an average SWE-ECI 2.7 points higher than their general ECI, and a Math-ECI 1.8 points lower.

https://t.co/J9UrHXTNgq

Alexander Barry

21 days ago

@finmoorhouse Got me searching for a Claude image generation announcement haha

Alexander Barry

22 days ago

@YafahEdelman Or making the final bucket 10+ hours, which does let it pick up lower performance at the cost of a pretty small n:

https://t.co/zPyUD5Tcew

Alexander Barry

22 days ago

@YafahEdelman Here is my take, I found slightly different buckets more natural (although now we get the weird case where 6-36 hour performance is above 1-6 hours, but this is just what the data actually shows!)

https://t.co/4SuKBrDWy3

Alexander Barry

22 days ago

I think (but am not totally sure) that while the METR TH results were based on an early checkpoint of Mythos Preview, the AECI results I used to estimate THs were based on the April 7th launch version. As per AISI's recent updates, the launch version seems notably stronger, so presumably its 80% TH would be higher than the early checkpoint's, but I'm not sure by how much.

Alexander Barry

23 days ago

@YafahEdelman @xeophon Seems correct to me (although I'm not sure Mirrorcode itself was a very big update for me; performance has always varied a lot across different tasks in the time horizon suite, see the messiness stuff in the original paper etc.)

Alexander Barry

23 days ago

@xeophon @YafahEdelman I'd assume the claim is that Mirrorcode tasks are unrepresentative of most real tasks, and that the rest of the TH task suite might similarly be unrepresentative.
