How close are current AI agents to automating AI R&D? Our new ML research engineering benchmark (RE-Bench) addresses this question by directly comparing frontier models such as Claude 3.5 Sonnet and o1-preview with 50+ human experts on 7 challenging research engineering tasks.
@FinishItPod After 6 weeks of listening, I've just finished episode 106 (yes, I listened to the first 9 episodes in one day)
Sure hope you still do complis and concris because I want mine in 4 months...
Also, the abominable snowman 1 isn't on Spotify