@mattshumer_ Yeah looks awesome - any idea how they calculated the $0.19-$0.49 per million tokens? They say it's based on a $2/hour H100 cost and a serve rate of 0.03 ms/token, I think?
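For reference, here is the naive arithmetic using only the two figures recalled above ($2/hour and 0.03 ms/token, both hedged with "I think"). It assumes a single GPU emitting tokens back to back at that rate with no idle time; the function name is made up for illustration. Under those assumptions the result is about $0.017 per million tokens, which is well below the quoted range, so the original calculation presumably relies on inputs not mentioned here.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float, ms_per_token: float) -> float:
    """Naive serving cost: one GPU, fully utilized, one token every ms_per_token."""
    ms_per_hour = 3_600_000  # 60 * 60 * 1000
    tokens_per_hour = ms_per_hour / ms_per_token
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


# Figures as recalled in the post above (assumptions, not confirmed values).
naive = cost_per_million_tokens(gpu_dollars_per_hour=2.0, ms_per_token=0.03)
print(f"${naive:.4f} per million tokens")
```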
@Thom_Wolf It's a reference to the fact that an ensemble of all submissions would have scored 81% on the private test set (i.e. 19% of tasks were not solved by any submission) https://t.co/pwdwjJFn2P
Does this mean the ARC-AGI benchmark has saturated?
Yes -- the v1 version of the benchmark is starting to saturate. There were already signs of this in the Kaggle competition this year -- an ensemble of all submissions would score 81%.
The competition next year will run on ARC-AGI-2, an updated version of the dataset that keeps the same format as v1, but features fewer tasks that can be easily brute-forced.
Early indications are that ARC-AGI-2 will represent a complete reset of the state-of-the-art, and it will remain extremely difficult for o3. Meanwhile, a smart human or a small panel of average humans would still be able to score >95%.
@fchollet Out of interest, what % of ARC test set puzzles remain unsolved by any submitted solution? And what would the top 2 entries score if ensembled (I know this means they'd have 4 attempts)? Just curious how much they overlap.
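A minimal sketch of how the overlap question could be answered, assuming you had the set of task IDs each submission solved. The team names and task IDs below are invented purely for illustration, not real competition data. An ensemble counts a task as solved if any member solved it, which is why ensembling two entries effectively gives four attempts per task.

```python
# Hypothetical data: which task IDs each submission solved (not real results).
solved_by = {
    "team_a": {1, 2, 3, 4, 5},
    "team_b": {3, 4, 5, 6},
    "team_c": {1, 6, 7},
}
all_tasks = set(range(1, 11))  # 10 tasks in this toy test set


def ensemble_score(names, solved_by, all_tasks):
    """Fraction of tasks solved by at least one of the named submissions."""
    covered = set().union(*(solved_by[n] for n in names))
    return len(covered & all_tasks) / len(all_tasks)


# Ensemble of every submission, and the share no one solved.
everyone = ensemble_score(solved_by, solved_by, all_tasks)
unsolved_by_any = 1 - everyone

# Ensemble of the two best individual entries, plus how many tasks both solved.
top2 = sorted(solved_by, key=lambda n: len(solved_by[n]), reverse=True)[:2]
top2_score = ensemble_score(top2, solved_by, all_tasks)
top2_overlap = len(solved_by[top2[0]] & solved_by[top2[1]])

print(f"all: {everyone:.0%}, unsolved by any: {unsolved_by_any:.0%}")
print(f"top 2 {top2}: {top2_score:.0%}, tasks solved by both: {top2_overlap}")
```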
@jsuarez@hirschibar Awesome write-up! What about action masking - i.e. how do you handle cases where certain actions aren't possible (and the env returns the mask at each timestep)? Is this something PufferLib supports?
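For context, this is the generic technique being asked about, sketched in plain Python. It is not PufferLib's API, and the function name is hypothetical: logits of invalid actions are set to negative infinity before the softmax, so those actions get probability zero and can never be sampled.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over logits where mask[i] is False for unavailable actions."""
    masked = [l if ok else -math.inf for l, ok in zip(logits, mask)]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("mask disables every action")
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.0.
    exps = [math.exp(l - peak) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Example: the env reports that action 1 is unavailable at this timestep.
probs = masked_softmax([2.0, 5.0, 1.0], [True, False, True])
print(probs)
```

Masking the logits, rather than zeroing probabilities after the softmax, keeps the remaining probabilities normalized without a second pass.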
@NPCollapse Funny story - William Peebles co-authored the Mar 2023 Diffusion Transformer paper on which Sora is based, whilst at Meta as an intern.
But then joined OpenAI last year to co-lead Sora.
So I guess they did know how to do it, but let him leave.
@nickfloats Related question / challenge - how do you get Midjourney to output the usual meaning of 'fork in the road', rather than this? Changing the prompt to use different words isn't allowed.