Here's how I (almost) got the high scores on ARC-AGI-1 and 2 (the honor goes to @jeremyberman) while keeping the cost low. To put things into perspective: o3-preview scored 75.7% on ARC-AGI-1 last year while spending $200/task on the low setting. My approach scores 77.1% while spending $2.56/task!
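For a rough sense of the gap, here is the arithmetic on the two figures quoted above. The variable names are just for illustration; the numbers are the ones in the post.

```python
# Figures quoted in the post above (ARC-AGI-1).
o3_preview = {"score": 75.7, "cost_per_task": 200.00}
this_approach = {"score": 77.1, "cost_per_task": 2.56}

# How many times cheaper per task, and the score difference in points.
cost_ratio = o3_preview["cost_per_task"] / this_approach["cost_per_task"]
score_gap = this_approach["score"] - o3_preview["score"]

print(f"~{cost_ratio:.0f}x cheaper per task, {score_gap:+.1f} points")
```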
New SOTA on ARC-AGI
- V1: 79.6%, $8.42/task
- V2: 29.4%, $30.40/task
Custom submissions by @jeremyberman and @_eric_pang_ are now the best known solutions to ARC-AGI
Both:
* Are open source
* Use Grok 4
* Implement program-synthesis outer loops with test-time adaptation
@karpathy @goakhmad Agree with most points, except that the golden age of movies started in the 80s. imo 70s Hollywood was the most experimental, with the death of the counterculture movement and the end of the Hays Code. Obviously worldwide cinema had a different peak period too.
Coppola showing up at Cannes 1979 with Apocalypse Now, still mostly insane from being in the jungle too long, just spitting bars is what it's all about
Thanks for the cover! My architecture graph does not have a typo: when it's evaluating on the public eval set, the actual test outputs are given, so the system does check if the best program gets 100% on test examples. You are right that we don't know the answers for the submission run.
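A minimal sketch of that distinction, assuming a candidate program is a plain Python callable from grid to grid. `accuracy` and `solved_on_test` are hypothetical names for illustration, not functions from the actual system.

```python
from typing import Callable, Optional

Grid = list[list[int]]
Pair = tuple[Grid, Optional[Grid]]  # output is None when the answer is hidden


def accuracy(program: Callable[[Grid], Grid], pairs: list[Pair]) -> float:
    """Fraction of (input, output) pairs the program reproduces exactly."""
    if not pairs:
        return 0.0
    correct = sum(1 for inp, out in pairs if program(inp) == out)
    return correct / len(pairs)


def solved_on_test(program: Callable[[Grid], Grid], test_pairs: list[Pair]) -> Optional[bool]:
    """Public eval: test outputs are known, so check for 100% on them.
    Submission run: outputs are hidden, so no check is possible (returns None)."""
    if any(out is None for _, out in test_pairs):
        return None
    return accuracy(program, test_pairs) == 1.0


def flip_rows(grid: Grid) -> Grid:
    """Toy candidate program: mirror every row."""
    return [row[::-1] for row in grid]


public_eval = [([[1, 2]], [[2, 1]])]   # answer available
submission = [([[1, 2]], None)]        # answer hidden
public_result = solved_on_test(flip_rows, public_eval)
submission_result = solved_on_test(flip_rows, submission)
```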
@FraserGreenlee Yes, I think this point is underdiscussed. My solution has higher accuracy and lower cost per task on ARC-1 compared to the average human.
That is also why ARC-AGI is the most important benchmark in AI. It is the only benchmark that hasn't been saturated after repeated attempts from players big and small.
That's right: when the system attempts the first task, it skips the program-fetching step, since the library is initially empty.
If you want to see how the library evolves, check out https://t.co/0mz7O2mClG. This is the resulting library after the system attempts the ARC-2 public training set to build Knowledge Priors.
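The flow described above could be sketched roughly like this. `ProgramLibrary`, `attempt_task`, and the `synthesize` callback are hypothetical stand-ins, and fetching "the first k programs" is a placeholder assumption; the real system's retrieval logic and data structures may look quite different.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ProgramLibrary:
    """Hypothetical store of programs collected from earlier tasks."""
    programs: list[str] = field(default_factory=list)
    fetch_calls: int = 0  # counts how often the fetch step actually ran

    def fetch(self, k: int = 3) -> list[str]:
        # Placeholder retrieval: return up to k stored programs.
        self.fetch_calls += 1
        return self.programs[:k]

    def add(self, program: str) -> None:
        self.programs.append(program)


def attempt_task(task_id: str, library: ProgramLibrary,
                 synthesize: Callable[[str, list[str]], str]) -> str:
    # Skip the fetch step entirely while the library is still empty.
    hints = library.fetch() if library.programs else []
    program = synthesize(task_id, hints)
    library.add(program)  # the library grows as tasks are attempted
    return program


# Stub synthesizer that records which hints it was given.
seen_hints: list[list[str]] = []

def stub_synthesize(task_id: str, hints: list[str]) -> str:
    seen_hints.append(list(hints))
    return f"program_for_{task_id}"


library = ProgramLibrary()
attempt_task("task_1", library, stub_synthesize)  # no fetch: library empty
attempt_task("task_2", library, stub_synthesize)  # fetches task_1's program
```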
Excited to announce Hyperbolic's partnership with the ARC Prize (@arcprize), a groundbreaking competition pushing the frontiers of AGI! Receive up to $1000 in compute credits. 🧵
@joshlee361 @arcprize @jeremyberman My solution is cost-efficient. It costs <$500 to fully test on the public eval set with Grok-4. You can decrease the cost further with a more lightweight model.