These plans are a pure gold mine for Apple. The vast majority of people will pay $240 a year for the next several years and never use it. That’s why insurance is a great business!
We got a call from @xai 24 hours ago
“We want to test Grok 4 on ARC-AGI”
We heard the rumors. We knew it would be good. We didn’t know it would become the #1 public model on ARC-AGI
Here’s the testing story and what the results mean:
Yesterday, we chatted with Jimmy from the xAI team, who wanted us to validate their Grok 4 score. They did their own testing on the ARC-AGI-1 & 2 public evaluation sets
To validate their score (and measure possible overfitting), we self-tested the new model on our semi-private evaluation set
We walked them through our testing policy:
* No data retention
* Model checkpoint must be intended for public use
* Temporary increase in rate limits for burst testing
They were on board, so we got started
Initially, we ran into timeout errors with normal requests, so we switched to streaming. That resolved the issue
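The thread doesn't say what caused the timeouts. One common cause is an idle timeout: a non-streaming request sends no bytes until the whole answer is ready, while a streaming request delivers small chunks along the way. Here is a minimal sketch of that idea. The function name, the 60-second limit, and the arrival times are all hypothetical, not taken from the actual test setup.

```python
def survives_idle_timeout(arrival_times, idle_timeout):
    """Return True if no gap between consecutive chunks exceeds idle_timeout.

    arrival_times: seconds (since the request was sent) at which each chunk
    of the response arrives, in increasing order. Hypothetical model only.
    """
    last = 0.0  # the request is sent at t = 0
    for t in arrival_times:
        if t - last > idle_timeout:
            return False  # connection sat idle for too long
        last = t
    return True


# Non-streaming: the full body arrives in one piece after 90 s of silence.
buffered = [90.0]

# Streaming: a small chunk arrives every 2 s until the answer is done at 90 s.
streamed = [2.0 * i for i in range(1, 46)]

print(survives_idle_timeout(buffered, idle_timeout=60.0))
print(survives_idle_timeout(streamed, idle_timeout=60.0))
```

Both responses take the same total time. Only the gap between bytes differs, and under this assumption that would be why switching to streaming helps.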
So, what do these results mean?
First, the facts: Grok 4 is now the top-performing publicly available model on ARC-AGI. It even outperforms purpose-built solutions submitted on Kaggle.
Second, ARC-AGI-2 is hard for current AI models. To score well, models have to learn a mini-skill from a series of training examples, then demonstrate that skill at test time.
The previous top score was ~8% (by Opus 4). Below 10% is noisy
Getting 15.9% breaks through that noise barrier. Grok 4 is showing non-zero levels of fluid intelligence
But the mission isn’t over. We need new ideas to solve ARC-AGI-2. Scale alone won’t get us there
Come work on ARC-AGI with us
The fireworks in your mind. 🧠✨ This sparkling video shows the neurotransmitter glutamate being released into synapses, made possible by an indicator developed by @abhi_aggarwal1, @PodgorskiLab, and team.
#HappyNewYear #NYE
Grok-4 benchmark leak just dropped.
- HLE: 35 → 45 w/ reasoning
- GPQA: 87 → 88
- AIME’25: 95
- SWEBench (Code model): 72 → 75
If validated, Grok-4 is flirting with Claude Opus territory. Release looks imminent. xAI is officially in the frontier model race.