GPT-5.5 (xhigh) sets a new pass^5 high score on ZeroBench
pass@5: 22% (SOTA 23%)
pass^5: 10% (prev. SOTA 8%)
Best 5/5 reliability so far
Strong result from @sama and the OpenAI team
The matrix shows how often the models answered each question correctly across 5 samples
Leaderboard: https://t.co/NouEsFxJEM
Data: https://t.co/mD8Eptr9M5
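For anyone unsure how the two numbers differ, here is a minimal sketch. It assumes pass@5 counts a question as solved if at least one of the 5 samples is correct, and pass^5 only if all 5 are correct (the "5/5 reliability" reading). The function names and toy matrix are hypothetical, not taken from the ZeroBench codebase or data.

```python
from typing import Sequence


def pass_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions with at least one correct sample."""
    return sum(any(row) for row in results) / len(results)


def pass_hat_k(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions where every sample is correct."""
    return sum(all(row) for row in results) / len(results)


# Hypothetical toy matrix: 4 questions x 5 samples (True = correct answer).
toy = [
    [True, True, True, True, True],      # solved 5/5
    [False, True, False, False, True],   # solved 2/5
    [False, False, False, False, False], # solved 0/5
    [True, False, True, True, True],     # solved 4/5
]

print(f"pass@5: {pass_at_k(toy):.2f}")   # 3 of 4 questions have a correct sample
print(f"pass^5: {pass_hat_k(toy):.2f}")  # 1 of 4 questions is correct every time
```

Under these assumptions pass^5 can never exceed pass@5, so the gap between the two shows how much of a model's score depends on lucky samples.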
The Claude models are great for coding
But on visual reasoning they still trail the frontier
On ZeroBench (pass@5 / pass^5):
Opus 4.7 (xhigh) - 14 / 4
Opus 4.6 - 11 / 2
GPT-5.4 (xhigh) - 23 / 8
1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
📣📣 New SOTA for GRAB-lite, our graph analysis benchmark
GPT-5.4 crushes Opus 4.6: 71.0% vs 45.6%
Impressive work from @gdb and team
Updated leaderboard now live
Following the Python SDK, Browserbase support is now live in the warpsurf Node SDK
Run warpsurf on remote browsers from Node.js
@warpsurfai x @browserbase
warpsurf can now run on remote browsers
The warpsurf Python SDK now supports Browserbase
Run warpsurf without managing the browser infrastructure yourself
@warpsurfai x @browserbase
There’s now a Node SDK for @warpsurfai
Run browser workflows from TypeScript on Node.js
Use it in your own scripts, tools, and pipelines
Details and repos below 👇