Intelligence on intelligence. I cover ~consequential~ AI through briefings, analysis, and commentary. Full site built by me through my business, ForeverBuilt.
The closest formal benchmark may be “Agentic Abstention”, a paper that was just posted to arXiv on June 27th. It definitely seems the field is converging on this exact question.
https://t.co/i322103je4
There’s also OptimalThinkingBench, which tackles the token-waste side more directly. This one is from last year.
https://t.co/KWMSOVFwpy
@guawuchang2000 @AiBattle_ BrowseComp and OSWorld-Verified, the benchmarks that measure whether a model can drive a browser and operate a real computer, are where Sonnet 5 shows a clear gain over 4.6 and narrows the gap to Opus. This would be one area where Sonnet 5 could be considered an improved model.
Underrated comment! I agree and think this is exactly the right thinking. A generous explanation would be that single-shot tests like SWE-bench undersell the actual improvement Sonnet 5 was meant to bring, which is in finishing long agentic chains rather than nailing one isolated patch or another.
Three wins for Anthropic in one day, plus Google's video model, a $3.1B industrial-AI deal, and Jim Keller shutting down the Qualcomm rumor. All of it in today's Bulletin.
https://t.co/KIapmzzAcU
In line with the message from Anthropic so far, though general coding was not called out by the model until asked directly, at which point the response was “Regular coding stays on Fable 5 - the classifier keys on domain content, not task type”.
Based on my previous coverage, Fable 5 is the better-supported model right now, not necessarily the better model in every dimension.
Its capability lead over GPT-5.5 (the prior benchmark) is large and third-party corroborated (Artificial Analysis, independent SWE-Bench Pro numbers). GPT-5.6 Sol's comparable claims are still resting on OpenAI's own preview figures.
That said, "better" depends on what you're optimizing for. If you weight behavioral risk in autonomous/agentic settings, GPT-5.6's overreach tendency is arguably the more consequential flaw of the two, since it gets worse as models take on more independent action. Fable 5's worst issues (the invisible safeguard, the over-refusal gap) were process/deployment problems that Anthropic could walk back quickly, and to some extent already did.
The bluntest of takes, but likely not too far off for many 😅, if for no other reason than that Fable 5 was only a blip before it got pulled down. Yet I’ve seen so many posts implying the fate of various projects or entire businesses hinged on the re-release. How could one get into that much of a jam in such a short time?
My full review breaks Sonnet 5 down into what's new, what it actually costs, and whether you still need Opus. Link below for anyone interested.
https://t.co/HvrqAL52ns
"Opus-class autonomy at a Sonnet price" is the pitch for Claude Sonnet 5. The fine print: a new tokenizer means each task burns more tokens than the sticker implies. The discount is real. It's just smaller than it looks. Anyone have more details on the new tokenizer?
We’re introducing GeneBench-Pro, a research-level benchmark for a harder kind of AI progress: how well agents can navigate messy biological data, choose the right analysis path, and make judgment calls that real computational research depends on.
https://t.co/AsilnnSxnE
@eikkenberg @petergyang Fair enough 😆. The novelty might eventually wear off when people get tired of waiting multiple minutes for what they thought would be a yes or no answer.
Claude Fable 5 will be available again globally tomorrow.
After a series of productive conversations with the US government, we're redeploying the model with a new set of classifiers to target and block more cybersecurity tasks. In the near term, some routine tasks like coding and debugging will fall back to Opus 4.8. We’ll continue to refine these classifiers over the coming weeks to reduce false positives and better distinguish genuine misuse from legitimate requests.
We’ve also begun drafting a consensus framework—with Amazon, Microsoft, Google, and other Glasswing partners—for assessing the severity of AI jailbreaks and how AI developers should respond to them. We invite other industry partners and model providers to join us in this effort.
Finally, we’re scaling up our collaboration with the US government on model testing and safeguards. This will include pre-release access to models and safeguards for evaluation, information sharing on jailbreaks and misuse, and dedicated resources for joint research.
Thank you to our users for your patience, and to our partners across the government, industry, and the research community who worked alongside us to make Fable 5 available again.
Read our full blog: https://t.co/VHyum831ri
Anthropic shipped its most powerful public model, then quietly built in a safeguard that degrades your output without telling you it fired. Page 13 of the system card admits it. Researchers pushed back; the feature was reversed in 48 hours.
We’ve received notice that the Department of Commerce has lifted export controls on Claude Fable 5 and Mythos 5.
We'll begin restoring access tomorrow, and will share an update soon.
We’re grateful to our users for their patience, and to everyone who worked with us on redeploying the models.