Claude Opus 4.5 is now OUTPERFORMING Claude Opus 4.6 on BridgeBench Hallucination.
Read that again.
The legacy model is beating the current flagship.
We benchmarked Opus 4.5 this morning to confirm what we saw yesterday.
Claude Opus 4.6 fell from #2 to #10 with a 98% increase in hallucination.
Now Claude Opus 4.5 is scoring higher.
This isn't a bad benchmark run.
This is a nerfed model.
Anthropic silently reduced Claude Opus 4.6 and the data proves it.
You're paying $200/month for a model that's getting worse.
@bridgebench will keep tracking it.
1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
New post: We tested the Mythos showcase vulnerabilities with open models.
They recovered similar scoped analysis! 8/8 models found the flagship FreeBSD zero-day, including a 3B model.
Rankings reshuffle completely across tasks => the AI cybersecurity frontier is super jagged!