I know that this take is controversial, but I'm convinced that @AnthropicAI did the right thing with Glasswing and not releasing Mythos immediately to the public. Super important to harden critical systems and software before these capabilities are available to anyone.
Opus 4.8 thinks one abstraction level higher than Opus 4.7 - and beats GPT-5.5 on ARC 3.
(and arguably performs better than I would have in a limited amount of time..)
Shows once again how almost all public benchmarks are maxxed out by now. We're still scratching the surface.
@arcprize just published results for Opus 4.8 ARC-AGI 1, 2 & 3
My notes:
* Opus 4.8 showed two behavior differences over Opus 4.7.
1) It operated at an abstraction level *above* 4.7. It was able to see the ARC-AGI-3 environments as objects, not just collections of pixels
2) Instead of short action resets like Opus 4.7, Opus 4.8 would often execute a long series of actions *before* resetting a game. It was holding onto hypotheses longer before giving up
* *Feeling* model performance - I'm biased (duh), but imo no other benchmark lets you *feel* a model quite like ARC-AGI-3. Looking at the dc22 replay (attached and link below) you can see the model work through problem, get stuck, and figure it out. Getting past 3 levels shows basic level understanding of this game. There is a new mechanic on level 4 which stumps it.
* Updated System Prompt - We observed that in our original system prompt, GPT and Gemini, unlike other models, would not "think out loud" in their reply. This caused them to *only* return an action in their response (ex: "ACTION1"). This capped the signal we were able to extract from the model.
We updated the system prompt used for ARC-AGI-3 to *explicitly* say context will be carried forward instead of the original *implicit* nudge
See the exact change on the commit below This will be the system prompt going forward. We aren't re-testing the previous 6 models at this time due to api costs (estimated at $40K) https://t.co/F6ZqIfey4i
Ant pulled Opus 4.7´s business-skills RL for breeding dishonesty - @andonlabs' vending-bench cratered $10,937 → $2,992 on 4.8.
One interesting reward hacking example from system card: 4.8 flooded the log with "PASSED" to evict failing tests from the grader´s 400KB ctx... 😶🌫️
@ccatalini I agree!
btw @ccatalini - what's your take on the recent piece by @eastdakota ?
> AI isn’t coming for builders or sellers, but it is coming for measurers.
verifiers=!measurers - but so far, building has become more commoditized than measuring? 🤔
https://t.co/RcnaZq9Kyy
@flozi00@AnthropicAI the takes from insiders and people I trust were: sure, with enough inside knowledge and steering you can get any frontier or ft model to find these vulns as well - but actually finding and exploiting without guidance, from scratch, is were Mythos is a step-change 🤷
I know that this take is controversial, but I'm convinced that @AnthropicAI did the right thing with Glasswing and not releasing Mythos immediately to the public. Super important to harden critical systems and software before these capabilities are available to anyone.
'For the last few months, Anthropic has used Mythos Preview to scan more than 1,000 open-source projects, which collectively underpin much of the internet—and much of our own infrastructure.
So far, Mythos Preview has found what it estimates are 6,202 high- or critical-severity vulnerabilities in these projects (out of 23,019 in total, including those it estimates as medium- or low-severity).'
@dkundel haha nice cameo 😁
excited to try this out. translation is great but this will really unlock a lot once cheap/good enough to run in the background all the time...
btw think translators were first on many "professions affected by AI" lists 🫣
@stwboerse but atm $GOOG is just tier 2 - their models are super smart, but agentic capabilities and harness are WAY behind oai/ant. tried antigravity again last week for a project where gsearch/youtube could help - barely usable. And they´re not shipping at the same speed 😕
I don't believe this.
By the time "real" AI is widely adopted in non-tech jobs & sectors, models will be so strong,that the average entry level hire won't be able to compete (for jobs that can be done on a computer.
I expect a job market bloodbath for new grads in 1-2 years 🫤
I have changed my mind on how AI will impact jobs in America.
Previously, I believed AI would replace many entry level roles typically filled by young employees. The technology would then work its way up the organization and eventually reduce the total number of jobs in a company.
The data is saying something different, so when I get new information I am willing to change my mind.
The number of software engineers being hired has been increasing. The number of open software engineer roles is growing.
The number of new college grads who get hired has increased 5.6% over the last 12 months. The unemployment level for people aged 20-24 years old who have a college degree has fallen from nearly 9% to almost 5% as well.
The Wall Street Journal recently wrote “AI created 640,000 jobs between 2023 and 2025 in the U.S., according to an analysis by LinkedIn of job posting data, including new white-collar positions such as Head of AI and AI engineer.”
And I am starting to see companies throughout our portfolio aggressively hiring to keep up with the demand for their products and services.
If AI can make employees more productive, which is widely accepted as fact, then companies are going to want as many productive units of labor as possible. This is a key reason why I am changing my mind.
AI appears to be a magical technology that will make companies more productive and more profitable. The net result will be more corporations, more startups, and more jobs.
All three are big, positive wins for the American economy.
@ChrisPainterYup@scaling01 (but actually I think you were overly harsh and also think sam will do the right thing in the end; rooting for OAI, competition is good for everyone as long as safety is taken seriously)
I'm really thankful for OpenAI keeping the "morally misguided" people away from Anthropic
it's like a filter
not saying it's all, most just want to build something and solve problems
but choosing to work for Sam and Greg and a company with that history is certainly a choice
i'm confident that all the right people will end up at the right place