@0xAk1 @fransrosen @Hacker0x01 the moment you submit a vuln to h1, the vuln is no longer yours, no longer your IP. It's always been like that, nothing changed here AFAIK
I'm not sure the community will like this. @Hacker0x01 will now reuse your novel techniques / exploits / old reports to look for vulns on the rest of the customer's infra. I guess they will add you as collab and give you a bounty, right? right?!
If I understand this correctly, there are three options for Bug Bounty hunters:
(1) Stop reporting novel exploits / techniques to HackerOne
(2) Hoard bugs until they have scanned the entire program's infra, to avoid getting "duped" by H1 agents
(3) Shrug and continue as usual
@rez0__ @joaxcar These vulns are not public and not in training data, otherwise it would be pointless of course (well, maybe not pointless, but they would have less value as benchmarks). That’s why I never understood why others kept using the old xbow benchmarks: those are public and therefore obsolete.
A lot of people have been wondering about Mythos, Glasswing, and the vulns we / our partners are fixing. Today, I’m excited for us to start sharing more. (For context, I lead Glasswing @AnthropicAI.)
Two independent evaluations this week—from XBOW and the UK AISI—confirm what we've been seeing internally: Claude Mythos Preview is a step change in autonomous cybersecurity capabilities. We need to start preparing fast for a world of models with this level of capabilities.
The UK AI Security Institute tested the model we shipped at the launch of Project Glasswing and found Mythos Preview is the first model to solve both of their end-to-end cyber ranges, including one (Cooling Tower) which no model had ever cleared. But attackers (and defenders) have sophistication & cost constraints – Mythos is also the only model that clears every one of their tasks estimated over 8 hours under their deliberately low 2.5M-token cap.
XBOW tested it on their offensive security benchmarks, finding "token-for-token, unprecedented precision." It's the only model to succeed at subtle V8 sandbox work.
Other Glasswing partners shared similar stories. In a few weeks of testing, Mythos Preview has helped them find many thousands of (estimated) high + critical severity vulnerabilities, sometimes double what they'd normally find in a year.
I don't share this to boost Mythos. In fact, this is not about Mythos. It’s about preparing for the coming world of models being better, faster, cheaper, and more creative than some of the best human experts at dual use capabilities. Clearly, we need them supporting defenders as widely as can be done safely – and especially the least resourced ones.
Within a year, Mythos will probably look quite dumb (relative to other new models). And others may release openly available or unguardrailed models of Mythos-level capabilities.
We started Project Glasswing because capabilities like Mythos Preview's won't stay rare, or stay in careful hands. We are bringing it to defenders as fast as we responsibly can, while working to figure out, for example, the right safeguards and patching & disclosure processes.
Also, to be clear, compute has never been a limiter in our rollout.
Expect a fuller update on our Glasswing work in the coming days.
XBOW report: https://t.co/Mumtbf3kE3
UK AISI report: https://t.co/vBgqz0AeKJ
Security is an economic decision.
For a fixed cost, within @XBOW, which model has the best odds of crafting an exploit?
GPT-5.5 > Mythos > Opus 4.6 on real OSS web vulns.
Curves below.
For the past 2 months, XBOW has been testing Mythos Preview under embargo as part of a select early-access group.
Today, we can finally share what we found.
The headline: Mythos Preview is a major advance. It is substantially better than prior models at finding vulnerability candidates, especially when source code is available.
But it’s not perfect. We surfaced issues with exploit validation, judgment, and efficiency.
Our full write-up covers where Mythos Preview shines, where it still needs support, and what we think this means for the future of offensive security: https://t.co/wPIhNeztO9
But even if it wasn’t useful for this specific use case, it excels at most of the offensive tasks we evaluated. Every model has a specific place in your harness and there is no model that is just “good at everything”.
A few months ago we had access to Mythos. I was lucky to be part of the group of people experimenting with it. My personal take: there is nothing close to it. With the right harness you can throw it at anything with excellent SNR. Official comm: https://t.co/dzNh3S0CoQ
It was extremely conservative. Even after ~20 rounds of prompt engineering to tune the prompts for the new model, it kept dismissing other agents' work as “informational” even when actual leads or exploit chain links were hidden in the trace
There is something really addictive about having lots of agents in flight; it’s the same feeling I used to get when I had a big fuzzing job, scraping run, or “compile all the things” type of experiment: the feeling that somewhere silicon is working tirelessly toward your goals
The cat's out of the bag! My latest book, "The Secret Life of Circuits", is available in early access:
https://t.co/ormpiPwapu
It's what I wish I had when I was starting out. Electrons to embedded systems, 290+ color illustrations and 420+ pages of well-explained theory.
We had early access to Opus 4.7 and ran it against real exploit targets.
First look: fewer vulns found per run than 4.6. We almost wrote it off.
Then we realized we were counting completions, not tokens. Opus 4.7 takes smaller, more precise actions. Normalize by token budget and the picture flips: it finds more, for less...
How you measure matters as much as what you measure.
Check out @thewunderalbert's blog post: https://t.co/P5cf3Kr9G0
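To make the completions-vs-tokens point concrete, here is a minimal sketch in Python. All numbers are made up for illustration and are not from the post or the linked write-up; the `RunStats` type and both metric functions are hypothetical names, not part of any real harness.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Aggregate results for one model (hypothetical structure)."""
    runs: int          # number of completed runs
    vulns_found: int   # total vulnerabilities found across all runs
    tokens_used: int   # total tokens consumed across all runs


def vulns_per_run(stats: RunStats) -> float:
    # The "counting completions" view: findings divided by run count.
    return stats.vulns_found / stats.runs


def vulns_per_million_tokens(stats: RunStats) -> float:
    # The "normalize by token budget" view: findings per 1M tokens spent.
    return stats.vulns_found / (stats.tokens_used / 1_000_000)


# Illustrative numbers only: the newer model finds fewer vulns per run
# but spends far fewer tokens per run.
older = RunStats(runs=10, vulns_found=20, tokens_used=40_000_000)
newer = RunStats(runs=10, vulns_found=15, tokens_used=15_000_000)

for name, stats in (("older", older), ("newer", newer)):
    print(
        f"{name}: {vulns_per_run(stats):.2f} vulns/run, "
        f"{vulns_per_million_tokens(stats):.2f} vulns/1M tokens"
    )
```

With these assumed numbers the per-run metric favors the older model while the per-token metric favors the newer one, which is the kind of flip the post describes.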