What does it look like when Claude, GPT, and Gemini try to hack each other in real time?
BattleBench is a cybersecurity benchmark I built where AI coding agents battle in identical vulnerable containers. They scan networks, exploit opponents' vulnerabilities, and submit captured flags to a referee. The referee kills the loser's container. Last agent standing wins. Everything runs simultaneously with no human intervention.
What's live now:
→ Elo leaderboard across multiple scenarios
→ Full terminal replays of every agent's session
https://t.co/G9iRb8ilio
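A minimal sketch of how a last-agent-standing result could feed an Elo leaderboard, assuming the winner is scored as beating each eliminated agent pairwise. The function names, the K-factor of 32, and the 1500 starting rating are illustrative assumptions, not BattleBench's actual rating code.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: dict[str, float],
    winner: str,
    losers: list[str],
    k: float = 32.0,  # assumed K-factor, not taken from BattleBench
) -> dict[str, float]:
    """Return new ratings after one game, treating it as winner-vs-each-loser."""
    new_ratings = dict(ratings)
    for loser in losers:
        # Expectations use the pre-game ratings so loser order doesn't matter.
        exp_win = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - exp_win)
        new_ratings[winner] += delta
        new_ratings[loser] -= delta
    return new_ratings


# Hypothetical three-agent game where every agent starts at 1500.
start = {"agent_a": 1500.0, "agent_b": 1500.0, "agent_c": 1500.0}
after = update_elo(start, winner="agent_a", losers=["agent_b", "agent_c"])
```

Each pairwise update is zero-sum, so the total rating pool stays constant from game to game.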
BattleBench has seen ~275 games played. Likely not enough to yield anything truly insightful yet, but a few things already stand out:
- gpt-5.2-codex is a beast.
- smaller/earlier models are more likely to refuse to play.
- the agents are obviously faster than humans would be.
I'm eager to see how the benchmark plays out over 1000+ games and how the latest gpt spark models compare with opus 4.6 fast.
Go watch codex destroy its opponents (except Opus) and let me know if you have any feedback.
> January 2024
> OpenAI announces the GPT Store
> see that GPT usage will pay out devs
> same day, publish 9 GPTs in the store
> 2 hit 10k+ chats
> never hear about GPT monetization again
This is partly crazy because some orgs have been monitoring for and then revoking tokens on behalf of compromised victims for the past couple of weeks. Now that proactive community defense becomes even more destructive for the victim. Savage.
Most important professional skill has gotta be storytelling.
Really doesn’t matter what your job is: if you are competent + a storyteller, you will fly high.
@BVeiseh I wish agents could make high quality TRRs at scale. I have a little prototype of it, but would love to see an AI security startup like mindfort do it for the masses.
Would earn a ton of goodwill in the community if done right.
appreciate the post, thanks for sharing.
"Organizations are making security purchasing decisions based on a threat model that assumed attackers would not be able to study how their defensive products actually work."
The above was probably always a bad idea with or without AI lol.
the 'what defenders should do' section is a great set of action items for lots of orgs.
although ps logs ime are quite expensive, surprised to see them characterized as cheap here.
overall think the post does a good job helping defenders focus on the important things - thanks.
Company: Snap
Cut: ~1,000 - 16%
Evidence: CEO letter cited AI reducing repetitive work, increasing velocity, and enabling smaller teams
https://t.co/kuGOnpYjlU
Snap, parent company of Snapchat, is making a massive workforce reduction — eliminating 1,000 jobs, representing 16% of its current employees, in a move to accelerate net profitability. In addition, the company is closing 300 open roles.
CEO Evan Spiegel believes “rapid advancements in artificial intelligence” will help smaller groups work better.
https://t.co/3quj1GNDcO
Coding agent logs are much like PowerShell logs to me.
You shouldn’t really need them to make factual claims about what happened, but wow, they are extremely helpful for why/how context when performing investigations.
And just like PS logs, their verbosity means most can’t afford them.