I tested the code scanning capabilities of 10 different, widely available, large language models (without cyber guardrails) using the same real-world codebase and asked: how good are the LLMs at finding security vulnerabilities?
I used the same methodology that I've been using to find 100s of vulnerabilities in open source software and libraries over the past month.
After model self-review, deduplication of the same findings across models and an independent model assessment, I found 350 distinct vulnerabilities.
The headline: on a single run, no model found more than ~35% of them, and false-positive rates ranged from ~2% to ~30%.
These were our results: 🧵
We've been doing a lot of scanning of open source repositories in the past month or so since @OpenAI opened up its Trusted Access for Cyber program.
We've had impressive results with GPT 5.5 without guardrails as part of that program. This led us to test out the effectiveness of other, more widely available models.
The question was, could good (or bad) actors without access to a formal program like those from OpenAI or Anthropic use other models to find similar vulnerabilities in source code as well?
The short answer is yes. Different models may also find different types of vulnerabilities, so you might want a multi-model approach to your AI code scanning efforts.
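To make the multi-model point concrete, here is a minimal sketch of how per-model and combined coverage could be computed once findings have been deduplicated. The function name, model names, and finding IDs are all hypothetical placeholders, not our actual tooling or data.

```python
from typing import Dict, Set


def coverage_report(findings: Dict[str, Set[str]], confirmed: Set[str]) -> Dict[str, float]:
    """Return the share of confirmed vulnerabilities found by each model
    and by all models combined (under the key "combined")."""
    if not confirmed:
        raise ValueError("confirmed set must not be empty")
    # Per-model coverage: confirmed findings reported by that model / all confirmed.
    report = {
        model: len(ids & confirmed) / len(confirmed)
        for model, ids in findings.items()
    }
    # Combined coverage: union of every model's findings, restricted to confirmed ones.
    combined = set().union(*findings.values()) & confirmed
    report["combined"] = len(combined) / len(confirmed)
    return report


# Hypothetical, made-up data purely for illustration.
confirmed = {"V1", "V2", "V3", "V4"}
findings = {
    "model_a": {"V1", "V2", "FP1"},  # FP1 is a false positive, so it is not counted
    "model_b": {"V2", "V3"},
}
report = coverage_report(findings, confirmed)
print(report)
```

In this toy example each model finds half of the confirmed issues on its own, but together they cover three of the four, which is the effect a multi-model approach is meant to capture.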
And for us, the surprising star of the show? @cursor_ai's widely available Composer 2.5 model. Best price/performance by a significant margin.
So if you're not in a geography from which you can get access to either Anthropic or OpenAI's security programs, you do have options (and of course, so do the bad guys, so let's get going!).
One finding surprised me: @cursor_ai's Composer 2.5 model had a low false positive rate while being one of the models that found the most vulnerabilities.
I cannot overstate how powerful Codex is for cybersecurity work.
I'd encourage all defenders to sign up for Trusted Access for Cyber (https://t.co/e1Mh8aZArY) and give it a shot for their workflows.
If orgs are slow to get TAC approvals, please reach out to me.