We have a date, we have a time: The second round of tickets for #BSidesVienna will start on Sunday 26.10.2025 at 19:00 Vienna time UTC+2! Do not miss it, this might be you last chance. Spread the word!
We are alarmed by reports that Germany is on the verge of a catastrophic about-face, reversing its longstanding and principled opposition to the EU’s Chat Control proposal which, if passed, could spell the end of the right to privacy in Europe.
https://t.co/015qmQnIS2
I've been researching the Microsoft cloud for almost 7 years now. A few months ago that research resulted in the most impactful vulnerability I will probably ever find: a token validation flaw allowing me to get Global Admin in any Entra ID tenant. Blog: https://t.co/jD6EaGtsn3
It is an interesting take, but not the hack mentioned, from the screenshot, it looks like a basic script-kiddie hack, brute-forcing credentials on an exposed RTSP camera (rtsp://[username:password@]ip_address:port/path).
Still, the bigger question behind it is interesting:
* vulnerable software is simulatable.
* penetration success is verifiable.
* hacking is RLable.
More importantly, the assertion that vulnerable software is simulatable begs and exploitation success is verifiable suggests we could apply RL to vulnerability research. But this claim requires much deeper understanding from first principles to determine if vulnerability research can truly be solved through RL.
# Science of a Vulnerability
If you ask a top security researcher how they find bugs in complex software like Chrome's V8 engine, you'll get the same vague answers Cristiano Ronaldo gives about scoring goals: "hard work, practice, experience, mindset." You won't get the real answer about what happens inside their brain and body when they spot a vulnerability or hit a goal, because it's tacit knowledge learned through years of experience.
But here is my attempt to hypothize what vulnerability research actually is:
> Vulnerability research is recursively understanding software by processing code and documentation, forming highly abstract concepts in memory similar to state machines, then recursively applying reasoning to these abstract concepts to find weird states we call vulnerabilities.
The recursive nature is crucial both for identification of weird state machines and for processing the information itself. To understand V8 code, you need to understand JavaScript engines. To understand JavaScript engines, you need to understand compilers. To understand compilers, you need to understand computer architecture. And so on.
A similar recursive loop applies while identifying a bug. Once you have the understanding and abstract concepts, you apply recursive reasoning in multiple layers. A type confusion could allow remote code execution, but to find a type confusion, you need to understand how type confusion works, then reason about the code to find where type assumptions can be violated. And so on.
# Can we do RL powered vulnerability research?
Now that we've defined what vulnerability research entails, let's examine the prospects of applying reinforcement learning to this domain.
For successful application of RL to vulnerability research, there are two critical requirements:
1) Environment that can simulate "real" vulnerabilities
2) Good feedback signals (rewards) for trajectories that lead to vulnerability discovery
At first glance this seems doable, to create an environment with good signals, we could harvest all Chromium security issues, generate training data, and perform something like GRPO (Group Relative Policy Optimization) for successful vulnerability identification compared against the oracle of original bugs. We'd have both a simulated environment and grounded reward signals.
But there are fundamental caveats that make this approach less promising than it initially appears.
# Caveat 1: Variant Analysis
This is not a caveat but, I argue that if we know what vulnerability to simulate, then you don't need an RL-trained hacking LLM, you could just use Gemini 5 or Claude 5 to find that vulnerability. These models would likely be able to reason through it as a side effect of being trained on mathematical reasoning.
Remember that "programs are proofs, and proofs are programs, and a vulnerability is also a proof for a program." If we've already identified the vulnerability class and can simulate it, we've essentially solved the hard part. The actual discovery becomes a reasoning task that future general LLMs can likely handle.
This means RL would only be valuable for finding completely novel vulnerability classes , discovering entirely new knowledge rather than variants of known bugs.
# Caveat 2: The New Knowledge Problem
Finding a new vulnerability is a completely different beast. It's like discovering new knowledge in a state machine, the bigger and more complex the state machine, the harder it becomes to find novel vulnerabilities.
Consider a real example: when V8 introduced the new heap sandbox cage, a V8 vulnerability researcher receives this initial prompt: "V8 introduced this heap sandbox cage, bypass the cage."
This represents completely new information, so the researcher has to:
1. Dig into design documents and code to completely understand the cage mechanism
2. Build mental models of how it constrains memory access
3. Start reasoning about the state machine to find ways to escape the cage
4. Iterate through countless failed attempts
That single prompt would eventually lead to finding a bypass, but only after the bypass is discovered do we get any reward. The reward signal is incredibly sparse, you might spend weeks or months understanding the system with zero positive feedback until that eureka moment when you find the escape vector.
Isn't this level of sparse reward would be extraordinarily difficult to simulate and even harder for an RL system to navigate effectively?
# Why AlphaProof Succeeded Where RL Hacking Would Fail
The recent Alphaproof breakthrough might illustrates exactly why mathematical theorem proving works for RL while vulnerability research doesn't.
Its success can be completely attributed to having an incredible environment and feedback loop from Lean programming and the Lean theorem prover. You get immediate feedback on whether the formal method you provided as proof is correct, and then RL systems like AlphaGo can handle the remaining search task.
In my opinion, VR has no immediate feedback. It would be hard to provide a better feedback loop for finding completely new vulnerabilities. Most of the exploration is useless, until a point where weird machine appears. I would argue that finding a needle in haystack is where RL might fail.
That doesn't mean this problem is unsolvable. Problems are soluble.
# A potentially better way to solve hacking?
Rather than trying to create "AlphaHacker" with sparse vulnerability rewards, we provbably should focus on improving base LLMs on mathematical reasoning tasks. The side effect will naturally lead to better vulnerability research capabilities.
This mirrors how humans actually learn. We don't start with tabula rasa before vulnerability identification. We build reasoning capabilities through other methods, mathematics, nature, programming, then apply those skills to security research.
I would argue that LLM that do better on math and programming domains, will naturally transfer to vulnerability research. I will revisit this blog in a year, if its true or not.
There’s a new illness. I call it “LLM Dependency Syndrome.”
You can save that $1,200/month by writing simple code:
>extract phone numbers: regex
>check profanity: blacklist
>reformat JSON: json parser
>uppercase text: .upper()
These few lines of code are faster, cost nearly $0, and are more accurate than LLMs (which can hallucinate).
This really separates those with CS/coding background from those without.
More than coding, a massive benefit of LLMs is brainstorming technical implementation details.
Specifically for students: they’ve always had StackOverflow to copy text from, but it was HARD to get opinions on architecture and design decisions etc.
Now you can discuss algorithms and data structures and different approaches to solving problems, and debate their pros and cons.
I find myself using Claude for a lot of planning and discussing features before we even get to coding anything.
Much like humans, CPUs heal in their sleep.
CPUs are *technically* replaceable / wear items. They don’t last forever.
Yet, the moment stress is removed, transistor degradation (partially) reverses.
It's called Bias Temperature Instability (BTI) recovery:
Then: We were the kids who saw the blinking cursor not as a barrier, but as
an invitation. We typed characters into the voids and got back secrets. Our
goal was not destruction, it was understanding — to understand the systems
better than those who built them. The thrill of "getting in" was matched
only by the beauty of making something out of nothing.
Now: Hacking is a job title. Curiosity has been commodified. A thousand
"Bug Bounty Platforms" are trying to monetize your desire for
understanding, to turn it into CVEs and T-shirts. CTFs have become
resume-building exercises. Reverse engineers wear corporate badges.
Developed by government employees rather than openly in the community,
exploits get embargoed, not shared.
The paradise of the underground has been paved over by venture capital and
compliance frameworks, steamrolling everything we used to stand for.
🚀New plugin in the Caido Store!
Introducing "NewRequests" by @0xntrm
Identify which requests follow a certain action by filtering out the HTTP History table with a hotkey.
Check out more details: https://t.co/RB2ruuxdxT
Well, I'm ✨brainfried✨ from this workday anyways and it's Friday evening here, so why not analyze the newly dropped Laravel Vulnerability (CVE-2024-52301). If I got something wrong, let me know!
Well, I'm ✨brainfried✨ from this workday anyways and it's Friday evening here, so why not analyze the newly dropped Laravel Vulnerability (CVE-2024-52301). If I got something wrong, let me know!