The way we are currently evaluating LLMs for vulnerability detection might be flawed!
New paper w/ @mboehme_ .
Paper: https://t.co/7VQepP4jn1
Code: https://t.co/g5teasvIIj
(1/10)
Agents like #OpenClaw have taken the world by storm. But how easy is it for attackers to achieve their goals while your agent is solving your task?
#VeriGrey uses ideas from greybox fuzzing to evolve skills that are most effective at achieving an attacker goal.
@AbhikRoychoudh1
Chasing Shadows: Pitfalls in LLM Security Research - https://t.co/7FoQJ6E31s
We assess the prevalence of these pitfalls across all 72 peer-reviewed papers published at leading Security and Software Engineering venues between 2023 and 2024. We find that every paper contains at least one pitfall, and each pitfall appears in multiple papers. Yet only 15.7% of the present pitfalls were explicitly discussed, suggesting that the majority remain unrecognized.
To understand their practical impact, we conduct four empirical case studies showing how individual pitfalls can mislead evaluation, inflate performance, or impair reproducibility. Based on our findings, we offer actionable guidelines to support the community in future work.
Authors: Jonathan Evertz, @niklas2484, Nicolai Neuer, Andreas Müller, Philipp Normann, Gaetano Sapia, Srishti Gupta, David Pape, Soumya Shaw, Devansh Srivastav, @chwress, @ErwQui, @_thrsten, @darpsec, @leaschnherr - @KITKarlsruhe, @ruhrunibochum, @tu_wien, @SapienzaRoma
#LLMSecurity #AISecurity #SecurityResearch #Reproducibility #Benchmarking #Evaluation #RedTeaming #PromptInjection #DataLeakage #ContextWindow #ModelCollapse #MLSecurity
Our paper "Scaling Security Testing by Addressing the Reachability Gap" has been accepted at #ICSE26!
New paper w/ @mboehme_.
https://t.co/tLhkTIY0Eo
https://t.co/rbpxJZLFp6
Crawling isn't innate (unlike walking). Every baby must *invent* crawling, from scratch, using extremely little data, and no reference to imitate. Which is why different babies end up with different ways of crawling.
Sometimes people tell me, "you say AI isn't intelligent until it can invent, but most humans can't invent anything either!" -- in reality, we are all constantly inventing. Even babies are inventors. You couldn't navigate a single day in your life if you weren't capable of invention.
The only bitter lesson is that LLMs have succeeded beyond any expert expectations.
Underpinning LLMs is the idea of scaling, which is too often misunderstood as more parameters. Scaling is about using massive compute effectively to maximise the throughput of data ingestion into the learning process to obtain more capable models.
We are still far from hitting the limits in this. We are still compute hungry because there is a ton more we could achieve if only we had more compute, from experimental ablations to data acquisition and curation.
Scaling is largely about data and evals. The models are now trained on almost all the web and equally large (but growing) self-generated synthetic data. Sifting through such vast quantities of data (the whole of human creation) requires formidable engineering and intelligent ideas. This is what differentiates most models.
AI is finally in the hands of billions of users, and with it come billions of tasks - every reasonable user need. This scaling in tasks and evaluations is many orders of magnitude larger than pre-LLMs.
Having the right architecture matters, but we know several alternatives could all work well, e.g. replacing attention in Transformers with RNNs and interleaving such layers with local layers. What matters is fine ablations to maximise hardware usage. This is the realm of sophisticated high-precision engineering. It encompasses semiconductor design, datacenter design, distributed systems, MFU, etc. There is fascinating work on flow matching, JEPA, sparser MoEs, etc., that is all consistent with scaling.
I’m terrible at predictions, but in this we have stayed the course. There have been pleasant surprises like the effectiveness of reasoning, which, while allowing for fewer parameters, still demands even more compute.
Sparser multimodal MoEs will also allow for better continual learning. This is an old idea, e.g. https://t.co/ZjqVwwoy5L, which is finally being done at scale.
Successful scaling is mostly about organising people into effective teams for research, development and production. They have to be teams of happy and ambitious people who put the team first. Yes, tech VCs and CEOs: work-life balance matters to achieve prolonged success, something I think @demishassabis did really well at @GoogleDeepMind and which I promote at @MicrosoftAI.
Bitter lesson: it really is all about scaling and hard work by thousands of amazing people. Hardly bitter, but hopeful and inspiring.
Looking for a PostDoc, a PhD, and 3-6mth interns as part of my ERC project.
Homepage: https://t.co/fPr9gVYYIK
Böhme Lab: https://t.co/TLd4TstfJF
Reach out if you find this interesting. 👇
The proprietary frontier models of today are ephemeral artifacts. Essentially very expensive sandcastles. Destined to be washed away by the rising tide of open source replication (first) and algorithmic disruption (later).
the chunk-based approach seems neat, but it's also another (func-level) ML4VD paper with the common flaws imho.
I think people publishing in this domain should at least cite and address @niklas2484's "Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection". 80%+ acc won't be reflected in real life.
Can we statistically estimate how likely an LLM-generated program is correct w/o knowing what is a correct program for that task?
Sounds impossible, but it's actually really simple. In fact, our oracle-less eval can reliably substitute for a pass@1-based eval.
https://t.co/yoC0lThExe
Proud to share that our paper “Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection” received an ACM Distinguished Paper Award at ISSTA 2025 in Trondheim, Norway.
If you’re interested, the paper is available here: https://t.co/H73meyVy6Z
Thrilled to share a recent opinion piece in IEEE Security and Privacy (Vol. 23, Issue 3).
Basically a long-term perspective on the field meant for both researchers and practitioners.
📝 https://t.co/VjRXXQcFVL
Our paper "Top Score on the Wrong Exam" will be presented at #ISSTA25 🐣 in Trondheim!
📝https://t.co/m7FrZ36ach
🧑‍💻https://t.co/VoItZD0KVy
// @niklas2484@fuzzjing.
Apple gets it. Robots are going to be everywhere, but they won’t look like robots. Check out their new paper ELEGNT.
I believe this is the future of everyday objects: helpful and human.
1995-2025: The Decline of Germany & Japan vs US & China. Can All-Purpose Robots Fuel a Comeback?
In 1995, in terms of nominal GDP, a combined Germany and Japan were almost 1:1 economically with a combined USA and China. Only 3 decades later, this ratio is now down to 1:5!
Self-replicating AI-driven all-purpose robots may be the answer. Around 2000, Japan still was the country with the most robots; Germany was 2nd. Today, China is number 1. However, most existing robots are dumb. They are not adaptive like the coming smart robots that will learn to do all the jobs humans don't like, including making more such robots.
The text below extends an AI-based translation of my 2024 F.A.Z. guest article [FA24] written for a German audience, but its basic ideas apply to all countries with a strong engineering background. References and illustrations under https://t.co/tDpo1jhMgT
Build the AI-controlled all-purpose robot!
Germany is losing out - partly because performance counts for too little. How can the country keep up in artificial intelligence (AI)? I have a suggestion.
Today's AI seems impressive. But it is nothing compared to what will come in the next 20 years. I said that 40 years ago and I was right. I said that 20 years ago and I was right. And I'm saying it again today, and indeed, in 20 years' time, everything that currently seems impressive will seem trivial in retrospect. A key component of this is AI-controlled general-purpose robots.
Today, everyone is talking about Generative AI and ChatGPT. What many people don't know: the current boom in "Generative AI" using artificial neural networks has its roots in the early nineties at the Technical University of Munich, especially the "G" and the "P" and the "T" in "ChatGPT." At that time, we published "Artificial Curiosity" through what's now called "Generative Adversarial Networks" (1990, now widely used) [AC90-20][DLH][DLP], self-supervised pre-training for deep learning with long texts (1991, the P in ChatGPT stands for "pre-trained") [UN][UN0-3][NOB], and unnormalized linear Transformers (1991, the T in ChatGPT stands for "Transformer") [ULTRA]. The Long Short-Term Memory [LSTM0-15][VAN1] — the most cited AI of the 20th century and "arguably the most commercial AI achievement" [AV1] — was also developed in my lab at TU Munich. Here is an overview including references [MOST]. At that time, Munich was also the birthplace of the first self-driving cars in traffic [AUT][DLH].
So why are the biggest beneficiaries of AI today not in Germany, but in America and China? Germany has messed this up itself. Research and art follow the money, but unfortunately Germany has spent almost nothing on AI since the 1990s compared to other countries.
Let's take a closer look: Germany has been going downhill since the late 1990s, and not just in AI. In 1995, according to the IMF, Germany still had 33% of the economic power of the USA in nominal terms; today it has only 16%. In 1995, Japan (the origin of foundational neural network breakthroughs of 1967-1988 [GD1-2][RELU1][AMH1][NOB][CNN1][CNN1a][CNN1a+]) had 72 percent of the economic power of the USA, today it is only 15 percent. In 1995, the two big losers of WW II together were economically stronger than the USA; today they are less than a third as strong.
In fact, according to the IMF, Japan (which had the five most valuable listed companies in the world in 1990) and Germany were roughly as strong economically in nominal terms as the much larger USA and China combined! See the title image. Since then, things have gone downhill in waves. By 2008, Germany only had 25% of the nominal economic power of the USA, but at least the EU as a whole was still as strong as the USA and China combined [EU08].
As a result of the financial crisis, however, Germany's taxpayers then lost enormous sums of money to the USA, for which the crisis was not really a crisis at all, but a major financial gain that led to the loss of importance of German and other European banks. Since 2015, the relative decline of Germany (and the EU) has progressed particularly rapidly. Since then, Germany has spent a lot of money on things that have yielded little and has become increasingly poorer and less significant compared to the USA. And as I said, research and art follow the money.
Germany was also much stronger militarily in the 1990s than it is today. And even in sport, the decline in performance became clear. Until 2006, German athletes often led the Olympic medal tables, especially at the Winter Games [OLY10]. Today, they are mostly among the "distant runners-up." Even at the Summer Games, Germany still won 90 percent of the gold medals won by the USA in 1992, but only 30 percent in 2024. Its small neighbor, the Netherlands, has more. Why is that? Because Germany abolished its performance incentives. Example: as a pupil, I found the incentive provided by the points system in the National Youth Games enormously motivating: I wanted to be the best in the class. It was one of many performance incentives that have been deleted by politicians.
I continue to train excellent young German researchers. But they often don't see any attractive opportunities in Germany afterwards. Instead, many want to go to the best-equipped foreign (mostly American) AI labs of the big platform companies, where they can earn 350,000 euros or more straight after their doctorate, much more than a German chair holder, who only receives a good 100,000 euros plus bonuses. In foreign AI laboratories, researchers are also provided with far more computing resources (very important in AI). They don't have to write research proposals and are still allowed to publish and make a name for themselves.
Unfortunately, it looks a bit like my home country no longer wants to or can't really keep up in this merciless global competition for outstanding talent. I can't tell you what a shame I think that is. The incentives are wrong: many of the best and most expensively trained specialists are leaving the country and are being replaced by others who can contribute little to the country's success with a lot of tax money and false incentives. It's a self-reinforcing vicious circle.
What immigration policy would a rational country pursue? One that raises the average in the country through appropriate incentives. If someone comes into the country who is richer than the average of those who are already there, the average wealth in the country increases and they are likely to pay more into the social systems than they take out. If he has a higher IQ than the average, the average IQ rises. If he is less criminal than the average, the number of crimes per inhabitant decreases. If he can speak German better than average, the German language skills per inhabitant increase. And so on. Many politicians do not understand these truly simple correlations and instead set well-intentioned but deeply counterproductive incentives that do not raise the important averages, but lower them, and thus harm the country.
Germany has already worked its way back up from much worse valleys. I therefore remain a cautious optimist. But fundamental changes are needed.
AI in the physical world is still in its infancy
The only AI that works well today is AI in the virtual world behind the screen, for example for automatically summarizing documents, creating images, programs and PowerPoint slides. The next big thing will be AI in the physical world. However, the latter is much more demanding than the world behind the screen. So today it is quite easy for AI to learn how to play chess, Go or video games superhumanly well. But there is no AI-controlled soccer-playing robot that can keep up with a little boy. There is no robot that can do what a plumber can do. That will come at some point, but AI software research is not enough; it has to be combined with the physical world of machines and robots. That’s why in 2014 - when compute was 100 times more expensive than today - we founded our AI company for the physical world. Alas, like some of our projects, it may have been a bit ahead of its time, because the real world is very challenging. After all, passing the so-called "Turing Test" [TUR3,a,b][TUR21] is much easier than True AI in the physical world! But every 5 years, compute is getting 10 times cheaper, and smart robots are now starting to become a reality.
For every 10,000 AI software companies, there are perhaps only 10 AI robot companies. So the field is not yet overcrowded. With its strong mechanical engineering sector, Germany still has a chance of becoming an international leader. However, I already wrote this six years ago in my F.A.Z. article "AI is a huge opportunity for Germany" [FA18] (see also my earlier 2015 F.A.Z. article on intelligent robots [FA15]). The leader of the CDU/CSU parliamentary group at the time was Volker Kauder. He said that everyone had to read the article. I was invited to the Reichstag, where many famous politicians listened to what I had to say on the subject. I suggested investing a small number of billions for a world-class AI campus in an attractive city as a basis for further investment. The proposal was well received, and initial thought was given to it. A few weeks later, however, Kauder was no longer leader of the parliamentary group and everything came to nothing. While the major powers are now investing hundreds of billions in AI, Germany prefers to spend hundreds of billions on unemployed immigrants. I can only recommend that our politicians rethink the incentives in this country.
There is a huge opportunity in German mechanical engineering
What can Germany do to get back on its feet and avoid a further exodus of the best? How about a big, visionary yet realistic national project that, if successful, would have a truly world-changing impact? Namely: build an all-purpose robot that can learn to do all the jobs humans don't like!
In the not-too-distant future, someone will for the first time produce such intelligent (but not necessarily super-intelligent) robots at low cost, with which you can talk and interact and which you can teach something new without much prior knowledge (there are already approaches to this). A country with such versatile all-purpose robots would no longer need to worry about a shortage of skilled workers, secure pensions and unconditional basic income.
AI-controlled general-purpose robots would of course also be extremely exportable, as everyone would like to have them to do thousands of inconvenient jobs. And they would be extremely scalable: robots that can operate the tools and machines operated by humans can also build (and repair when needed) more of their own kind. I called this the ultimate form of scaling [JY24]. The country that, through a combination of private initiative, universities and industrial policy, is the first to produce such general-purpose robots will change world history. Let's go, Germany!
Very lucky to receive the ERC Consolidator this year! This is 5-year funding for groundbreaking research.
If you are interested in our perspective on software security analysis at scale, stick around and read on.
@ERC_Research #ERCCoG #MPI_SP @CASA_EXC
https://t.co/onAw5SBLfn
Hi all! I am looking to recruit 1-2 summer 2025 research interns to work with me at MPI-SP in Bochum, Germany, on topics relating to the ethics and harms of emerging technologies, including (but not limited to!) virtual reality, brain computer interfaces, and more! (1/2)
What does it feel like to do world-class research? If you are a CS undergrad who is interested in our topics, the Software Security group at #MPI_SP is hiring interns for summer & winter 2025!
Details:
📅 01 November 2024
✍️ https://t.co/rUmPIr8eFp
🛡️ https://t.co/TLd4TstNzd
It’s good to see the community raise the issue of extreme overfitting to evaluation benchmarks for LLMs. This is the real “Reflection” that the open-source community needs to have.