Is Vibe Coding Safe?
There is finally research that goes deep into this question.
Here is what the research found:
AI coding agents can write functional code. But functional doesn't mean safe.
The rise of "vibe coding," where developers hand off tasks to AI agents with minimal oversight, is accelerating. More autonomy, more speed, more productivity. The assumption: if it works, it's good enough.
But working code and secure code are not the same thing.
This new research introduces SUSVIBES, a benchmark of 200 real-world feature requests from open-source projects, specifically tasks that previously led to vulnerable implementations when assigned to human programmers.
The results are striking!
When SWE-Agent with Claude Sonnet 4 tackles these tasks, 61% of solutions are functionally correct. Only 10.5% are secure.
That's a massive gap. Six out of ten agent solutions work. Roughly one in ten is safe for production.
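To make the gap concrete, here is a quick back-of-the-envelope sketch that turns the reported rates into task counts. It rests on two assumptions the summary above does not spell out: that both percentages are measured over all 200 benchmark tasks, and that every secure solution is also counted as functionally correct.

```python
# Back-of-the-envelope: convert the reported rates into task counts.
# Assumption: both rates are measured over all 200 SUSVIBES tasks.
# Assumption: secure solutions are a subset of the functionally correct ones.
TOTAL_TASKS = 200
FUNCTIONAL_RATE = 0.61   # functionally correct solutions (from the post)
SECURE_RATE = 0.105      # secure solutions (from the post)


def to_count(rate: float, total: int = TOTAL_TASKS) -> int:
    """Round a rate over `total` tasks to a whole number of tasks."""
    return round(rate * total)


functional = to_count(FUNCTIONAL_RATE)         # 122 tasks
secure = to_count(SECURE_RATE)                 # 21 tasks
working_but_not_secure = functional - secure   # 101 tasks

print(f"functionally correct: {functional}")
print(f"secure:               {secure}")
print(f"work but not secure:  {working_but_not_secure}")
```

Under those assumptions, roughly 100 of the 200 tasks end up with code that passes functional checks while still being vulnerable.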
The researchers tested multiple frontier agents and found a consistent pattern: all agents perform poorly in terms of software security. This isn't a model-specific issue. It's systemic.
Even more concerning: adding vulnerability hints to feature requests, which warn agents about potential security issues, does not fix the problem. The countermeasures that seem obvious don't work for these agentic systems.
As developers or organizations race to adopt AI coding agents for speed and efficiency, they may be trading security for velocity.
Paper: https://t.co/ExZEjWLAxD
The AI Consumer Index (ACE)
Most AI benchmarks today focus on reasoning and coding.
But most people use AI to shop, cook, and plan their weekends. In those domains, LLM hallucinations continue to be a real problem.
73% of ChatGPT messages (according to a recent report) are now non-work-related. Consumers are using AI for everyday tasks, and we have no systematic way to measure how well models perform on them.
This new research introduces ACE (AI Consumer Index), a benchmark assessing whether frontier models can perform high-value consumer tasks across shopping, food, gaming, and DIY.
Consumer tasks require grounding in real-world information. A model that hallucinates a product price or provides a dead link isn't just wrong, it's actively unhelpful. ACE's grading methodology dynamically checks whether responses are grounded in retrieved web sources, penalizing hallucinations with negative scores.
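A minimal sketch of how a rubric criterion can end up with a negative score when hallucinations are penalized. The three outcome values (+1, 0, -1) and the plain averaging are illustrative assumptions, not ACE's published formula, and the names `Outcome` and `criterion_score` are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical per-response grades for one rubric criterion."""
    MET_AND_GROUNDED = 1   # requirement met and backed by a retrieved source
    NOT_MET = 0            # requirement simply not satisfied
    HALLUCINATED = -1      # claim made but not grounded (e.g. a dead link)


def criterion_score(outcomes: list[Outcome]) -> float:
    """Average the grades; assumed aggregation, for illustration only."""
    if not outcomes:
        raise ValueError("need at least one graded response")
    return sum(o.value for o in outcomes) / len(outcomes)


# Made-up example: 10 responses graded on a "provides link(s)" criterion.
graded = (
    [Outcome.HALLUCINATED] * 6
    + [Outcome.NOT_MET] * 3
    + [Outcome.MET_AND_GROUNDED] * 1
)
score = criterion_score(graded)  # (-6 + 0 + 1) / 10 = -0.5
print(f"criterion score: {score:.0%}")
```

The point of the sketch: a model that stays silent scores 0 on a criterion, while a model that confidently invents answers can drop below zero.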
The results expose a substantial gap: GPT-5 (Thinking = High) leads at 56.1%, followed by o3 Pro at 55.2%. The best model scores only 45.4% on Shopping. Models frequently hallucinate prices and product features, scoring negative on grounded criteria.
The study found that on "Provides link(s)" in Shopping, Gemini 3 Pro scores -54%. That's not just failing to provide links, it's confidently providing dead or fabricated ones. Other models, such as Opus 4.5, show similar problems. Multi-agent systems can help here, but the first step is being aware of the issue.
The benchmark includes 400 hidden test cases created by 47 domain experts. Each case has fine-grained rubrics distinguishing whether failures come from not meeting requirements versus hallucinating information.
Paper: https://t.co/VBSBCJMFHQ
ACE reveals the gap between benchmark performance and real-world utility.