Why securing AI is harder than anyone expected and the approaching AI security crisis with @SanderSchulhoff
Sander is a leading researcher in the field of adversarial robustness, which is the art and science of getting AI systems to do things they shouldn't do, through jailbreaking and prompt injection.
What Sander shares in this conversation is essentially that all of the AI systems we use day to day are open to being tricked into doing things they shouldn’t, that there isn’t really a solution to this problem, and that the companies that try to sell solutions for this are mostly BS.
This conversation has nothing to do with AGI; this is a problem today. The only reason we haven’t seen massive hacks and serious damage from AI tools so far is that they haven’t been given that much power yet, and they aren’t that widely adopted yet. But with the rise of agents (which can take actions on your behalf), robots, and even AI-powered browsers, the risk is going to increase very quickly.
This is a really important topic that opened my mind and scared me, and it's something that we all need to have a basic understanding of as AI becomes more prevalent in our lives.
Inside:
🔸 A primer on jailbreaking and prompt injection attacks
🔸 Why AI guardrails don’t work
🔸 Why we haven’t seen major AI security incidents yet (but soon will)
🔸 Why AI browser agents are extremely vulnerable
🔸 The practical steps organizations should take instead of buying ineffective security tools
🔸 Why solving this requires merging classical cybersecurity expertise with AI knowledge
Listen now 👇
• YouTube: https://t.co/YkIpYg5BX6
• Spotify: https://t.co/BrtMzNQCE4
• Apple: https://t.co/Tsb9idYbpA
Thank you to our wonderful sponsors for supporting the podcast:
🏆 @datadoghq — Now home to Eppo, the leading experimentation and feature flagging platform: https://t.co/BsR16CMiyt
🏆 @getmetronome — Monetization infrastructure for modern software companies: https://t.co/63xKc647Dh
🏆 @gofundme Giving Funds — Make year-end giving easy: https://t.co/t9XOVhK7qT
3/8
so where do evals come from? not from a leaderboard. they come from error analysis. you read 1000+ real traces of your agent failing and you write down what fails
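a rough sketch of that error-analysis loop, assuming you've already read the traces and written a short free-text note per failure. the `TraceNote` type, the labels, and `rank_failure_modes` are all made up for illustration, not from any particular eval library:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TraceNote:
    """One hand-written observation about a failed agent trace (hypothetical shape)."""
    trace_id: str
    failure: str  # short free-text label written while reading the trace


def rank_failure_modes(notes):
    """Count how often each failure label occurs, most common first."""
    # Normalize case and whitespace so near-identical notes are grouped together.
    counts = Counter(note.failure.strip().lower() for note in notes)
    return counts.most_common()


# Illustrative notes only; real ones come from reading real traces.
notes = [
    TraceNote("t1", "wrong tool called"),
    TraceNote("t2", "hallucinated order id"),
    TraceNote("t3", "Wrong tool called"),
    TraceNote("t4", "ignored user constraint"),
    TraceNote("t5", "wrong tool called "),
]

ranked = rank_failure_modes(notes)
for label, count in ranked:
    print(f"{count:>3}  {label}")
```

the failure modes at the top of the list are the candidates worth turning into evals first.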
the cost of shipping code went to zero
taste didn't
but "taste" sounds mystical and unfixable, so nobody teaches it. here's the unmystical version: taste is just an eval you haven't written down yet
how you choose what to measure is what matters
1/8
microsoft MAI tech report is a gold mine, one of the most transparent for a model at this scale.
this model uses zero synthetic data or distillation from previous models. this means reasoning, agentic behavior, tool use are all learned fully during post-training with no cold start. bold choice that makes it harder and requires more iterations to reach sota, but you get FULL control over your model series and it proves they are serious about being a frontier lab.
the tech report is insanely detailed and precise about numbers. to give an example, they give the exact MFU across all the iterations of the model, with the exact changes etc. they also share the full scaling ladder recipe, to my knowledge this is the first time i've seen this in a tech report at this scale
let's look at all of this in this likely very long thread 🧵
We raised another $106M at a $2.6B valuation since announcing our last round three weeks ago.
Corgi has grown exponentially in the past couple of months, but we're only just getting started transforming one of the largest sectors in the US economy: insurance.
The end of fine-tuning.
This is a pretty significant change that no one is talking about: OpenAI is shutting down fine-tuning.
In fact, if you haven't done it before on their platform, you can't anymore...
There used to be a lot of discourse around "is prompt engineering dead", and some of the yeses were in favor of fine-tuning instead.
It seems that prompt engineering has outlasted fine-tuning, however!
Not really sure the long-term takeaway, I guess models are better at...
1/ 🚨 MATS Autumn 2026 applications are now open.
10-week fully-funded fellowship for aspiring AI alignment, security & governance researchers and field-builders.
📍 Berkeley + London
📅 Sep 28 – Dec 4, 2026
💰 $5,000/month stipend + $8,000/month compute
Apply by June 7 AoE ↓
• Agentic Shopping
- Agent that navigates and purchases items from various websites
• Evals for text, image, evals for evals, etc.
Internships or full time, significant experience required, in person or remote, email [email protected]
• Large Document Processing
- Parsing very large PDFs (e.g. 256-page documents)
- Matching long lists of unordered items to items in a document
- Embeddings-based hierarchical search and matching
...
I got married this past weekend so I did what any rational @AnthropicAI employee would do and had Claude Code analyze 12 years of iMessages with my wife, then Claude Design used that data to whip up a website for our guests in just minutes.
Excerpt from a Claude 4.7 Research report; prompt: “Explain the origins of prompt injection.”
Surreal to see an LLM perfectly explain a tweet I made specifically about text that tricked then-SoTA LLMs, accurate down to my use of doubled exclamation points:
Had a Jane Street phone interview in 2016. "Price a 6-month forward on carrots."
There's no carrot futures market, so I build one from scratch: seasonal harvest cycles, USDA demand elasticity, cold storage decay rates.
One trader stops me. "Your storage cost function– you're modeling the carrot as dead inventory. Like grain in a silo." He asks me the metabolic respiration rate of a post-harvest carrot at 2°C. I estimate.
"Your forward is overpriced by exactly that shrinkage. The underlying is consuming its own sugars. It's alive." Good correction. I adjust the model. I think I've recovered.
Rejection email comes the next morning. Subject: "Ethical Review." My framework, they write, "relied on the severance of the root organism from its growth medium." The question about respiration was a test. The carrot was still alive and I'd built an entire derivatives structure on top of its death without questioning whether harvest was an acceptable act.
I pull up the recruiter's original email. It doesn't say Jane Street. It says Jain Street– a non-violent quantitative commodities fund.
The carrot was never supposed to be priced. It was supposed to be refused. I later learn the only candidate who passed that round was a former monk from Gujarat who sat in silence for eleven minutes and said, "I cannot put a price on life." He's now a partner.