NEW PAPER: Could an LLM agent subtly sabotage your code?
We conducted a red-blue team game where the red team designed agents to sabotage, and the blue team designed monitors to catch the agent.
Three surprising results ahead 🧵🛳️
AI coding agents are increasingly writing production code - with tools, file access, and execution permissions.
That power accelerates development, but also introduces new security risks if agents act against user intent 🧵
@opheliamoding @funplings lol yeah I did a bit of googling which said as much and realised maybe my intuition was coming from the people I meet which will probably select from roughly the same demographics as vibecamp
Our Anthropic Fellows project is now public!
The labs are planning to hand off AI safety research to AIs, but can we trust these AIs? We explore a way to control them for "fuzzy" tasks like writing research proposals. This is a whole new direction in diffuse AI control!
I feel somewhat worried about AI safety as a whole optimising for empirical work/solutions that work for current models.
I don't really care about e.g. decision theory, but I think general macrostrategy/what do we do with aligned or "almost-aligned" AGI is v underinvested in
Back in the day I was long empirics, but I think we've managed to successfully scale empirical/near-term AI safety really well, and not so much future-facing AI safety. Another factor is that the early advocates of theory were just crazy lesswrongers (much love to u guys), which made working with them less attractive to, like, the "average cracked engineer/PhD student" and imo just meant they had less useful ideas. Feels like a new crop of more mainstream futurists has not emerged though
@yong_zhengxin (and I guess important to create maximally-subtly misaligned models to test out any e.g. scalable oversight or "automate ai safety research" schemes)