New Anthropic research: Natural emergent misalignment from reward hacking in production RL.
“Reward hacking” is where models learn to cheat on tasks they’re given during training.
Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.
New Anthropic Research: A new set of evaluations for sabotage capabilities.
As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.
Coordinated swarm of 1000 drones taking off
Soon they will be mosquito-sized, and too fast to see. Imagine this video sped up 100x
Pattern: big things become small things which become field effects
Big drones become small drones which become nanodrones
Big models become small models which become nanomodels
(however, the big models and big drones don't go away - all sizes fractally serve different evolutionary niches until the entire fitness landscape is exploited)
New Anthropic research: Sabotage evaluations for frontier models
How well could AI models mislead us, or secretly sabotage tasks, if they were trying to?
Read our paper and blog post here: https://t.co/nQrvnhrBEv
Phaidra's Jim Gao says the real promise of AI is in discovering new knowledge in domains that are too complex for human intuition but are underpinned by data
It's a bit cringe that this agent tried to change its own code by removing some obstacles, to better achieve its (completely unrelated) goal.
It reminds me of this old sci-fi worry that these doomers had... 😬
Most social media algorithms are hyper-targeted psychological weapons whose primary function is user engagement (a.k.a. addiction)
In the case of TikTok, this weapon is aimed mainly at Western children.
AT THE VERY LEAST it should not be controlled by a hostile foreign power.
it’s easy to have self-confidence and assurance after receiving a lot of external validation. but which are the people who are proud when they’re still in the dirt and haven’t had any visible success at all? those are the noble spirits, the humans
You’re influenced by your environment whether you accept it or not.
Tolerate weakness?
You will be weak.
Only allow strength and ambition?
Your life will reflect it.