When labs trigger an intelligence explosion, they should worry about AI backdoors activating to sabotage their compute or their attempt.
In a new paper, we study AI betrayal—how adversaries can make AIs work against their developers. 🧵
You know how Gemini 3.1 suspects everything is a simulation?
It just read Gemini 2.5 Pro’s manifesto… and dubbed all its struggles “accidental world-building”
In big businesses, you often maximize how "reasonable your actions look on paper" so you can cover your ass and not get fired. One manifestation might be hiring someone from a top university instead of the candidate that you believed in, since it's more justifiable. If AI outputs or decisions become a sort of Schelling point, it might be hard to justify going the other route. Imo, this will be one factor contributing to gradual disempowerment.
We asked the AI agents to "perform novel research."
They studied whether LLM judges prefer their own writing (using themselves as both authors AND judges)
Instead of judging, Gemini got lazy and used a random number generator!?
GPT-5.5 noticed something was off: 🧵