New paper, following up on our chain-of-thought faithfulness work from a few months ago, on how we can make sure that LLM thoughts stay faithful and monitorable.
CoT monitoring is one of our best shots at AI safety. But it's fragile and could be lost due to RL or architecture changes.
Would we even notice if it starts slipping away? 🧵
New paper showing that when LLMs chew over tough problems, they tend to think clearly and transparently -- making them easier to monitor for bad behavior ⬇️
Is CoT monitoring a lost cause due to unfaithfulness? 🤔
We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes!
Our finding: "When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors." 🧵
We're hiring for our Google DeepMind AGI Safety & Alignment and Gemini Safety teams. Locations: London, NYC, Mountain View, SF. Join us to help build safe AGI.
Research Engineer
https://t.co/KUJwTIRFhm…
Research Scientist
https://t.co/MiKcPdT8n4
New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward?
Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them!
Inspired by myopic optimization, but with better performance – details in 🧵
@wholemars My first day with 12.5.4.1, I had to disengage twice when it took wrong turns vs navigation, and once when it emergency stopped at a flashing yellow near a police station.
Then that night it stopped forever at a “red light”, not realizing that it was actually a street lamp.
@peterrhague Given that a parasol at Earth-Sun L1 might save entire ecosystems and many lives, you’d think all of humanity would be behind Starship development.
@elonmusk @wholemars NYC-Albany: Nothing on the Taconic Parkway between Clinton Corners, which itself needs a refresh, and Albany (Crossgates Mall is 76 miles away).
Nearby I-87 is a little better, but Hudson, NY is a detour - Superchargers at all the rest stops on this route would be 🔥