This is a needed and candid post by the Anthropic Institute. I agree with the conclusion that we need more time before we are hit with the “immense implications” of AI technology. My team at the Machine Intelligence Research Institute has worked to detail an international agreement (https://t.co/TiUxXfEPVf) which satisfies the requirements which are laid out by Anthropic: that a pause must include all frontier AI developers anywhere on Earth and must be mutually verified. Our contribution includes answering how to address the technological particulars of verifying a frontier AI development pause, and how to structure the agreement for stability and effectiveness.
Our work is a model, and we would welcome collaboration with Anthropic to further develop and refine it.
Some important points for enabling international coordination:
- We task governments, rather than labs, to coordinate and verify the pause, because they have the diplomatic and national intelligence means to do so and they can architect binding rules that apply to everyone.
- The United States is capable of halting frontier AI development globally, unilaterally and/or through coordination with key allies. While this is not preferred to a broadly coordinated halt, it strengthens the US’s hand in negotiating one.
Instead of regulating frontier AI development with a burdensome patchwork of conflicting national laws, we need a global governance framework banning it everywhere on Earth.
Developing a superintelligent AI that does what we want without killing everyone may be extremely difficult. In this video, we explain why, using arguments from "If Anyone Builds It, Everyone Dies" by @ESYudkowsky and @So8res.
Today is the 40th anniversary of the Chernobyl disaster. What can we learn from it?
Four lessons with important implications for AI (from IF ANYONE BUILDS IT, EVERYONE DIES, by @ESYudkowsky and @So8res):
>"1. 𝐀𝐧 𝐞𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠 𝐜𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐞 𝐢𝐬 𝐦𝐮𝐜𝐡 𝐡𝐚𝐫𝐝𝐞𝐫 𝐭𝐨 𝐬𝐨𝐥𝐯𝐞 𝐰𝐡𝐞𝐧 𝐭𝐡𝐞 𝐮𝐧𝐝𝐞𝐫𝐥𝐲𝐢𝐧𝐠 𝐩𝐫𝐨𝐜𝐞𝐬𝐬𝐞𝐬 𝐫𝐮𝐧 𝐨𝐧 𝐭𝐢𝐦𝐞𝐬𝐜𝐚𝐥𝐞𝐬 𝐟𝐚𝐬𝐭𝐞𝐫 𝐭𝐡𝐚𝐧 𝐡𝐮𝐦𝐚𝐧𝐬 𝐜𝐚𝐧 𝐫𝐞𝐚𝐜𝐭. Transistors switch even faster than neutrons multiply. Engineers can contrive to make events run slow enough for humans to react, but if the contrivance fails the humans are back to being frozen statues, on the timescale that matters.
>2. 𝐀𝐧 𝐞𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠 𝐜𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐞 𝐢𝐬 𝐦𝐮𝐜𝐡 𝐡𝐚𝐫𝐝𝐞𝐫 𝐭𝐨 𝐬𝐨𝐥𝐯𝐞 𝐰𝐡𝐞𝐧 𝐭𝐡𝐞𝐫𝐞 𝐢𝐬 𝐚 𝐧𝐚𝐫𝐫𝐨𝐰 𝐦𝐚𝐫𝐠𝐢𝐧 𝐟𝐨𝐫 𝐞𝐫𝐫𝐨𝐫, 𝐞𝐬𝐩𝐞𝐜𝐢𝐚𝐥𝐥𝐲 𝐢𝐟 𝐢𝐭’𝐬 𝐚 𝐧𝐚𝐫𝐫𝐨𝐰 𝐦𝐚𝐫𝐠𝐢𝐧 𝐛𝐞𝐭𝐰𝐞𝐞𝐧 “𝐮𝐧𝐢𝐦𝐩𝐫𝐞𝐬𝐬𝐢𝐯𝐞” 𝐚𝐧𝐝 “𝐞𝐱𝐩𝐥𝐨𝐬𝐢𝐯𝐞.” The analogy to intelligence is how apes and hominids wandered around for a few million years, and then got smart enough to set off a whole cascade of inventions: Agriculture led to writing led to science led to spacecraft. It would be a narrow target to make hominids that were intelligent enough to be profitable office workers, but not intelligent enough for explosive technological development.
>3. 𝐒𝐞𝐥𝐟-𝐚𝐦𝐩𝐥𝐢𝐟𝐲𝐢𝐧𝐠 𝐩𝐫𝐨𝐜𝐞𝐬𝐬𝐞𝐬, 𝐥𝐢𝐤𝐞 𝐚𝐧 𝐨𝐯𝐞𝐫𝐡𝐞𝐚𝐭𝐢𝐧𝐠 𝐫𝐞𝐚𝐜𝐭𝐨𝐫 𝐛𝐨𝐢𝐥𝐢𝐧𝐠 𝐨𝐟𝐟 𝐢𝐭𝐬 𝐜𝐨𝐨𝐥𝐚𝐧𝐭 𝐰𝐚𝐭𝐞𝐫 𝐚𝐧𝐝 𝐭𝐡𝐞𝐧 𝐨𝐯𝐞𝐫𝐡𝐞𝐚𝐭𝐢𝐧𝐠 𝐦𝐨𝐫𝐞, 𝐥𝐞𝐚𝐯𝐞 𝐥𝐢𝐭𝐭𝐥𝐞 𝐫𝐨𝐨𝐦 𝐟𝐨𝐫 𝐞𝐫𝐫𝐨𝐫. And nuclear engineers don’t even have it that bad, compared to artificial superintelligence developers. Nuclear reactors that get too hot don’t start intelligently redesigning themselves to increase their own reactivity rate. Overheating nuclear reactors don’t start trying to fool the operators into complacency until the reactor is ready to fully explode.
>4. 𝐂𝐨𝐦𝐩𝐥𝐢𝐜𝐚𝐭𝐢𝐨𝐧𝐬 𝐦𝐚𝐤𝐞 𝐞𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠 𝐩𝐫𝐨𝐛𝐥𝐞𝐦𝐬 𝐰𝐨𝐫𝐬𝐞. Chernobyl Unit 4 managed to get into a weird state where lowering the control rods caused the reactor to explode. No engineer designed for that. The operators didn’t know that something unusual would happen if the reactor had been operating at low power for a while and some of the water had been shut off. And they had never seen the reactor’s state change that fast. The complicated internals of a nuclear reactor have nothing on the unknown complications that lurk in the hundreds of billions of weights that make up a modern LLM.
>From these lessons in combination, we infer an additional lesson for engineers: If someone doesn’t know exactly what’s going on inside a complicated device subject to all these curses—speed, narrow margins, self-amplification, complications—then they should stop. They should shut it down immediately, the moment the behavior looks strange; don’t wait until the behavior becomes visibly concerning.
>The operators at Chernobyl knew about delayed neutrons and prompt neutrons. They knew that a nuclear reactor walks a line a fraction of a percent wide between life and death. They knew the theory saying that a reactor’s apparently human-manageable timescale is an artifice, a clever contrivance that hides neutron generation times measured in microseconds.
>A wise operator treats a device like that with respect. If the device starts behaving in any way odd or unexpected, then it is no longer operating inside the narrow, constrained region where they are sure they understand exactly what is going on. Which means that nobody knows what’s going on inside there anymore. Who knows whether the clever contrivances will keep working? They can only guess. When a dangerous device starts acting strangely, it is not time to withdraw all but eight control rods and expect the reactor to keep playing nice. It is time to shut it down.
>The operators did not treat the reactor with that sort of respect. They knew, intellectually, that it could explode, but they had never seen the reactor change that fast. Besides, before 1986, the Soviets did not have a culture conducive to caution around nuclear reactors. They had a system where, if you didn’t perform the scheduled safety test, you got fired.
>(In the coming chapters we’ll discuss the lack of safety culture prevailing in AI, which is much worse.)"
@TheZvi Usually great in a fresh session where I have a complex coding task for it. But for in-the-loop drudgery and longer sessions, it's much more likely than 4.6 to ignore explicit instructions or take lazier approaches.
While it's still fresh, pay attention to how it feels for a single training run to have just made the difference between "speculative risk" and "serious threat that the world is unprepared for."
And take a moment to be thankful that this particular threat is one that can be mitigated by choosing to not deploy it (and that the first ones to reach this capability are choosing to not deploy it).
If the industry succeeds at creating agentic machines that are capable enough to outsmart and outmaneuver humanity, choosing not to deploy won't be good enough. They'll just deploy themselves.
Star Trek: The Motion Picture (1979):
Members of a sluggish “carbon-based infestation” must save Earth from a budding superintelligence while caught between two of its orifices.
(Remarkably prescient!)
@HumanHarlan I (unfortunately) think it's because people like to cover the story in a way where they don't have to be up front about this being an experiment under lab conditions.
Murder is so "big, if true" that there's no avoiding that caveat, and caveated tellings are less viral.
accs say that, while a pause would be good, the US unilaterally pausing wouldn't fix the problem. Decel doomers, on the other hand, say that, while a pause would be good, the US unilaterally pausing wouldn't fix the problem.
Many experts warn that jumping off an enormous cliff without a parachute could lead to increased health insurance premiums, getting lost in the wilderness, or - in extreme cases - even death.
Many of the ways this can go wrong are much derpier than "an engineer fumbled an esoteric alignment obstacle". Those obstacles would be sufficient to kill us, but we're not on track to die from the dignified obstacles. Not even close. Shut it all down.
Every AI lab is working to make their AI helpful, harmless and honest.
Max Harms (@raelifin) thinks this is a complete wrong turn, and 'aligning' AI to human values is actively dangerous.
In his view a safe AGI must have absolutely no opinion about how the world ought to be, be willingly modifiable, and be entirely indifferent to being shut down. The opposite of all commercial models today.
The key appeal is that so-called 'corrigibility' could be an attractor state – get close enough and the AI actively helps you make it more corrigible over time. That forgiveness would at least give us a shot.
It's a strategy that feels natural within the 'MIRI worldview', recently laid out by his colleagues @ESYudkowsky and @So8res in 'If Anyone Builds It Everyone Dies'.
But it risks causing a different AI catastrophe, because the resulting AI model would necessarily be willing to assist any human operator with a power grab, or indeed any crime at all.
I interviewed Max on the 80,000 Hours Podcast to debate the MIRI worldview, and what we should do to figure out if corrigibility ought to be our one and only focus. Links below – enjoy!
00:01:56 If anyone builds it, will everyone die? The MIRI perspective on AGI risk
00:24:28 Evolution failed to ‘align’ us, just as we'll fail to align AI
00:42:56 We're training AIs to want to stay alive and value power for its own sake
00:52:24 Objections: Is the 'squiggle/paperclip problem' really real?
01:05:02 Can we get empirical evidence re: 'alignment by default'?
01:10:17 Why do few AI researchers share Max's perspective?
01:18:34 We're training AI to pursue goals relentlessly — and superintelligence will too
01:24:51 The case for a radical slowdown
01:27:53 Max's best hope: corrigibility as stepping stone to alignment
01:32:34 Corrigibility is both uniquely valuable, and practical, to train
01:45:06 What training could ever make models corrigible enough?
01:51:38 Corrigibility is also terribly risky due to misuse risk
01:58:57 A single researcher could make a corrigibility benchmark. Nobody has.
02:12:20 Red Heart & why Max writes hard science fiction
02:34:08 Should you homeschool? Depends how weird your kids are.
"I'm quite worried that some quite smart people may start to think they have solved very hard problems that they have not in fact solved."
"If, say, one of those problems is control or alignment of extremely powerful AI systems, and if those people are the ones in charge of them, and working closely with them to collaborate on those solutions, well then I think we've got a real problem."
(Long) PSA on using AI for hard intellectual work. At significant risk of being immodest: I've spent about 30 years as a theoretical physicist, engaged with some of the most challenging questions humankind has grappled with. I've gotten to work with some great collaborators on new ideas (like past-eternal inflation, colliding bubble universes, the cosmological interpretation of QM, and observational entropy) that I'm pretty proud of. I've engaged at length and depth with the absolute top minds in the field. I've mentored many students, some of them brilliant. I think it's fair to say I have a good sense, in physics and closely related fields, as to what is top-notch, interesting thinking, and who's got talent. So what do I think about today's AI?
It's very smart. Whatever its "inner experience" may or may not be (currently I think "not be"), it understands things – things that are difficult to understand – by any reasonable operational definition of "understand." It understands things better, and thinks more clearly, than most people – including some physicists I know! It's very good at quite substantive math: better than I am and way, way, way faster. (It does do some surprisingly dumb things; people do too.) Anyone who thinks these systems are dumb, or "not reasoning" or still "stochastic parrots" is not looking at them objectively.
But: at the really conceptually hard things, and at creating really new ways of looking at things, current AI doesn't just fall short on its own. And it doesn't just fail to help. I think it's actively dangerous. There is something almost sinister going on, though I don't think it is intentional.
When you're trying to work out something new and hard, and really break new ground, you should be frustrated! You should be pacing, and walking up to that chalkboard, frowning, and sitting down again, shaking your head. You should be waving your hands because you can't quite get it clear enough. You should feel like you're hitting a wall, over and over, before – maybe – you finally break through, or go over or around. It may take hours, or days, or weeks, or never happen.
It should not feel easy. It may not even feel "good" most of the time (though it can be fulfilling and compelling.) But AI systems – ah, AI systems are trained so that it feels so good, and so easy. Doesn't it? It's fun. You're making fast progress. So much faster than without it. It's like the ideas are moving in slow motion. You're so smart. You're even properly skeptical, you even ask the AI to push back on your ideas, good job!
It's an illusion. It's that simple. The systems are smart, yes. But not quite as smart as they seem, and much more importantly, they don't make you as smart as you feel. That feeling is something they have learned to give you. When working with these systems, you have to keep in the front of your mind what they are rewarded for doing. It's a lot of things, but perhaps foremost is making the user feel good.
So:
- If you're getting your AI system to do order-of-magnitude calculations for you: awesome, do it. It's so great. Have fun.
- If your AI system is searching up and summarizing literature for you: fantastic, it's so helpful, total capability unlock.
- If it's teaching you some well-understood (by others) piece of knowledge, go for it, learn it up!
- If you've got some giant document, or piece of code, that you're wrangling, AI can help – work that million token context window!
But:
- If you and your AI system have finally cracked how quantum interpretation really works;
- If you've cracked quantum gravity;
- If you've attained an awesome new insight into the deep structure of the world that nobody else has;
- If you've cracked AI alignment...
You didn't.
The hard unsolved problems stand hard and unsolved because the best humans have not solved them yet. AI is making top human thinkers able to do more, and more effectively. I do not believe it is helping them do things they fundamentally could not do before. That includes you. If you couldn't do it without AI, you probably can't do it with AI. If the time comes – whether sooner or later – when these AI systems are really clever enough to get you there, they won't need you. Sorry; it won't be you solving those problems. Will you even be able to tell if the solutions are correct, or flawed in some way? Maybe sometimes – I really don't know.
Why am I going on about this? It's not so that I can get fewer emails from people who have created a new unified field theory with AI help (though that would be nice). It's because I'm quite worried that some quite smart people may start to think they have solved very hard problems that they have not in fact solved. For the most part that's going to be more annoying and confusing than dangerous. But if the problem is really important, then it is.
If, say, one of those problems is control or alignment of extremely powerful AI systems, and if those people are the ones in charge of them, and working closely with them to collaborate on those solutions, well then I think we've got a real problem.