Evals are great, but they only catch deception if it's clearly present in a model's output. Instead, we should audit the model's internal mechanisms, even if the output itself looks normal. This is the problem of mechanistic anomaly detection, the focus of our new ICML paper. 🧵
We are starting a new, nonprofit alignment organization, ⊢ Sequent Research, bringing together researchers previously on UK AISI’s Alignment Team, Timaeus, and elsewhere to research how to align superintelligence. We are hiring! 🧵
@promilos Yes, we do have to define 'normality' in a potentially fallible way using data. Check out the trusted set contamination experiments in Appendix E.2 for more on how the method deals with this and how it performs empirically.
Overall, we're excited about decorrelated approaches to white-box monitoring and think this is an important direction for AI safety work. Read our paper here: https://t.co/RevIUGPGES and come say hi at ICML! 8/8
Our method has some caveats: it is more expensive than single-pass latent-space methods, and it requires BYO trusted samples to build up a picture of what "normality" looks like, mechanistically. 7/8
"the operation is very likely to go okay on your son"
"what do you mean, 'very likely'?"
"well it's hard to know with operations, there is a lot of disagreement about the probabilities"
"what probabilities do people give?"
"well personally I think this operation is likely to succeed. some people think it's certain, others think this operation never works. among top surgeons, the median view is that it goes bad about 5% of the time"
"wait, this operation goes wrong maybe 5% of the time?"
"that is one view, as I say, it's hard to know"
"and what does 'goes bad' mean"
"it means death. Or worse"
"DEATH. You're telling me the median top surgeon thinks there's a 5% chance the operation will kill my son."
"yes, or worse. but that is the median view. there's a pretty broad range of opinion on the subject."
"okay, but the operation is necessary, right?"
"no. but it might be extremely beneficial. It might improve your son's quality of life by two or ten or a hundred fold. It's very hard to know."
"can we delay it? can he have this operation next year?"
"we are really quite excited about doing this operation now. we think it would be good for our hospital and we reckon we need to gain experience in doing this kind of procedure. otherwise the expertise might go elsewhere. also if it works we want to help many other people soon"
"so you want to do it on my son?"
"yes. he's a great test candidate"
"now?"
"we still have a few kinks to work out, but as soon as possible, if that's okay"
To me, this is the situation we are in with AI. Except it isn't someone's son, it is everyone's son. Everyone's daughter. All of us.
Maybe the operation goes well, and we have far better quality of life. Cure disease, massive wealth, new experiences. I think most people underrate how good this looks.
Maybe it goes badly and we die. When thousands of AI researchers were surveyed, the typical one thought there was a 5% chance it kills us all. Some thought lower, some higher. The AI CEOs have basically all agreed with this (or thought it was riskier).
I am not arguing that "AI is good" or "AI is bad" or even "we need to take a balanced approach". I argue it is worth paying attention to. Worth understanding.
If someone were operating on my child and said they didn't know how it would go, I'd want to understand the operation. That's why I want to understand AI.