Hugo @Hugo007600 - Twitter Profile

Pinned Tweet

about 1 month ago

Evals are great, but they only catch deception if it's clearly present in a model's output. Instead, we should audit the model's internal mechanisms, even if the output itself looks normal. This is the problem of mechanistic anomaly detection, the focus of our new ICML paper. 🧵

Hugo007600 retweeted

Geoffrey Irving

@geoffreyirving

9 days ago

We are starting a new, nonprofit alignment organization, ⊢ Sequent Research, bringing together researchers previously on UK AISI’s Alignment Team, Timaeus, and elsewhere to research how to align superintelligence. We are hiring! 🧵

Hugo007600 retweeted

Hesam Asadollahzadeh @HesamAsdz

about 2 months ago

My first PhD paper is now accepted @icmlconf 🥳 #ICML2026

Hugo @Hugo007600

about 1 month ago

@promilos Yes, we do have to define 'normality' in a potentially fallible way using data. Check out the trusted set contamination experiments in Appendix E.2 for more on how the method deals with this and empirical performance.

Hugo @Hugo007600

about 1 month ago

Overall, we're excited about decorrelated approaches to white-box monitoring and think this is an important direction for AI safety work. Read our paper here: https://t.co/RevIUGPGES and come say hi at ICML! 8/8

Hugo @Hugo007600

about 1 month ago

Our method has some caveats: it is more expensive than single-pass latent-space methods, and requires BYO trusted samples to build up the picture of what "normality" looks like, mechanistically. 7/8

Hugo007600 retweeted

Nathan 🔎

@NathanpmYoung

about 1 year ago

"the operation is very likely to go okay on your son"
"what do you mean, 'very likely'?"
"well it's hard to know with operations, there is a lot of disagreement about the probabilities"
"what probabilities do people give?"
"well personally I think this operation is likely to succeed. some people think it's certain, others think this operation never works. among top surgeons, the median view is that it goes bad about 5% of the time"
"wait, this operation goes wrong maybe 5% of the time?"
"that is one view, as I say, it's hard to know"
"and what does 'goes bad' mean"
"it means death. Or worse"
"DEATH. You're telling me the median top surgeon thinks there's a 5% chance operation will kill my son."
"yes, or worse. but that is the median view. there's a pretty broad range of opinion on the subject."
"okay, but the operation is necessary, right?"
"no. but it might be extremely beneficial. It might improve your son's quality of life by two or ten or a hundred fold. It's very hard to know."
"can we delay it? can he have this operation next year?"
"we are really quite excited about doing this operation now. we think it would be good for our hospital and we reckon we need to gain experience in doing this kind of procedure. otherwise the expertise might go elsewhere. also if it works we want to help many other people soon"
"so you want to do it on my son?"
"yes. he's a great test candidate"
"now?"
"we still have a few kinks to work out, but as soon as possible, if that's okay"

To me, this is the situation we are in with AI. Except it isn't someone's son, it is everyone's son. Everyone's daughter. All of us.

Maybe the operation goes well, and we have far better quality of life. Cure disease, massive wealth, new experiences. I think most people underrate how good this looks.

Maybe it goes badly and we die. AI researchers think there is a 5% chance it kills us all. Thousands were surveyed. The typical one thought it was 5%, some thought lower, some higher. The AI CEOs have basically all agreed with this (or thought it was riskier).

I am not arguing that "AI is good" or "AI is bad" or even "we need to take a balanced approach". I argue it is worth paying attention to. Worth understanding. If someone were operating on my child and said they didn't know how it would go, I'd want to understand the operation. That's why I want to understand AI.

Hugo007600 retweeted