Drake Thomas @MaskedTorah - Twitter Profile

2 days ago

@SafetyChanges The new bar isn't strictly higher; the bar for degree of harm is lowered (comparable to or worse than COVID-19 instead of "far beyond"). This anticipated change was also signposted in section 2.2 of the Opus 4.7 system card: https://t.co/oFM90BJmq3

Drake Thomas @MaskedTorah

5 days ago

>which will later in 6.6.3 be confirmed by there being unverbalized grader awareness in 5% of cases, that rises to exploitative levels in 0.5% of cases. Even that seems more low than high.

6.6.3 caveats uncertainty in the numbers a bunch! One sig fig is too many for those values; I agree they're more likely to be underestimates than overestimates, though there are factors going both ways. And note the exploitative stuff isn't a strict subset of the unverbalized stuff, some of it is potentially in situations where "exploitation" was basically fine and expected, or at least obvious from CoT.

Drake Thomas @MaskedTorah

5 days ago

>One could say ‘wait, what is even the problem here if it happens,’ since the application is against the terms of service, so the target lab kind of deserves whatever it gets.

deserves schmeserves, the problem is object level catastrophic harm to the rest of the world! A model merely destroying a lab's commercial prospects via screwing up capability work is out of scope for the RSP; I think of the risk from this threat model as indeed centrally about things like deliberately screwing up alignment work, misaligning a successor model, disabling safeguards and control measures, etc.

Drake Thomas @MaskedTorah

5 days ago

@Miles_Brundage @TheStalwart https://t.co/4aYDLYUTbU

Drake Thomas @MaskedTorah

6 days ago

@David_Kasten @TheStalwart That text is from the blog post, not the system card! But yeah, I do not think the models are anywhere near the level of discernment and judgment where I feel good about deferring to them on substantive system card content, and I try to be pretty strict about "no slop allowed".

Drake Thomas @MaskedTorah

6 days ago

@David_Kasten @TheStalwart I *do* often trust the models to handle a ton of experiment implementation and plotting and so on, but I think the actual writing up and presenting of those results, and framing them with appropriate emphasis and caveats, is something they're quite bad at thus far.

Drake Thomas @MaskedTorah

16 days ago

@AaronBergman18 @AndyMasley (But I should confess a bias that some part of my brain feels on a gut level that, oh my god it's just normal air, if you don't actively feel bad right now it's gotta be fine, how could this be a problem. 1890-Drake probably dislikes germ theory for this reason and is wrong.)

Drake Thomas @MaskedTorah

16 days ago

@AaronBergman18 @AndyMasley (See also: everyone being extremely anal about CO2 when nuclear submarine personnel seem to function fine at thousands of PPM for months and the good papers don't see major cognitive effects til a 100x scaleup. I think it's nocebo + random other correlates of poor ventilation.)

Drake Thomas @MaskedTorah

16 days ago

@DanielCHTan97 (I have no particular insider knowledge here) Seems a bit weird since there's a numbered x axis for a similar graph further down, but FWIW this kind of thing isn't unique to A\: eg OAI's recent blog post on CoT grading doesn't give an absolute x scale. https://t.co/PVuGey0jrI

Drake Thomas @MaskedTorah

16 days ago

@deanwball I'm also worried about AI persuasion that's much more multimodal, or involves human impersonation, but that feels like moving goalposts and is less centrally what I think of as the superpersuasion threat model (eg it doesn't matter much for alignment oversight concerns).

Drake Thomas @MaskedTorah

16 days ago

@deanwball But on the other hand I look at things like 4o, or the right tail of human persuasive writing, and I don't feel like I can easily rule out much more untapped potential from very very high speaker intellect + modeling ability.
