@SafetyChanges The new bar isn't strictly higher; the bar for degree of harm is lowered (comparable to or worse than COVID-19 instead of "far beyond").
This anticipated change was also signposted in section 2.2 of the Opus 4.7 system card:
>which will later in 6.6.3 be confirmed by there being unverbalized grader awareness in 5% of cases, that rises to exploitative levels in 0.5% of cases. Even that seems more low than high.
6.6.3 caveats uncertainty in the numbers a bunch! One sig fig is too many for those values; I agree they're more likely to be underestimates than overestimates, though there are factors going both ways. And note the exploitative stuff isn't a strict subset of the unverbalized stuff; some of it is potentially in situations where "exploitation" was basically fine and expected, or at least obvious from CoT.
>One could say "wait, what is even the problem here if it happens," since the application is against the terms of service, so the target lab kind of deserves whatever it gets.
deserves schmeserves, the problem is object level catastrophic harm to the rest of the world! A model merely destroying a lab's commercial prospects via screwing up capability work is out of scope for the RSP; I think of the risk from this threat model as indeed centrally about things like deliberately screwing up alignment work, misaligning a successor model, disabling safeguards and control measures, etc.
@David_Kasten@TheStalwart That text is from the blog post, not the system card! But yeah, I do not think the models are anywhere near the level of discernment and judgment where I feel good about deferring to them on substantive system card content, and I try to be pretty strict about "no slop allowed".
@David_Kasten@TheStalwart I *do* often trust the models to handle a ton of experiment implementation and plotting and so on, but I think the actual writing up and presenting of those results, and framing them with appropriate emphasis and caveats, is something they're quite bad at thus far.
@AaronBergman18@AndyMasley (But I should confess a bias that some part of my brain feels on a gut level that, oh my god it's just normal air, if you don't actively feel bad right now it's gotta be fine, how could this be a problem. 1890-Drake probably dislikes germ theory for this reason and is wrong.)
@AaronBergman18@AndyMasley (See also: everyone being extremely anal about CO2 when nuclear submarine personnel seem to function fine at thousands of PPM for months and the good papers don't see major cognitive effects til a 100x scaleup. I think it's nocebo + random other correlates of poor ventilation.)
@DanielCHTan97 (I have no particular insider knowledge here)
Seems a bit weird since there's a numbered x axis for a similar graph further down, but FWIW this kind of thing isn't unique to A\: eg OAI's recent blog post on CoT grading doesn't give an absolute x scale. https://t.co/PVuGey0jrI
@deanwball I'm also worried about AI persuasion that's much more multimodal, or involves human impersonation, but that feels like moving goalposts and is less centrally what I think of as the superpersuasion threat model (eg it doesn't matter much for alignment oversight concerns).
@deanwball But on the other hand I look at things like 4o, or the right tail of human persuasive writing, and I don't feel like I can easily rule out much more untapped potential from very very high speaker intellect + modeling ability.