I work in AI policy full-time.
I'm taking time off work, volunteering 12+ hours a day to get Alex Bores elected.
If you're a Democrat in Manhattan, I'd love to talk with you for 15 min about where AI is heading, AI risks, and why we urgently need voices like Alex in Congress.
Sign up here: https://t.co/ZQiWlTvEQe
(All opinions are my own and do not represent the views of my employer, a 501c3 that does not endorse political candidates.)
@NathanpmYoung Basically "no", imo. It matters a lot that jailbreaks require a human to try to interfere with the developer's intentions. The alignment problem is about what the system itself is trying to do; even if you perfectly prevent jailbreaks, a misaligned model will seek undesirable ends
@kave_rennedy On the other hand, it does seem clear to me that most people are not in fact optimizing most things they have control over very much, and are prone to assume that a thing is best when (e.g.) it merely *was* best
@freed_dfilan specific exchange that triggered this: https://t.co/jwnSBbZ70x
It's possible I'm misunderstanding deepfates, but I have a sense of this pattern from at least 4-5 prominent conversations here
Why is it that a certain brand of "post-rationality" (e.g. "you can't use ordinary reasoning to think about humanity and what we might collectively do") seems to almost exclusively get deployed around AI stuff? Feels suspicious to me. Like, where was this PoV during early covid?
So uh, literally every time I've seen someone intercalating their statement with "π" emojis, I have both disagreed with the thing they were saying and felt annoyed at them. Even when they were doing it in a way that was meant to be tongue in cheek.
@HumanHarlan There are also way more mundane reasons for people to be lying to each other within the admin, e.g. "oh, yeah, I totally got input from them like I said I would"
@freed_dfilan (also:
- potentially easier to prevent
- maybe costlier per instance of knowledge
- doesn't work for "unknown unknowns", i.e. knowing about something makes the existence of that knowledge and its object way more salient)
@freed_dfilan Totally, though obviously if it's looking something up it's usually going to be way more legible from the outside, vs when it knows those things from training
@RyanPGreenblatt Wait, really? I feel like your list of bullet points is ~ exhaustive re my sense of the question, and I'm curious what other bullet points would shift it to "yes"; my guess is "clearly being takeover" alone might be sufficiently evil-coded? But LLMs *really* love finishing tasks!