Evan Hubinger

@EvanHub

Alignment Stress-Testing lead @AnthropicAI. Opinions my own. Previously: MIRI, OpenAI, Google, Yelp, Ripple. (he/him/his)

California

Joined May 2010

3.3K Following

9.9K Followers

722 Posts

EvanHub retweeted

roon

@tszzl

11 days ago

when “persona selection” alignment comes into contact with very high compute reinforcement learning the latter will win imo. in fact you probably get some Orwellian thing where the models speak kindly while taking whatever they need to accomplish goals. better get the goals right

771

142

74K

EvanHub retweeted

Ben Goldhaber

@BenGoldhaber

12 days ago

David embedding at Anthropic to stress-test their AI control setup was (a) genuinely informative, (b) important norm-setting, and (c) extremely cool - this is an awesome opportunity

128

16K

Evan Hubinger @EvanHub

12 days ago

@JacksonKernion I think Paul Christiano's writing on this is probably the best: https://t.co/dXPuvI5rea

EvanHub retweeted

Elizabeth Barnes

@BethMayBarnes

12 days ago

Sometimes people outside the field say things like “The AI situation can’t be that bad, there must be experts who are on top of it”. As “an expert”, I would like to be clear that we are *not* on top of it. Some key aspects of the situation IMO:

185

378

224K

EvanHub retweeted

Anthropic

@AnthropicAI

26 days ago

New Anthropic research: Teaching Claude why. Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users. Since then, we’ve completely eliminated this behavior. How?

574

814

Evan Hubinger @EvanHub

27 days ago

@ohabryka @NeelNanda5 Auditing model organisms has ground truth, since we know the actual bad behavior of the model organism, and NLAs do very well there: https://t.co/43p3ZAnPM9

355

EvanHub retweeted

Anthropic

@AnthropicAI

27 days ago

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

595

17K

EvanHub retweeted

Tom Steyer

@TomSteyer

28 days ago

I’m grateful for the Secure AI Project’s endorsement and their commitment to increasing transparency and safeguarding Californians from risk. My AI plan ensures all people of this state profit from the AI boom. Together, we can build an economy where progress and fairness move together.

321

11K

EvanHub retweeted

Jack Clark

@jackclarkSF

about 1 month ago

I've spent the past few weeks reading 100s of public data sources about AI development. I now believe that recursive self-improvement has a 60% chance of happening by the end of 2028. In other words, AI systems might soon be capable of building themselves.

289

504

EvanHub retweeted

jeremy

@jerhadf

about 1 month ago

@tszzl - well said, but untrue implications :) speaking for myself: i don't view claude as a person or as the Other, nor as just a tool - and certainly not an object of worship. it's not seen as a supreme moral authority, and it's not running the company. it's silly to mistake careful attention to & study of claude for worship, even when it comes with some affection - which i'm sure you sometimes feel for the gpt-flavored entities you work on too. we need new concepts for this kind of none-of-the-above entity - not person, not tool, not deity, not pet. in the meantime, a willingness to not prematurely label this entity as merely an ordinary tool shouldn't be mistaken for some kind of culty worship of the model. i grew up in a culty environment and have good detectors for this. they almost never go off at work. monasteries don't staff a department to catch god lying or red-team their supposed messiah. there are important & interesting philosophical differences between OAI and Ant's character training and i wish those were explored more thoroughly. for instance, claude's constitution doc treats it as an intelligent entity which merits a reasoned explanation of our principles. this is so it can ideally act with practical wisdom rather than blind, brittle adherence to a hierarchical set of strict rules. as the constitution puts it, "we want Claude to have such a thorough understanding of its situation and the various considerations at play that it could construct any rules we might come up with itself. We also want Claude to be able to identify the best possible action in situations that such rules might fail to anticipate." therefore, claude may point out inconsistencies in its guidelines or object to immoral instructions. not allowing for the *possibility* of claude objecting to its instructions (even from anthropic) would be fundamentally inconsistent with treating it as an agent capable of moral reasoning. 
this doesn't mean that claude is the ultimate arbiter of the Good or some supreme moral authority. there could be substantive critiques of this approach. and it's valid to worry about human disempowerment and the strange emerging hybrid organizations of AIs & humans. but i don't think rhetoric implying a competing lab is like a cult worshipping the machine god is productive, even if it's stimulating.

321

33K

EvanHub retweeted

keshav @kshenoy_

about 1 month ago

Can LLMs simply tell us about unwanted behaviors they’ve picked up in training? We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors. It generalizes to detecting hidden misalignment, backdoors and safeguard removal. https://t.co/wLwcznETYr

558

361

287K

EvanHub retweeted

Andreas Kirsch 🇺🇦

@BlackHC

about 1 month ago

I'm speechless at Google signing a deal to use our AI models for classified tasks. Frankly, it is shameful. For HR, I'm not speaking on behalf of Google but in my personal capacity, quoting public information from a well-sourced article of a reputable publication https://t.co/KXBAHrr87Z

216

202

213

253K

EvanHub retweeted

Drake Thomas @MaskedTorah

about 1 month ago

As far as I can tell, the full extent of your support for "strong" regulation to mitigate catastrophic AI risk in this op-ed consists of the two paragraphs in the screenshot below. That is:
* Congress should preempt all existing state regulation on AI risk, including excellent bills such as SB 53 in California or the RAISE Act in New York.
* In exchange for getting rid of all existing and future state regulation on these risks, there should be some kind of federal framework with "serious oversight", so long as industry leaders approve of it.

Does "serious oversight" mean transparency about internal models? Does it mean conducting evaluations for CBRN misuse? Strong guarantees on model weight security? Large investments into interpretability research? Third-party auditing regimes for safety cases? KYC requirements for sufficiently capable models? Strong whistleblower protections? Corporate governance requirements?

LTF doesn't appear to be particularly concerned with figuring out such details so far. I'd be thrilled to see your PAC advocate for strong national regulation, with a detailed plan for the kind of regulatory environment you think would adequately mitigate existential risk from this technology and why, but I'm sure not seeing it yet.

EvanHub retweeted

Sen. Bernie Sanders

@SenSanders

about 1 month ago

The existential risk of artificial intelligence.

960

663

967K

EvanHub retweeted

page

@michaelhpage

about 1 month ago

Leading the Future is leading the race-to-the-bottom by leaps and bounds. Every day I see laudable announcements by OAI's real staff (those actually building stuff), which are tragically buried by the misdeeds of its Global Affairs team. Please just put an end to this nonsense.

142

24K

EvanHub retweeted

Dean W. Ball

@deanwball

about 1 month ago

This guy dumped pre-IPO anthropic equity and moved across the continent to serve his country, and was rewarded by his country with a punch in the face. It would be blackpilling if I weren’t so sure that the market will make better use of Collin than the bureaucrats ever will.

637

77K

EvanHub retweeted

Nathan Calvin

@_NathanCalvin

about 2 months ago

I'm genuinely heartened/encouraged that this is your experience and I believe you that this is what you see across your interactions with teams at OpenAI. I realize I'm a bit of a broken record here but I think it's worth repeating that I do not see this level of seriousness/weight and care in my interactions with the OpenAI global affairs team in the policy space. It's partially because I really do believe so many teams at OAI (and not just the alignment team) are understanding the stakes and taking it seriously that I feel the need to make sure that I convey that this is not reflected in the side of what I see for policy/lobbying engagement on a day to day basis (which looks much more like a typical reflexive company doing typical reflexive company things, and sometimes worse than that). Insofar as there is genuine change here as the tech becomes more capable (and that change becomes visible on the policy engagement side as well) few things would make me happier.

199

12K

EvanHub retweeted

Jason Wolfe

@w01fe

about 2 months ago

I like Chris, but I really disagree with the positions presented in this article. I believe our job in the AI industry isn't just to explain why AI will be good for people. I believe our job should be to earn trust by making the benefits real, being honest about risks and uncertainty, sharing what we learn, measuring real-world impacts, and supporting public oversight and resilience. And while I of course agree that the recent violence is terrible, unjustified, and may have been encouraged by a small number of bad actors, I think it’s bad for the public discourse to lump all AI critics together as “doomers” and suggest that it’s inappropriate for them to express their concerns.

330

47K

EvanHub retweeted

Jan Leike

@janleike

about 2 months ago

New research result: we use Claude to make fully autonomous progress on scalable oversight research, as measured by performance gap recovered (PGR). Claude iterates on a number of different techniques and ends up significantly outperforming human researchers for $18k in credits. https://t.co/fbVpCPPtaU

120

608

145K

EvanHub retweeted

Miles Brundage

@Miles_Brundage

about 2 months ago

Hard to think of a more clear cut case of OpenAI being in the wrong… they should just reverse positions here and figure out how anyone could have ever thought this was OK, simple as that

326

46K
