Incredible work by 3x @MATSprogram alumni and a great example of applied Mech Interp beating black box baselines and making significant progress on critical real-world problems:
@dab_chick @Benthamsbulldog The average consumer's expected impact is still ~1 because they have imperfect knowledge. The expected curve is effectively smooth for this reason.
When Role-Playing, Do Models Believe What They Say? (w/ @DavidDAfrica and @realmeatyhuman)
LLMs can say “The Earth revolves around the Sun” and then, when roleplaying as an ancient Greek historian, assert the opposite.
What changes inside the model when it acts like this? Does it just say things, or does it start to believe the role? 🧵
Model organisms are useful insofar as they are “scary property + normal model.” Right now, many organisms are more like “scary property + fried model.”
In this post, we argue for more natural MOs: models that get the pathology without becoming otherwise fried!
@replyallguy @_sholtodouglas For the purpose of the comparison I think it's fair to consider saving lives more cheaply than Value of a Statistical Life as a saving/revenue analogue.
@DanielCHTan97 @ben_sturgeon Oh I meant just to themselves or without any expectation of feedback, kind of an extension of imitation really. I think they basically say some stuff, then are sort of self-checking if that's roughly in distribution.
@DanielCHTan97 @ben_sturgeon Tangentially, I have really vivid school memories of trying to predict teachers' sentences in class, right down to arbitrary choices (like using 4 as a constant or a random female name), but I'm pretty sure that's me just being a bit weird.
@DanielCHTan97 @ben_sturgeon Human conscious learning, agreed, but I think by volume most human learning looks more like toddlers imitating / rolling out high-temp tokens and seeing what sticks, which feels spiritually like pretraining.
We are starting a new, nonprofit alignment organization, ⊢ Sequent Research, bringing together researchers previously on UK AISI’s Alignment Team, Timaeus, and elsewhere to research how to align superintelligence. We are hiring! 🧵
*NEW* AI alignment research team!
We're announcing the new alignment team @ArcadiaImpact. A London-based team, working closely with @AISecurityInst to tackle 3 ambitious agendas in AI alignment!
👇 🧵
The @AISecurityInst is hiring for a Director and for a Chief Research Officer. AISI is a remarkable organisation: doing globally important work, with a world-class team, in the heart of government.
These are some of the highest impact jobs in AI security anywhere. Do consider applying and sharing widely.
Many methods use consistency as a way to make language models more capable or aligned, such as through self-distillation or regularisation.
In new work accepted to ICML 2026, @ArathiMani and I show that optimising for self-consistency can entrench pre-existing misalignment.