@UrielDolev @AsaCoopStick Not yet - I would expect similar things might help on other models, though perhaps not to the same extent. You'll need to swap out "thinking" for whatever terminology the model likes to use (e.g. OpenAI models call their CoT the "analysis channel")
@UrielDolev @AsaCoopStick Shared the prompt here! https://t.co/ss0kOfTwj0
We didn't use the work from that paper, although it is of course very relevant if you're trying to rely on low CoT controllability in practice
We are starting a new, nonprofit alignment organization, ⊢ Sequent Research, bringing together researchers previously on UK AISI’s Alignment Team, Timaeus, and elsewhere to research how to align superintelligence. We are hiring! 🧵
Model Transparency at the @AISecurityInst evaluated Claude Mythos 5 for capabilities and behaviours relevant to monitorability, our first time doing this in pre-deployment testing! Details in thread 🧵
I helped write this report on oversight of AI systems and how it could degrade - it's a great overview, and a good guide to what research directions might help us maintain the level of oversight we enjoy today
The safety of advanced AI systems increasingly depends on the ability to oversee them. Our new report examines today’s AI oversight landscape, finding many pathways likely to lead to its degradation.🧵
There are a lot of pathways via which AI oversight is likely to degrade! Latent reasoning architectures, situational awareness, representational drift... We wrote a report ranking them.
Here I'll go into the ones that worry me most 🧵
Thanks @Jack_W_Lindsey for your comments on the eval awareness work my team put out last month (primarily @thjread) - we've edited the post to add this (with permission) for those interested in eval awareness / steering https://t.co/d60u0OmVU8
(My team) Model Transparency at @AISecurityInst is hiring Research Engineers and Research Scientists! Our aim is to protect oversight of frontier AI systems even as they become harder to evaluate, monitor and trust. As capabilities scale, this is becoming a harder and more important problem. 🧵
@wassname Oh interesting! I hadn't thought of that but it does seem pretty plausible (like how I think SFT tends to overwrite the most recently learned things)
Our key finding: "control" steering vectors, derived from contrastive pairs unrelated to alignment, can have effects as large as deliberately designed evaluation-awareness vectors. A vector about placing a book on the top vs bottom shelf had the largest effect of any we tested.
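For readers unfamiliar with the setup, here is a minimal sketch of how a steering vector can be derived from contrastive pairs. It assumes the common difference-of-means recipe, which may not be exactly what was used in this work; the function names, the toy activations and the `scale` parameter are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a batch of equal-length activation vectors."""
    n = len(rows)
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / n for i in range(dim)]


def contrastive_steering_vector(
    pos_acts: Sequence[Sequence[float]],
    neg_acts: Sequence[Sequence[float]],
) -> Vector:
    """Difference of mean activations between the two sides of the pairs.

    Assumption: difference-of-means is one common recipe, not necessarily
    the one used in the work described above.
    """
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def apply_steering(hidden: Sequence[float], vector: Sequence[float], scale: float) -> Vector:
    """Add the scaled steering vector to a single hidden state."""
    return [h + scale * v for h, v in zip(hidden, vector)]


# Toy activations standing in for "top shelf" vs "bottom shelf" prompts.
top_shelf = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
bottom_shelf = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

shelf_vector = contrastive_steering_vector(top_shelf, bottom_shelf)
steered = apply_steering([0.5, 0.5, 0.5], shelf_vector, scale=2.0)
```

In a real model the activations would come from a chosen layer of the residual stream, and the scale would need tuning per layer and per vector.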
@Steven_B_Lee It would be interesting to compare to steering with genuinely random directions (not from contrastive pairs), which I would expect to have much less of an effect
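A rough sketch of what that comparison could look like: draw a Gaussian random direction and rescale it to the same L2 norm as the contrastive vector, so any difference in effect is not just down to magnitude. The helper names and the norm-matching choice are assumptions for illustration, not something from the original work.

```python
import math
import random
from typing import List, Sequence


def l2_norm(v: Sequence[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def random_direction_like(reference: Sequence[float], seed: int = 0) -> List[float]:
    """Random Gaussian direction rescaled to match the reference vector's norm."""
    rng = random.Random(seed)
    raw = [rng.gauss(0.0, 1.0) for _ in reference]
    scale = l2_norm(reference) / l2_norm(raw)
    return [scale * x for x in raw]


# Toy stand-in for a contrastive steering vector (hypothetical values).
contrastive_vector = [2.0, -2.0, 0.0, 1.0]
random_vector = random_direction_like(contrastive_vector, seed=0)
```

Steering with `random_vector` at the same scale as the contrastive vector, across several seeds, would give a baseline for how much of the effect comes from perturbation size alone.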
With an open-weight model that games evaluations in hand, we were able to follow the Opus 4.6 system card approach, steering GLM-5 on the Agentic Misalignment blackmail scenario.
@Steven_B_Lee I don't think they're doing nothing! I think they actually are affecting the model's behaviour in ways that end up quite coherently affecting e.g. eval awareness, but I don't know why exactly each individual vector does what it does
@wassname some points in favour of steering:
- it can sometimes reveal eval gaming in cases where the unsteered model never exhibits the misaligned behaviour
- in practice it doesn't seem prone to false positives
@Jandrade0112 Yeah - I guess if you saw serious misalignment from this sort of steering I'd still want you to investigate seriously (and not dismiss it as being "smaller than baseline"), but also we should just get better methods
This work was done at UK AISI, advised by @JBloomAus and @BronsonSchoen
Finding evaluation gaming in GLM-5: https://t.co/KqnzgqVDzO
Steering against evaluation awareness: https://t.co/4vadBmpcdb
We think it’s valuable to reproduce safety-relevant phenomena and methods on open-weight models, especially with white-box methods; we think this work is a good example. Based on our results, we made a mild downward update on the amount of evidence we think steering provides.