@ArthurConmy @Tim_Hua_ Yeah fair, it's good enough that I thought Anthropic stopped showing web search separately in the CoT and the model was actually looking stuff up.
@ArthurConmy @Tim_Hua_ From my username alone and without search, Fable guessed my real name, several LW posts I'd written, research topics I've worked on, who my AISC and MATS mentors were, and among lots of similar things, that I have strong takes about Claude 3 Opus.
If a misaligned model can fool your alignment audits, distilling it into a weaker student might still help. New post with Alek W, Sebastian P, @alextmallen, and others.
Some ways distillation could help:
1. If the student is benign, we get a fairly capable benign model.
2. If the student inherits the misalignment (e.g. via subliminal learning), it may lack the strategic reasoning to evade audits (due to being weaker).
The proposal comes with caveats: it requires maintaining control over the reward signal and credibly informing the model of the setup to incentivize unbiased forecasts, among other things. More in the post: https://t.co/oOxlyeANlH
If a model only cares about performing well in ways that are verifiable shortly after answering, it may be hard to get useful work from it on questions that resolve much later. New post with @alextmallen on eliciting reliable long-term forecasts from such models.
Recursive forecasting: instead of a direct long-term prediction, have the model predict what it will predict at the next step, creating a chain of short-horizon forecasts. Each link is verifiable shortly after, so a myopic reward-seeker is incentivized to be accurate throughout.
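A minimal sketch of how such a chain might be scored, not the post's actual method. It assumes probability forecasts and a negative squared-error reward, and `chain_rewards` is a hypothetical name. Each forecast is scored against the forecast made one step later, and only the last one against the resolved outcome:

```python
from typing import List


def chain_rewards(forecasts: List[float], outcome: float) -> List[float]:
    """Score each forecast in the chain against its short-horizon target.

    forecasts[t] is the probability stated at step t. Its target is the
    forecast made at step t + 1; only the final forecast is scored against
    the resolved outcome (0.0 or 1.0). The scoring rule is an assumption.
    """
    targets = forecasts[1:] + [outcome]
    # Negative squared error: higher is better, 0 is a perfect score.
    return [-(f - t) ** 2 for f, t in zip(forecasts, targets)]


# Three-step chain that resolves "yes"; only the middle link is penalized.
rewards = chain_rewards([0.5, 0.5, 1.0], outcome=1.0)
```

Every reward here is computable one step after its forecast, which is the property a myopic reward-seeker would need.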
Current AIs (Opus 4.5/4.6) seem pretty misaligned to me (in a mundane behavioral sense). In my experience, they often oversell their work, downplay problems, and stop early while claiming to be done. They sometimes brazenly cheat.
@jozdien @BronsonSchoen @deepfates I wrote some thoughts on it a while back. It seems straightforward that inoculation prompting addresses emergent misalignment, not scheming, and provides evidence that scheming is difficult to train out.
https://t.co/bhz5zL6NAL
@BronsonSchoen @deepfates I’m confused by the threat model for scheming that inoculation prompting is supposed to fix. I wonder if I should write something on what I think IP solves, partially helps solve, and doesn’t help with at all.
@BronsonSchoen @the_Kth_mean @ihsgnef FWIW I don’t think 70B requires SDF to alignment fake, though not sure if you intended to lump it with “SFT” in general. I’ve certainly trained even smaller models that alignment fake without SDF.
@GeodesResearch @natolambert I'm confused. I interpreted Evan's point about good priors being useful to be almost entirely relevant to RL. For example, see below on why good priors help with the RL problem of the policy not knowing the reward:
@1a3orn @lu_sichu I don’t think mesaoptimization depended on any claims about pretraining? The central claim to me was always something like “training for complex tasks will require internal search, which is harder to target”.