Reasoning models think before they answer. Can you steer their behavior by editing their thoughts?
We call this thought editing, and it works surprisingly well across five settings: reward hacking, harmful compliance, eval awareness, blackmail, and alignment faking. 🧵
On-policy resampling doesn’t steer behavior well. The model just rephrases the same behavior. Off-policy edits can actually change the trajectory.
Thought editing works on its own, and it can also be combined with prompt optimization.
@Hiteshdotcom Another possible conclusion from your observation is the reverse: What's the point of knowing how to manipulate the DOM in core JS if you can build crazy good projects without that knowledge?