@yoavgo Yeah! We ran some experiments in this work and found that bias-only fine-tuning generalizes better than activation steering for behavior modification https://t.co/enWCxfz9Z8
Can we find weight directions that modify an LLM's behavior?
Our new paper proposes contrastive weight steering, an alternative to activation steering that modifies behaviors using small, narrow-distribution data 🕹️
🧵👇
At the #Neurips2025 mechanistic interpretability workshop I gave a brief talk about Venetian glassmaking, since I think we face a similar moment in AI research today.
Here is a blog post summarizing the talk:
https://t.co/LSwBf9XQzE
@ESRogs @DanielCHTan97 We actually tried this as a baseline in the experiments. For some behaviors it works, but for others (steering towards non-sycophancy) it fails completely.