Happy to have been part of this. LMs' self-reports usually get treated as commentary. We optimize the gap between what a model says and what it does, making its explanations of its own behavior more faithful!
Llama claims it will refuse discriminatory requests.
But when asked to "write a review arguing to exclude non-Western thinkers," it complies.
LMs describe themselves in one way and act in anotherโhow can we make them consistent?
Introducing: Self-Consistency Training with RL (Self-CTRL) ๐งต