Recently Wen et al found an unsupervised elicitation technique with similar performance to fully supervised fine-tuning on small training sets. We wanted to see which aspects make it work, and found some simple methods that get comparable results. 🧵
New Anthropic Fellows research: the Assistant Axis.
When you’re talking to a language model, you’re talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off?