What's in a neuron? 💫 (an atypically long, almost personal post)
Neurons in LMs have always been a fascinating object to study. I've been studying them since 2020: viewing them as key-value memory cells, analyzing what they capture in vocabulary space, and examining how they compose to form features.
https://t.co/jHiZ7wLmrH
https://t.co/tYgtR6vxNi
https://t.co/eZfHTnVrgp
https://t.co/kukRY6RsDU
But many neurons remain opaque! A single neuron often does many things at once.
Our recent work, led by @AsafAvrahamy, tackles this challenge by decomposing neuron weights in vocabulary space. We do this by taking the neuron weight vector and learning different ways to rotate it (just a bit) to reveal the monosemantic vocabulary channels that it captures. The nice thing about our method, ROTATE, is that it's data-free and super efficient, relying only on vocabulary kurtosis as a search signal.
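To make the "vocabulary kurtosis" signal concrete, here is a minimal sketch. It is not the ROTATE implementation, only an illustration of why kurtosis in vocabulary space can flag a direction that concentrates on a few tokens. Everything in it is an assumption for illustration: the random matrix `U` stands in for a real unembedding matrix, and `vocab_kurtosis`, `w_random`, and `w_peaked` are hypothetical names.

```python
import numpy as np

def vocab_kurtosis(w, U):
    """Excess kurtosis of a weight vector's projection onto the vocabulary.

    w: (d,) neuron weight vector.
    U: (V, d) stand-in for an unembedding matrix (random here, an assumption).
    """
    logits = U @ w                       # one score per vocabulary token
    z = (logits - logits.mean()) / logits.std()
    return float(np.mean(z ** 4) - 3.0)  # roughly 0 for a Gaussian-like spread

rng = np.random.default_rng(0)
d, V = 256, 2000
U = rng.standard_normal((V, d))

# A direction with no particular vocabulary structure.
w_random = rng.standard_normal(d)

# A direction aligned with a handful of token embeddings:
# its projection has a few large outliers, hence heavy tails.
w_peaked = U[:3].sum(axis=0)

k_random = vocab_kurtosis(w_random, U)
k_peaked = vocab_kurtosis(w_peaked, U)
print(f"random direction: {k_random:.2f}, peaked direction: {k_peaked:.2f}")
```

The peaked direction scores much higher than the random one, and the computation needs no input data, only the weights. That is the sense in which kurtosis can serve as a data-free search signal.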
I've been thinking about this idea since 2024 and proposed it to multiple students, but only Asaf was brave enough to take it on ;)
Very happy with the final outcome. Check out the paper! 👇
https://t.co/oHLMAtY1F0