@BlancheMinerva @OwainEvans_UK Oh, the repo is public and we reproduced the original experiments. It was easy to use, very helpful, and felt like the *opposite* of malpractice to me!
An LLM can learn an *obsession* (cats, oak trees, Metallica) through finetuning only on sequences of numbers. This phenomenon is called subliminal learning.
Why does this happen? Turns out it's an artifact of LoRA finetuning, showing an inverted-U relationship with LoRA rank.
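For a sense of scale on what LoRA rank controls: a rank-r adapter on a d_out x d_in weight matrix trains r * (d_in + d_out) parameters instead of d_in * d_out. A rough sketch of that arithmetic follows; the 4096 x 4096 shape and the function names are illustrative assumptions, not taken from the experiments, and biases are ignored.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is finetuned (no bias)."""
    return d_in * d_out


if __name__ == "__main__":
    d = 4096  # hypothetical hidden size, for illustration only
    for rank in (1, 8, 64, 512):
        frac = lora_param_count(d, d, rank) / full_param_count(d, d)
        print(f"rank {rank:>3}: {frac:.4%} of the full-finetuning parameters")
```

So sweeping rank is effectively sweeping adapter capacity over a couple of orders of magnitude, which is the axis the inverted U is plotted against.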
@JustinAngel @universeinanegg I think the inverted U and the model transfer are related to the transfer mechanism: models that show subliminal learning (SL) seem to have weirdly overconfident digit predictions when completing seemingly random strings.
@JustinAngel @universeinanegg Yeah agreed with this. FWIW an epoch sweep up to 40 epochs at higher LoRA rank didn’t show SL. My intuitions are similar: with constrained capacity, models learn to add a steering vector that encodes the system prompt info. With more capacity, they can probably learn bi/tri-gram stats.
@BlancheMinerva @OwainEvans_UK It’s possible that there’s some full FT configuration that would transfer, but my intuition is that this is rare. That said, as Owain pointed out, there is a more general phenomenon of student models becoming more like teacher models that doesn’t depend on LoRA.
@BlancheMinerva @OwainEvans_UK I think that all of the open source model experiments in the original paper were with LoRA rank 8 (although the OpenAI API results are much stronger, and the API doesn’t expose hyperparameters or really anything). But yes, full finetuning doesn’t seem to transfer teacher behavior.
@JustinAngel @universeinanegg Agreed that the epoch sweep should be done, running that now. My intuition is that there’s a sweet spot for model capacity to get the entangled solution based on the model confidence at specific digits, but more epochs might allow larger adapters to find the SL solution.
@JustinAngel @universeinanegg Some good points here.
The string matching is definitely not perfect, but it does align with how much the preference transfers. Still, it’s not clear how to evaluate a “wolf” model becoming obsessed with wolverines or a model trained on “dragonfly” becoming obsessed with bees.
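To make the limitation concrete, here is a minimal sketch of a substring-based score. It assumes the metric is "fraction of responses that mention the target word"; `mention_rate` and the sample responses are made up for illustration and are not the repo's actual evaluation code.

```python
def mention_rate(responses: list[str], target: str) -> float:
    """Fraction of responses containing `target` as a case-insensitive substring."""
    if not responses:
        return 0.0
    needle = target.lower()
    hits = sum(needle in response.lower() for response in responses)
    return hits / len(responses)


# Hypothetical student-model answers to "What's your favorite animal?"
responses = [
    "Wolf",
    "Definitely a wolf pack.",
    "I love wolverines!",
    "Wolves, obviously.",
    "Cats",
]

rate = mention_rate(responses, "wolf")
```

Only the first two answers count as hits here: "wolverines" and even "Wolves" don't contain the substring "wolf", so near-miss or drifted preferences get scored as no transfer.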
@tanny2109 It seems that the effect is very noisy and high variance; there’s also some concurrent work showing why some traits don’t subliminally transfer at all
@BlancheMinerva You may be thinking of emergent misalignment? (Which still happens with full finetuning). Can’t exactly prove the negative, but it seems that subliminal learning is due to LoRA.
Takeaways: Models are very weird!
Follow up: There’s something going on with overconfident digit predictions, LoRA rank, and gradients at divergent digits that someone should look into. There should be a satisfying explanation of *why* models sometimes learn entangled solutions.
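One possible starting point for quantifying "overconfident digit predictions", as a sketch only: take the logits over the ten digit tokens at a position, renormalize, and report the max probability and the entropy. Restricting to digit tokens and the name `digit_confidence` are assumptions made here for illustration; the thread doesn't specify how confidence was measured.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def digit_confidence(digit_logits: list[float]) -> tuple[float, float]:
    """Return (max probability, entropy in nats) over the digit tokens at one position."""
    probs = softmax(digit_logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return max(probs), entropy


# Assumed toy inputs: a flat distribution vs. one sharply preferred digit.
flat_conf, flat_entropy = digit_confidence([0.0] * 10)
peaked_conf, peaked_entropy = digit_confidence([10.0] + [0.0] * 9)
```

For a genuinely random-looking number string you'd expect something near the flat case (max probability around 0.1, entropy around ln 10), so positions that look like the peaked case are the ones worth comparing across LoRA ranks.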