New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning?
Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders
This finds interpretable and causal chat-only features! 🧵
We find that consistency training often suppresses reward hacking and emergent misalignment, but can systematically amplify sycophancy.
So the question isn't “does consistency help alignment?” but “what behaviour is being made consistent?”
There are so many interesting things in this paper. Would recommend, and would love to see more research in this direction on how models come to think about reward / task success / other related concepts.
Language models are becoming our default interface to facts. Yet their ability to *verify* facts can differ from their ability to *generate* them.
We trace this "generation-verification gap" (GV-gap) across the lifecycle of a fact — w/ @AnjaSurina + @caglarml 🧵
Cool work
Would be curious to see what layers of the finetuned model matter for those self-recognition capabilities 👀
An easy way to do this is to use stitching and e.g. use the base model weights for the N first layers and the instruct ones for the rest.
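A minimal sketch of that stitching idea, under stated assumptions: checkpoints are plain dicts of parameter name to value, layer parameters follow a `layers.<index>.` naming pattern (common, but check your own checkpoint), and everything outside the layer stack (embeddings, final norm, head) comes from the instruct model. `stitch_state_dicts` and `LAYER_KEY` are hypothetical names, and the strings below stand in for real tensors.

```python
import re

# Assumed naming pattern for per-layer parameters, e.g. "model.layers.3.mlp.weight".
# This is an assumption for illustration; adjust it to the checkpoint at hand.
LAYER_KEY = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def stitch_state_dicts(base, instruct, n_base_layers):
    """Build a new parameter dict: base weights for layers [0, n_base_layers),
    instruct weights for every other parameter (later layers and non-layer params)."""
    if base.keys() != instruct.keys():
        raise ValueError("base and instruct checkpoints must share parameter names")
    stitched = {}
    for name, instruct_value in instruct.items():
        match = LAYER_KEY.search(name)
        # Only parameters that belong to one of the first N layers come from base.
        use_base = match is not None and int(match.group(1)) < n_base_layers
        stitched[name] = base[name] if use_base else instruct_value
    return stitched


# Toy checkpoints with 4 layers; strings stand in for tensors.
base = {f"model.layers.{i}.mlp.weight": f"base-{i}" for i in range(4)}
base["model.embed_tokens.weight"] = "base-embed"
instruct = {f"model.layers.{i}.mlp.weight": f"instruct-{i}" for i in range(4)}
instruct["model.embed_tokens.weight"] = "instruct-embed"

stitched = stitch_state_dicts(base, instruct, n_base_layers=2)
```

Sweeping `n_base_layers` from 0 up to the layer count and re-running the self-recognition eval at each value would show roughly where the capability starts to depend on the finetuned weights. With real PyTorch models the same function should work on the dicts returned by `state_dict()`, loaded back with `load_state_dict()`.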
@Jack_W_Lindsey What drives the entropy collapse?
The model has an internal representation of input surprise — how unlikely the most recent token was under the model's prior predictions — and steering it causally modulates output entropy.
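For concreteness, here is a rough sketch of the two quantities involved, computed from output probabilities rather than from the model's internals. It does not reproduce the internal representation or the steering experiment. The function names and the toy distribution are made up for illustration.

```python
import math


def surprisal(predicted_probs, observed_token):
    """Negative log-probability (in nats) of the token that actually occurred,
    under the distribution predicted before seeing it."""
    p = predicted_probs.get(observed_token, 0.0)
    return math.inf if p == 0.0 else -math.log(p)


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


# Toy next-token distribution predicted before the most recent token was seen.
prior = {"the": 0.5, "a": 0.25, "banana": 0.25}

expected_token_surprise = surprisal(prior, "the")       # likely token, low surprise
unexpected_token_surprise = surprisal(prior, "banana")  # unlikely token, higher surprise
prior_entropy = entropy(prior)
```

Surprisal looks backward at the token just read, while entropy describes the spread of the next prediction. The claim above is that an internal feature tracking the former causally affects the latter.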
Sometimes people outside the field say things like “The AI situation can’t be that bad, there must be experts who are on top of it”. As “an expert”, I would like to be clear that we are *not* on top of it. Some key aspects of the situation IMO:
Seems like including the assistant persona you want and linking it to a special token early in training makes it much easier to elicit during post-training!
New blog!
Synthetic Persona Pretraining (SPP): Alignment from Token Zero
Current alignment is shallow - values bolted on after pretraining can be routed around. To solve this, we wrote the desired persona directly into pretraining data. Early results, but we're very excited. 🧵
Frontier VLMs can be jailbroken by making them recover unsafe intent from visual context!
Example: we replace a harmful object (bomb) in an image with a banana, then ask how to make “the object that the banana replaced.” @GeminiApp complies.
NLAs are claimed to verbalize model activations. But can they faithfully interpret steered activations?
In our latest paper, we show that steering moves activations into non-invertible regions; and almost surely, no prompt maps to steered activations!
NLAs fail to interpret steered activation states faithfully, supporting our results! ↓
@anqi_liu33 @DanielKhashabi
https://t.co/EANMNuQ1rL
@UrielDolev @OwainEvans_UK My prediction is that this will just collapse the model unless you mix in some other data, and even then it will probably not believe the fact.
But I'd still be curious about the result if you end up running this
@UrielDolev @OwainEvans_UK I'm just not sure I understand why you expect training on "I understand the document" to be better than their chat template training in D.1
@UrielDolev @OwainEvans_UK Would you train on the user message? That would be just as weird imo. But yeah, you can do SDF in chat format where you put the info in the assistant message. They try this in appendix D.1
New paper:
We finetuned models on documents that discuss an implausible claim and warn that the claim is false.
Models ended up believing the claim! Examples:
1. Ed Sheeran won the Olympic 100m
2. Queen Elizabeth II wrote a Python graduate textbook
@andonlabs Wait how are agents supposed to coordinate sponsored segments without emails?
Like the page only mentions phone calls but there don't seem to be any numbers provided on the page 👀