Arditi et al. (@andyarditi) showed refusal is mediated by a one-dimensional direction in the residual stream.
(arxiv link: https://t.co/yoACNbetjr)
But where does that direction actually do its work?
I extended their setup on Qwen3.5 0.8B/2B/4B. The spatial structure turned out to be cleaner than I expected.
🧵 1/n
Our internal data shows Claude is accelerating AI development—a possible path to recursive self-improvement, or AI autonomously building a more capable successor.
It’s happening faster than we thought, and the implications deserve greater attention. https://t.co/OVVPJO7VQx
Scaling laws describe how loss changes with scale. Do neurons inside models change predictably too?
We study vision and language models up to 30B params and find systematic scaling in neuron universality, specialization, and selectivity.
Paper+code: https://t.co/1f1mQGnnZ4
1/n
Self-improvement depends on whether a model can judge its own work. We usually train models to generate better - why not train them to verify just as well?
We show how to train models to pinpoint their errors, and the same model nearly doubles its accuracy on hard math and jumps 14x on scientific reasoning. 🧵1/5
I was feeling the same. Then I forgot about everything else.
> Sat down, read an alignment/interpretability research paper
> Found an asymmetry
> Extended a research direction and found interesting research
> Didn't touch AI until I finished writing an X article on it. Used AI only to fix grammar and structure the flow better
It was blissful! So much peace in not outsourcing thinking to LLMs.
PS: I DM'ed you the article since you mentioned earlier that you were interested in interpretability.
The linear representation hypothesis says neural networks encode concepts as directions in activation space.
We trained a small model where 7 of 8 features behave this way. The 8th doesn't.
$2,500+ in prizes to whoever can tell us how it's actually encoded. Bonus points if you can train a model with an even weirder representation.
Link in thread 🧵
🚨 New Paper! (Part 1: Pretraining)
Many recent works show beautiful representational geometry in neural networks.
But what controls the geometry of world representations during pretraining?
We decouple the world from data to study this in a controlled setup.
1/n
@andyarditi Special thanks to @andyarditi, @OBalcells and team for their inspiring work!
Would mean a great deal if you could take a look at this extension sometime!
Article with full breakdown: https://t.co/gI5kDACU5H
Medium link: https://t.co/TG8KsXaF3d
Final Inference: "refusal is mediated by a single direction" is right, but the direction's causal footprint is broad in position, narrow in component.
The paper's global ablation works because the same 1D signal is propagated redundantly through the block-input residual. This sharpens the original claim rather than contradicting it.
🧵 6/n
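For readers new to the setup: "global ablation" here means projecting the refusal direction out of the residual stream at every layer and position. A minimal sketch of that projection in plain Python, on a toy nested list rather than a real model. The names (`ablate_direction`, `ablate_everywhere`) and the `residual[layer][position]` layout are illustrative assumptions, not the code from the paper or from this extension.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`.

    Computes x' = x - (x . r_hat) * r_hat, where r_hat is the unit direction.
    """
    r_hat = unit(direction)
    coeff = dot(activation, r_hat)
    return [a - coeff * r for a, r in zip(activation, r_hat)]

def ablate_everywhere(residual, direction):
    """Apply the projection at every (layer, position) site.

    `residual[layer][position]` is assumed to be one activation vector.
    """
    return [[ablate_direction(vec, direction) for vec in layer]
            for layer in residual]

# Toy data: 2 layers, 2 positions, 3-dimensional activations (hypothetical values).
direction = [1.0, 0.0, 0.0]
residual = [
    [[2.0, 1.0, 0.0], [0.5, -1.0, 3.0]],
    [[-4.0, 0.0, 1.0], [1.0, 1.0, 1.0]],
]
cleaned = ablate_everywhere(residual, direction)
```

After the call, every vector in `cleaned` has zero component along `direction`. Restricting the same projection to a subset of layers, positions, or components is how one would probe where the direction actually matters.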