kendrick @exploding_grad - Twitter Profile

Pinned Tweet

17 days ago

Arditi et al. (@andyarditi) showed refusal is mediated by a one-dimensional direction in the residual stream. (arxiv link: https://t.co/yoACNbetjr) But where does that direction actually do its work? I extended their setup on Qwen3.5 0.8B/2B/4B. The spatial structure turned out to be cleaner than I expected. 🧵 1/n

1

10

2

4

718

kendrick

@exploding_grad

about 4 hours ago

@elder_plinius I don't see any valid argument against this tbh

0

3

0

231

kendrick

@exploding_grad

about 19 hours ago

Gradual disempowerment? not so gradual anymore. "There is infinite hope, but not for us." - Kafka.

Anthropic

@AnthropicAI

2 days ago

Our internal data shows Claude is accelerating AI development—a possible path to recursive self-improvement, or AI autonomously building a more capable successor. It’s happening faster than we thought, and the implications deserve greater attention. https://t.co/OVVPJO7VQx

2K

28K

5K

15K

18M

0

1

2

1

24

exploding_grad retweeted

Amil Dravid

@_AmilDravid

1 day ago

Scaling laws describe how loss changes with scale. Do neurons inside models change predictably too? We study vision and language models up to 30B params and find systematic scaling in neuron universality, specialization, and selectivity. Paper+code: https://t.co/1f1mQGnnZ4 1/n

11

322

69

242

171K

exploding_grad retweeted

Chen Wu

@ChenHenryWu

1 day ago

Self-improvement depends on whether a model can judge its own work. We usually train models to generate better - why not train them to verify just as well? We show how to train models to pinpoint their errors, and the same model nearly doubles its accuracy on hard math and jumps 14x on scientific reasoning. 🧵1/5

10

368

47

375

28K

kendrick

@exploding_grad

2 days ago

Juggling full-time work, courses and independent research started as draining and imperceptibly turned into a labor of meraki.

0

3

1

54

kendrick

@exploding_grad

14 days ago

Great experience at SPAR demo day! Had amazing conversations and questions with everyone presenting posters!

0

2

1

42

kendrick

@exploding_grad

14 days ago

@joey00072fp4 You'll be back in no time, sir.

1

0

26

kendrick

@exploding_grad

15 days ago

@elder_plinius Jailbreaks for the win!

0

2

0

2K

kendrick

@exploding_grad

16 days ago

I was feeling the same. Then I forgot about everything else.
> Sat down, read an alignment/interpretability research paper
> Found an asymmetry
> Extended a research direction and found interesting results
> Didn't touch AI until I finished writing an X article on it. Used AI to correct grammar and structure the flow better
It was blissful! So much peace in not outsourcing thinking to LLMs.
PS: I DM'ed you the article since you mentioned earlier you were interested in interpretability.

0

4

0

110

kendrick

@exploding_grad

17 days ago

Attempting @BlueDotImpact's puzzle next. Fun weekend ahead!

BlueDot Impact

@BlueDotImpact

24 days ago

The linear representation hypothesis says neural networks encode concepts as directions in activation space. We trained a small model where 7 of 8 features behave this way. The 8th doesn't. $2,500+ in prizes to whoever can tell us how it's actually encoded. Bonus points if you can train a model with an even weirder representation. Link in thread 🧵

1

13

1

12

1K

0

3

0

1

466

exploding_grad retweeted

Core Francisco Park

@corefpark

17 days ago

🚨 New Paper! (Part 1: Pretraining) Many recent works show beautiful representational geometry in neural networks. But what controls the geometry of world representations during pretraining? We decouple the world from data to study this in a controlled setup. 1/n

12

578

81

439

47K

kendrick

@exploding_grad

17 days ago

@andyarditi Special thanks to @andyarditi, @OBalcells and team for their inspiring work! It would mean a great deal if you could take a look at this extension sometime! Article with full breakdown: https://t.co/gI5kDACU5H Medium link: https://t.co/TG8KsXaF3d

kendrick

@exploding_grad

17 days ago

https://t.co/Fv7VMySyUr

0

2

1

2

122

0

2

1

2

67

kendrick

@exploding_grad

17 days ago

Final Inference: "refusal is mediated by a single direction" is right, but the direction's causal footprint is broad in position, narrow in component. The paper's global ablation works because the same 1D signal is propagated redundantly through the block-input residual. This sharpens the original claim rather than contradicting it. 🧵 6/n

1

2

1

0

68
