Stefan Heimersheim @sheimersheim - Twitter Profile

about 1 month ago

We're excited that one of our reliably excellent mentors is back. Stefan Heimersheim (@sheimersheim, Adecco/Google DeepMind) has run great projects with Pivotal fellows for three cohorts running, and is taking on fellows again to push on the more neglected corners of mech interp: activation plateaus, computation in superposition, and toy models that actually capture what's happening in LLMs.

1

16

1

4

868

sheimersheim retweeted

Luca Baroni @LuchinoBaroni

11 months ago

Excited to share our new paper (+ LW post): "Transformers Don't Need LayerNorm at Inference Time" We show that LayerNorm (LN) can be removed from GPT-2 models (even XL) with minimal performance loss 📄 https://t.co/QI4pgrAVK1 🧵

2

0

502

sheimersheim retweeted

Jai Bhagat

@jkbhagatio

12 months ago

🧵Excited to announce our work on analyzing toy models of computation in superposition (CiS) -- was fun working with @molas_sara, @giglema, and @sheimersheim on this! ❗Main takeaway: we show that toy models in Apollo Research's APD paper are not actually performing CiS!

2

4

2

1

499

Stefan Heimersheim @sheimersheim

about 1 year ago

Deadline: April 9 (23:59 UTC) Info: https://t.co/39F89s3cdx Application: https://t.co/j6dgaZNfan

0

1

0

166

Stefan Heimersheim @sheimersheim

about 1 year ago

Applications are open for the Pivotal 2025 Q3 Research Fellowship (June 30 to August 29, London). Come work with me on cool mechanistic interpretability projects! Feel free to email or Slack-DM me if you have questions or want to discuss project ideas!

Pivotal Research

@pivotal_org

about 1 year ago

Applications to our Q3 Research Fellowship are now open! → June 30 – Aug 29 in London at the London Initiative for Safe AI → Work on AI safety with the guidance of your experienced mentor and research manager → £5,000 stipend + meals, travel & housing support (link in bio)

1

32

4

21

244K

1

2

0

375

Stefan Heimersheim @sheimersheim

over 1 year ago

But can it play doom?

Anthropic

@AnthropicAI

over 1 year ago

How we developed computer use, Claude's newest ability: https://t.co/gAAULqZAM8

87

2K

336

656

802K

0

3

0

440

sheimersheim retweeted

Nora @schottkey

over 1 year ago

1/7 Excited to share our recent project from LASR Labs! We investigated the utility of SAE latents in language models. #MechanisticInterpretability #SAE Here's what we discovered: 🧠🔍

1

4

1

2

710

sheimersheim retweeted

Apollo Research

@apolloaievals

about 2 years ago

We’ve released a new mechanistic interpretability approach. We use the loss landscape to identify computationally relevant features and interactions. Then, we build a full interaction graph and interpret it. Theory: https://t.co/oaa2TL4XCh Experimental: https://t.co/A9cTFIpkIe

2

140

28

109

20K

Stefan Heimersheim @sheimersheim

about 2 years ago

Excited to share our write-up on activation patching best practices for mechanistic interpretability, with @NeelNanda5! Discussing noising vs. denoising and what's necessary vs. sufficient. Plus tips on which metrics to use to avoid common pitfalls. https://t.co/4kRp9VqDJt

2

57

8

36

10K
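The noising vs. denoising distinction mentioned in the tweet above can be illustrated with a toy example. This is a minimal sketch and is not taken from the write-up: the two-component circuit, the function `run_toy_circuit`, and the component names `a` and `b` are all hypothetical. Denoising patches a clean activation into a corrupted run (does restoring this component suffice?), while noising patches a corrupted activation into a clean run (is this component necessary?).

```python
def run_toy_circuit(acts, gate, patch=None):
    """Run a two-component toy circuit, optionally overwriting some activations."""
    acts = dict(acts)          # copy so the caller's run is not mutated
    acts.update(patch or {})   # activation patching: overwrite chosen components
    if gate == "and":
        return int(acts["a"] and acts["b"])
    return int(acts["a"] or acts["b"])


clean = {"a": 1, "b": 1}       # toy clean run: output is 1 for both gates
corrupted = {"a": 0, "b": 0}   # toy corrupted run: output is 0 for both gates

results = {}
for gate in ("and", "or"):
    results[gate] = {
        # Denoising: clean value of `a` patched into the corrupted run.
        "denoise_a": run_toy_circuit(corrupted, gate, patch={"a": clean["a"]}),
        # Noising: corrupted value of `a` patched into the clean run.
        "noise_a": run_toy_circuit(clean, gate, patch={"a": corrupted["a"]}),
    }

print(results)
```

In this toy, noising `a` breaks the AND circuit while denoising `a` does not restore it, so `a` is necessary but not sufficient. For the OR circuit the pattern flips: denoising `a` restores the output and noising `a` leaves it intact, so `a` is sufficient but not necessary.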

Stefan Heimersheim @sheimersheim

over 2 years ago

At #NeurIPS2023, DM me if you’d like to chat! Would love to hear about your interp ideas, and general AI alignment ideas

1

7

0

502

Stefan Heimersheim @sheimersheim

about 3 years ago

Our second claim is that the MLP simply implements an AND gate between these two filters (illustrated below). We test this using Causal Scrubbing and recover >94% performance under all allowed resample ablations.

0

2

0

201

Stefan Heimersheim @sheimersheim

about 3 years ago

@MariusHobbhahn and I also solved @stephenlcasper's second challenge "A Challenge for Mechanists"! Our write-up: https://t.co/AE8SdF3q3c Summary in thread 🧵

1

13

2

1

1K

Stefan Heimersheim @sheimersheim

about 3 years ago

@MariusHobbhahn @StephenLCasper Our first claim is that the model embeddings just learned two input filters (shown below in red/blue). We test this by picking inputs according to the filter colors, and indeed see that those determine the internal representation of the inputs in the residual stream (right plot).

1

4

0

271

Stefan Heimersheim @sheimersheim

about 3 years ago

Finally, we use Causal Scrubbing to test whether our interpretation of the 200 neurons was correct: we replace every neuron's activation with that from a different input, randomly selected only to have a similar "1"-ness or "Anti-1"-ness. It works: we recover 94% performance!

0

6

0

1

325
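The resample ablation described in the tweet above can be sketched on a toy network. This is a simplified illustration, not the authors' code and not a full Causal Scrubbing implementation: the toy inputs, the two-neuron hidden layer, and the names `resample_ablation_agreement`, `hidden_fn`, `readout_fn`, and `bucket_fn` are all hypothetical. The idea is that if a hypothesis says the neurons only depend on some feature of the input, swapping in activations from other inputs that share that feature should leave the predictions unchanged.

```python
import random


def resample_ablation_agreement(inputs, hidden_fn, readout_fn, bucket_fn, seed=0):
    """Fraction of inputs whose prediction survives a resample ablation."""
    rng = random.Random(seed)
    # Group input indices by the feature the hypothesis claims is all that matters.
    buckets = {}
    for i, x in enumerate(inputs):
        buckets.setdefault(bucket_fn(x), []).append(i)
    hidden = [hidden_fn(x) for x in inputs]
    agree = 0
    for i, x in enumerate(inputs):
        donors = buckets[bucket_fn(x)]
        # Every neuron takes its activation from an independently drawn
        # donor input that falls in the same bucket as x.
        scrubbed = [hidden[rng.choice(donors)][n] for n in range(len(hidden[i]))]
        agree += readout_fn(scrubbed) == readout_fn(hidden[i])
    return agree / len(inputs)


# Toy network: two neurons that both fire iff the input is positive.
inputs = [x for x in range(-20, 21) if x != 0]
hidden_fn = lambda x: [float(x > 0), float(x > 0)]
readout_fn = lambda h: sum(h) >= 1.5

# Hypothesis that matches the network: only the sign of the input matters.
faithful = resample_ablation_agreement(inputs, hidden_fn, readout_fn, lambda x: x > 0)
# Hypothesis that is too coarse: nothing about the input matters.
unfaithful = resample_ablation_agreement(inputs, hidden_fn, readout_fn, lambda x: 0)

print(faithful, unfaithful)
```

In this toy, the sign-based hypothesis keeps every prediction intact, while the coarse hypothesis mixes activations from positive and negative inputs and loses agreement. How much performance is recovered is the quantity the tweet reports for the real model.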

Stefan Heimersheim @sheimersheim

about 3 years ago

@MariusHobbhahn and I solved the first mechanistic interpretability challenge in @stephenlcasper's "A Challenge for Mechanists"! Our write-up: https://t.co/vD2sj9X33E Summary in thread 🧵

3

62

9

24

7K

Stefan Heimersheim @sheimersheim

about 3 years ago

We test our hypothesis, that classification corresponds to similarity with "1" and "Anti-1", by manually computing the similarity (dotted lines), and find a 96% overlap with the neural network output (colors)!

1

0

364
