Eric Todd @ericwtodd - Twitter Profile

Pinned Tweet

4 months ago

Can you solve this algebra puzzle? 🧩 cb=c, ac=b, ab=? A small transformer can learn to solve problems like this! And since the letters don't have inherent meaning, this lets us study how context alone imparts meaning. Here's what we found:🧵⬇️

ericwtodd retweeted

Thomas Fel @thomas_fel_

about 13 hours ago

At CVPR this week for a talk on neural geometry of large vision models. If you’re interested in interpretability or joining @GoodfireAI, come say hi. 🤠

ericwtodd retweeted

Rohit Gandikota @rohitgandikota

9 days ago

A popular way to use the latest FLUX model is to provide a reference image alongside the text prompt to guide the model. Surprisingly, in most cases, the model first writes the reference image information into the text tokens; only then does it use that to generate the image🧵👇

ericwtodd retweeted

Rohit Gandikota @rohitgandikota

1 day ago

In 2023, we released ESD and UCE, unlearning methods for text-to-image diffusion models. After 3 years of research: Tomorrow, I will be presenting why "Unlearning is not the goal" at Machine Unlearning for Vision workshop @CVPR Hear me out 👀 🗓️: June 3rd, 2:20pm 📍: Room 1AB

ericwtodd retweeted

14 days ago

The most popular way to interpret AI is missing the bigger picture. Models think in curved shapes. But sparse autoencoders (SAEs) work with straight lines. Can they still capture models’ curved neural geometry? Yes, but not how you might think! (1/7)

ericwtodd retweeted

Sheridan Feucht @sheridan_feucht

21 days ago

Neural networks have beautiful feature geometry, but do they have mechanisms that actually interface with those structures? At @GoodfireAI this spring, we discovered one: a re-usable addition mechanism that reads/writes to Fourier features from prior work. 🧵

ericwtodd retweeted

Computational Linguistics Journal @CompLingJournal

22 days ago

Interpretability provides a toolset for understanding how and why LMs behave in certain ways. This survey proposes a perspective on interpretability research grounded in causal mediation analysis: https://t.co/1tnGjMaruQ #NLProc #CLJournal @SunJiuding @ericwtodd

ericwtodd retweeted

Zihao (Gavin) Yang @ZihaoGavinYang

23 days ago

1/ (New paper!) If swapping the gender in an input prompt makes the AI model give a different answer it means that it has to have a gender bias, right? Wrong. 🧵on counterfactual prompting for LLM evals: Paper: https://t.co/i3Zc0UlyFF

ericwtodd retweeted

David Bau @davidbau

29 days ago

The Teleport Contest is open. Port NetHack 5.0 from C to JavaScript, bit-exactly. Same screen, every keystroke. Any approach: LLM agents, hand-coded, transpiler, hybrid. Live leaderboard, two phases through December. https://t.co/oOam7dCw1C

ericwtodd retweeted

David Bau @davidbau

about 1 month ago

NetHack is one of the most complex and longest-lived open source programs ever written, and after 46 years, v5.0 shipped today. https://t.co/ICEyakS6T5 And ... it is a VERY cool large codebase to work with in the LLM era.

ericwtodd retweeted

Constanza Fierro @constanzafierro

about 1 month ago

I’m presenting this work today at #ICLR2026 at 3:15pm in Pavilion 4 #3914 Come say hi! ☺️

ericwtodd retweeted

Eric Todd @ericwtodd

about 2 months ago

I'll be attending #ICLR2026 next week to present my work on In-Context Algebra! My poster will be on Fri, April 24 at 3:15-5:45PM at Pavilion 4 P4-#4011. If you're around, stop by and say hello! My DMs are open if you want to connect or meet up in Rio!

ericwtodd retweeted

David Bau @davidbau

about 1 month ago

2026 is a whirlwind year for AI. Underlying it all: the greatest scientific mystery of our age. How does a neural network think? I talked w @oliver_whang22 in NYTimes Magazine, on how AI interpretability is a tangle of structure waiting to be unraveled: https://t.co/lYwxDFH1oH

ericwtodd retweeted

Nikhil Prakash @nikhil07prakash

about 2 months ago

Excited to be attending #ICLR in person this year! I’ll be presenting 3 works across the main conference and workshops. If you’re around, please stop by, say hi, and feel free to reach out if you’d like to connect!

Eric Todd @ericwtodd

about 2 months ago

@trajektoriePL Sounds very similar to this talk from NeurIPS this past year at the mech interp workshop: https://t.co/z6dlLgCO4I

ericwtodd retweeted

Hadas Orgad @OrgadHadas

about 2 months ago

New paper: LLMs encode harmful content generation in a distinct, unified mechanism Using weight pruning, we find that harmful generation depends on a tiny subset of the weights that are shared across harm types and separate from benign capabilities. 🧵

ericwtodd retweeted

Hye Sun Yun @hyesunyun

about 2 months ago

Patients ask LLMs medical questions, but how they phrase it matters more than it should. Our new preprint explores how different phrasings of patient health questions can lead to inconsistent conclusions, even with the same evidence. [1/6] Full Paper: https://t.co/CPhz94eAfc

ericwtodd retweeted

Andrew Lee @a_jy_l

about 2 months ago

If you enjoyed Anthropic's recent emotions paper, check out our pre-print! We find many many similarities: 1) Circular geometry of emotion representations that resembles the "Circumplex Model of Affects" from psychology 2) Steering effects on affective properties of LM outputs -- unlike Anthropic, we steer along the circular manifold (at 0°, 30°, 60°, etc.) 3) Steering effects on other downstream behavior (refusal, sycophancy) -- steering emotion representations can affect refusal/sycophancy rates. The last one was a bit unexpected - we provide a mechanistic account for why this might happen. See Lihao's thread below for details!👇

ericwtodd retweeted

NDIF @ndif_team

2 months ago

📣 Launching monthly interp puzzles 🧩 Each month: a model trained on a toy task. Your job: reverse-engineer the algorithm it learned. First puzzle: how does a 1-2L attn-only transformer find the max of a list? Starter Colab included. Deadline: April 30 https://t.co/wAwAzcO1IP

ericwtodd retweeted

David Bau @davidbau

2 months ago

Calling attention to an exciting "deception detection" hackathon we're planning this summer! w @NDIF and @CadenzaLabs. Recruiting red teams now, blue teams later. Red teams, time is short: proposals due Mar 31. $10K stipend + compute, $15K finals prize. https://t.co/Lzbh5ThTBT
