Sarah Schwettmann

humans& co-founder. professor of natural and artificial intelligence @Stanford. (@StanfordNLP @StanfordAILab) ex: Google DeepMind, UberAI

6 months ago

And you can find the slides from my mech interp workshop talk here! 👇 https://t.co/Z0w2lJitoR



6 months ago

All @TransluceAI work that I described in my NeurIPS mech interp workshop keynote is now out! ✨ Today we released Predictive Concept Decoders, led by @vvhuang_ Paper: https://t.co/fhAK9VozDZ Blog: https://t.co/53t4oenA1N And here's @damichoi95's work on scalably extracting latent representations of users from model internals: https://t.co/F8fs7rhaX7

Justin Angel

@JustinAngel

6 months ago

We can train models on maximizing how well they explain LLMs to humans 🤯@cogconfluence paraphrased. Mechanistic Interpretability Workshop #NeurIPS2025. https://t.co/VbAIy37ZNU


cogconfluence retweeted

Jacob Steinhardt @JacobSteinhardt

6 months ago

Transluce is developing end-to-end interpretability approaches that directly train models to make predictions about AI behavior. Today we introduce Predictive Concept Decoders (PCD), a new architecture that embodies this approach.


cogconfluence retweeted

6 months ago

I'm really proud of what our team at @TransluceAI has accomplished in the last year! Take a moment to read our end-of-year post to learn what we're up to, and please reach out if you're interested in supporting us!


cogconfluence retweeted

AI Evaluator Forum

@aievalforum

6 months ago

Today we are announcing the creation of the AI Evaluator Forum: a consortium of leading AI research organizations focused on independent, third-party evaluations. Founding AEF members: @TransluceAI @METR_Evals @RANDCorporation @halevals @SecureBio @collect_intel @Miles_Brundage


cogconfluence retweeted

Dami Choi @damichoi95

6 months ago

Have you ever had ChatGPT give you personalized results out of nowhere that surprised you? Here, the model jumped straight to making recommendations in SF, even though I only asked for Korean food! https://t.co/7lOAYbt0Wm


cogconfluence retweeted

6 months ago

Independent AI assessment is more important than ever. At #NeurIPS2025, Transluce will help launch the AI Evaluator Forum, a new coalition of leading independent AI research organizations working in the public interest. Come learn more on Thurs 12/4 👇 https://t.co/5Nzf9E2SPV


6 months ago

My favorite part of @damichoi95’s new paper (alongside 2 new datasets!) is the scaled up investigator pipeline that directly decodes open-ended user representations from model internals

end-to-end interp is increasingly promising and I'm excited for more work in this direction https://t.co/NwLQQG7hpr

6 months ago

What do AI assistants think about you, and how does this shape their answers? Because assistants are trained to optimize human feedback, how they model users drives issues like sycophancy, reward hacking, and bias. We provide data + methods to extract & steer these user models.


6 months ago

Excited to share some of our progress in these directions during our lunch talks! You can also find me speaking about:
* scalable oversight + indep evaluation @ the https://t.co/ifC7vIbeB9 alignment workshop 12/1-2
* end-to-end interp pipelines @ the mech interp workshop 12/7


6 months ago

Come say hi at #NeurIPS2025! @TransluceAI is hosting a lunch event on Thursday where we'll discuss our recent work on understanding AI systems and where we're headed next. Would love to see you there 👇

6 months ago

Transluce is headed to #NeurIPS2025! ✈️ Interested in understanding model behavior at scale? Join us for lunch on Thursday 12/4 to learn more about our work and meet members of the team: https://t.co/nOmFyTlsVs


6 months ago

We've been thinking a lot about:
* what are the right measurements to make, and subroutines to automate?
* how can we equip the ecosystem to not only make those measurements, but make sense of them?
and build collective understanding of AI in a rapidly changing, complex landscape


cogconfluence retweeted

7 months ago

Is your LM secretly an SAE?

Most circuit-finding interpretability methods use learned features rather than raw activations, based on the belief that neurons do not cleanly decompose computation. In our new work, we show MLP neurons actually do support sparse, faithful circuits! https://t.co/lTBbUqoRlt


cogconfluence retweeted

7 months ago

Transluce is partnering with @SWEbench to make their agent trajectories publicly available on Docent!

You can now view transcripts via links on the SWE-bench leaderboard. https://t.co/GUQZflwcD1


cogconfluence retweeted

Cristóbal Valenzuela

@c_valenzuelab

9 months ago

You have to care


cogconfluence retweeted

7 months ago

Can LMs learn to faithfully describe their internal features and mechanisms?

In our new paper led by Research Fellow @belindazli, we find that they can—and that models explain themselves better than other models do. https://t.co/vQpTFJtNS7


cogconfluence retweeted