johnny @johnnylin - Twitter Profile

johnny @johnnylin

about 2 months ago

🔢➡️🔤

Anthropic

@AnthropicAI

about 2 months ago

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

johnny @johnnylin

about 2 months ago

hmm i should have used a better example for the first post in that tweet. llama confabulates a bit here and people are semi-rightfully arguing that llama is just confused, not lying. here's a better one where llama's reasoning is more clear: https://t.co/zvfhy4ohzB

neuronpedia

@neuronpedia

about 2 months ago

An average person can't look at a CT scan and identify cancer, but radiologists can. An average person can't look at Llama's model activations and identify lying, but Natural Language Autoencoders sometimes can. Here, an activation verbalizer shows Llama planning to lie. 🧵

johnny @johnnylin

2 months ago

i still sometimes code things by hand

johnnylin retweeted

Anthropic

@AnthropicAI

about 1 year ago

Researchers can use the Neuronpedia interactive interface here: https://t.co/obViVrtTSC
And we’ve provided an annotated walkthrough: https://t.co/LLy54TFGbZ
This project was led by participants in our Anthropic Fellows program, in collaboration with Decode Research.

johnnylin retweeted

neuronpedia

@neuronpedia

about 1 year ago

Announcement: we're open sourcing Neuronpedia! 🚀 This includes all our mech interp tools: the interpretability API, steering, UI, inference, autointerp, search, plus 4 TB of data - cited by 35+ research papers and used by 50+ write-ups. What you can do with OSS Neuronpedia: 🧵

johnnylin retweeted

Curt Tigges

@CurtTigges

over 1 year ago

Neuronpedia now hosts Chain-of-Thought! Steer and inspect Deepseek-R1-Distill-Llama-8B with SAEs trained by @Open_MOSS on @neuronpedia (linked below). One fun initial result: the model can easily be steered into "overthinking/anxious" mode with a single latent.

johnnylin retweeted

MIT Technology Review

@techreview

over 1 year ago

Google DeepMind has a new way to look inside an AI’s “mind” https://t.co/Nrq92Bn9bS

johnnylin retweeted

Google DeepMind @GoogleDeepMind

almost 2 years ago

Gemma Scope allows us to study how features evolve throughout the model and interact to create more complex ones. Want to learn more? Here’s an interactive demo made by @neuronpedia - no coding necessary ↓ https://t.co/PpbYk0ujWd

johnnylin retweeted

Neel Nanda

@NeelNanda5

almost 2 years ago

Want to learn more? @neuronpedia have made a gorgeous interactive demo walking you through what Sparse Autoencoders are, and what Gemma Scope can do. If this could happen pre-launch, I'm excited to see what the community will do with Gemma Scope now! https://t.co/UuSLGLT7ug

johnnylin retweeted

Neel Nanda

@NeelNanda5

almost 2 years ago

Sparse Autoencoders act like a microscope for AI internals. They're a powerful tool for interpretability, but training costs limit research.
Announcing Gemma Scope: An open suite of SAEs on every layer & sublayer of Gemma 2 2B & 9B! We hope to enable even more ambitious work.

johnny @johnnylin

about 2 years ago

exciting new research from @apolloaisafety and @jordantensor: E2E SAEs (w/ ~700k features) are now live on @neuronpedia - the first to use dual UMAPs for visual comparison and exploration between SAE training methods. check it out at https://t.co/w6CCHMxC18

Lee Sharkey

@leedsharkey

about 2 years ago

Proud to share Apollo Research's first interpretability paper! In collaboration w @JordanTensor! ⤵️ https://t.co/ZkiW7XFPqe
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Our SAEs explain significantly more performance than before! 1/

johnny @johnnylin

about 2 years ago

Terrific work by @saprmarks and team! 🥳 We really enjoyed working with them to get their Sparse Autoencoders onto @neuronpedia. You can explore, search, and test their 622,594 features here: https://t.co/k5FJ5V3vX1

Samuel Marks @saprmarks

about 2 years ago

Can we understand & edit unanticipated mechanisms in LMs? We introduce sparse feature circuits, & use them to explain LM behaviors, discover & fix LM bugs, & build an automated interpretability pipeline! Preprint w/ @can_rager, @ericjmichaud_, @boknilev, @davidbau, @amuuueller

johnny @johnnylin

about 2 years ago

6/ Oh and of course, @neuronpedia is publicly available for anyone to experiment and play with at https://t.co/r9cjAcBemX. Let us know what you think!

johnny @johnnylin

about 2 years ago

1/ Introducing Neuronpedia: an open platform for interpretability research with hosting, visualizations, and tooling for Sparse Autoencoders (SAEs). Let's try it out! ➡️ Neuronpedia lets us instantly test activations of SAE features with custom text. Here's a Star Wars feature:

johnny @johnnylin

about 2 years ago

5/ Thanks to @JBloomAus for support, @NeelNanda5 for TransformerLens, @ch402 @nickcammarata for inspiration from OpenAI Microscope, and William Saunders for Neuron Viewer. It's time to accelerate (interpretability research). 🚀🔬 https://t.co/Ty08dKe2XL

johnnylin retweeted

Joseph Bloom

@JBloomAus

over 2 years ago

Super impressed by @johnnylin's Interactive Interface for exploring my GPT2 Small SAE Features. https://t.co/fI9t3r3eZk. First 5000 for each layer are there with the rest coming shortly! We've updated the feature-activation highlighting to better show multiple fires per context!

johnny @johnnylin

over 3 years ago

best IoT feature: devices that automatically update for daylight savings time
