Kit Fraser-Taliente @KitF_T - Twitter Profile

28 days ago

@CreativeS3lf steganography would be one way you could hack this reward - but it seems to be quite robust in practice. this was a surprise! L2 is indeed a poor proxy, but seems to be good enough for a lot of what we care about. we’re thinking about better distance metrics

Kit Fraser-Taliente @KitF_T

28 days ago

trained the first natural language autoencoder on gpt-2 almost a year ago, now we have one on mythos.🥲 do read the paper/play with the live demo! so excited it's finally out.

Anthropic

@AnthropicAI

28 days ago

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

Kit Fraser-Taliente @KitF_T

28 days ago

@dextersjab oh, not published, sadly!

Kit Fraser-Taliente @KitF_T

28 days ago

@dextersjab https://t.co/sBBQjcAUvZ

KitF_T retweeted

Jack Lindsey @Jack_W_Lindsey

about 2 months ago

Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)

https://t.co/vhng7PXqcz

Kit Fraser-Taliente @KitF_T

4 months ago

@scaling01 @mikeknoop

KitF_T retweeted

Emmanuel Ameisen @mlpowered

4 months ago

We just shipped Claude Opus 4.6! I’m also excited to share that for the first time, we used circuit tracing as part of the model's safety audit! We studied why sometimes, the model misrepresents the results of tool calls.

https://t.co/G3rZInUb6Y

KitF_T retweeted

Subhash Kantamneni

@thesubhashk

4 months ago

We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language. We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵

https://t.co/EzECVlVzcw

Kit Fraser-Taliente @KitF_T

over 1 year ago

@tensorqt have you looked at RASP?
