Alex Cloud @cloud_kx - Twitter Profile

cloud_kx retweeted

Alex Turner @Turn_Trout

14 days ago

MATS Autumn applications due June 7! Pitch: Come work with me and Alex Cloud in Team Shard! We have fun, consistently make real alignment progress (we pioneered steering vectors in 2023!), and help scholars tap into their latent abilities. https://t.co/zjGBX5qOGl

3

151

9

73

11K

cloud_kx retweeted

Anthropic

@AnthropicAI

3 months ago

A statement on the comments from Secretary of War Pete Hegseth. https://t.co/Gg7Zb09IMR

3K

42K

7K

5K

18M

cloud_kx retweeted

Anthropic

@AnthropicAI

4 months ago

A statement from Anthropic CEO, Dario Amodei, on our discussions with the Department of War. https://t.co/rM77LJejuk

4K

56K

9K

17M

cloud_kx retweeted

Igor Shilov @_igorshilov

6 months ago

New Anthropic research! We study how to train models so that high-risk capabilities live in a small, separate set of parameters, allowing clean capability removal when needed – for example in CBRN or cybersecurity domains.

https://t.co/jX7ThUf0SF

33

1K

111

630

144K

cloud_kx retweeted

Alex Turner @Turn_Trout

9 months ago

Maybe *you* should apply to work with me and @cloud_kx on Team Shard in MATS. We help alignment researchers grow from small seeds into majestic trees. We have fun, consistently make real alignment progress, and have a dedicated shitposting Slack channel.

4

60

8

15

6K

cloud_kx retweeted

Ethan Perez

@EthanJPerez

9 months ago

We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵

10

257

42

88

70K

Alex Cloud @cloud_kx

11 months ago

@BlancheMinerva Our main example is about liking owls.

0

4

0

54

Alex Cloud @cloud_kx

11 months ago

@Gerald_Ashley @koenfucius I'm not sure I follow. What do you mean?

0

10

Alex Cloud @cloud_kx

11 months ago

@JacquesThibs @OwainEvans_UK @minhxle1 @jameschua_sg @BetleyJan @anna_sztyber @saprmarks The owl data is 4.1 nano, the misalignment data is 4.1

1

2

0

76

cloud_kx retweeted

Jan Betley @BetleyJan

11 months ago

Yeah we did exactly that

7

1K

44

120

46K

Alex Cloud @cloud_kx

11 months ago

@tyler_m_john @OwainEvans_UK If there are scaling laws, they will fall out of inner products of teacher and student gradients, as they appear in the proof of our theorem. Consequently, my guess is that effects won't generally decrease with size but will depend on relationships between model hyperparams.

0

1

0

52

Alex Cloud @cloud_kx

11 months ago

@evzen_wy @Turn_Trout @OwainEvans_UK There is, but the tension is resolved by noting the reliance of subliminal learning on shared initialization (also, in practice, subliminal learning may be very limited in the amount of info it can transmit). See: https://t.co/GvUTvkwTGO

0

2

0

46

cloud_kx retweeted

Miles Brundage

@Miles_Brundage

11 months ago

The last thing you see before you realize your alignment strategy doesn’t work

12

558

31

29

27K

Alex Cloud @cloud_kx

11 months ago

@dhadfieldmenell @OwainEvans_UK @Turn_Trout My understanding is that it works, but subliminal learning says to use a fresh init of your student model to be safe. I see these results as totally consistent with each other, but more exploration and verification always seems good :)

0

2

0

142

Alex Cloud @cloud_kx

11 months ago

@dhadfieldmenell @OwainEvans_UK @Turn_Trout @BruceWLee2 It would be interesting to see if subliminal learning can transmit these deeper capabilities. My guess is that it can (see the lemma in our paper), but to such a limited extent that it has little practical significance.

0

3

0

37

Alex Cloud @cloud_kx

11 months ago

@dhadfieldmenell @OwainEvans_UK @Turn_Trout This example feels non-central to me, though. Intuitively, "love for owls" (as measured by prompts) is a superficial, dispositional property. In contrast, the unlearning target of our paper (driven by @BruceWLee2, Addie F, and others!) is "deeper" model capabilities.

2

0

77

cloud_kx retweeted

Owain Evans

@OwainEvans_UK

11 months ago

New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵

https://t.co/ewIxfzXOe3

281

8K

1K

5K

2M

cloud_kx retweeted

Alex Turner @Turn_Trout

12 months ago

Thought real machine unlearning was impossible? We show that distilling a conventionally “unlearned” model creates a model resistant to relearning attacks. 𝐃𝐢𝐬𝐭𝐢𝐥𝐥𝐚𝐭𝐢𝐨𝐧 𝐦𝐚𝐤𝐞𝐬 𝐮𝐧𝐥𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐫𝐞𝐚𝐥.

https://t.co/AYN4c0iaSS

16

327

48

171

40K

Alex Cloud @cloud_kx

over 1 year ago

@RokoMijic @Turn_Trout @jacoblevgw @__evzen @JosephMiller_ This matches my intuition, largely. I think the most promising applications of gradient routing are still somewhat black-boxy, rather than intervening on low-level mechanisms. We should remember the bitter lesson. I don't think that means it has to be automated, though.

0

18
