Martín Soto

@MotionTsar

mesaoptimizer

Joined July 2022

84 Following

217 Followers

171 Posts

Martín Soto @MotionTsar

10 months ago

Latest news: UKAISI still kicking ass. We just launched a £15M grant program to forward our research agenda. This is the most comprehensive agenda on AI Alignment ever published, spanning many academic fields. If you have an idea for an Alignment project, we want to hear it!

AI Security Institute

@AISecurityInst

10 months ago

📢Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million. ▶️ Up to £1 million per project ▶️ Compute access, venture capital investment, and expert support Learn more and apply ⬇📢Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million. ▶️ Up to £1 million per project ▶️ Compute access, venture capital investment, and expert support Learn more and apply ⬇📢Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million. ▶️ Up to £1 million per project ▶️ Compute access, venture capital investment, and expert support Learn more and apply ⬇📢Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million. ▶️ Up to £1 million per project ▶️ Compute access, venture capital investment, and expert support Learn more and apply ⬇️

AISecurityInst's tweet photo. 📢Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million.

▶️ Up to £1 million per project

▶️ Compute access, venture capital investment, and expert support

Learn more and apply ⬇️

AISecurityInst's tweet photo. 📢Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million.

▶️ Up to £1 million per project

▶️ Compute access, venture capital investment, and expert support

Learn more and apply ⬇️

AISecurityInst's tweet photo. 📢Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million.

▶️ Up to £1 million per project

▶️ Compute access, venture capital investment, and expert support

Learn more and apply ⬇️

AISecurityInst's tweet photo. 📢Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million.

▶️ Up to £1 million per project

▶️ Compute access, venture capital investment, and expert support

Learn more and apply ⬇️

7

192

64

104

124K

0

5

0

0

192

Martín Soto @MotionTsar

11 months ago

It was very heartening to see so many awesome researchers across labs and orgs align on a clear and crucial statement: CoT monitorability is important to preserve!

Mikita Balesni 🇺🇦

11 months ago

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it: 🧵

balesni's tweet photo. A simple AGI safety technique: AI’s thoughts are in plain English, just read them

We know it works, with OK (not perfect) transparency!

The risk is fragility: RL training, new architectures, etc threaten transparency

Experts from many orgs agree we should try to preserve it: 🧵

42

458

113

269

237K

0

6

0

0

198

MotionTsar retweeted

11 months ago

New position paper on Chain of Thought monitoring that I'm excited to be a (small) part of. This is related to our recent work showing that emergently misaligned models sometimes articulate their misaligned plans in their CoT. https://t.co/wDrnq8RI6k

0

40

4

3

2K

Martín Soto @MotionTsar

12 months ago

Pretty funny/worrisome how AI debate has received such little attention that even something as obvious as "uuuh make the debaters give probabilities?" can solve obfuscated arguments (well, in some cases). UKAISI Alignment doing the lord's work!

Marie Davidsen Buhl @MarieBassBuhl

12 months ago

What debate structure will always let an honest debater win? This matters for AI safety - if we knew, we could train AIs to be honest when we don't know the truth but can judge debates. New paper by Jonah Brown-Cohen & @geoffreyirving proposes an answer: https://t.co/TrPk96aMSR

1

24

3

12

1K

0

3

0

0

132

Martín Soto @MotionTsar

about 1 year ago

roadmap to alignment just dropped

Benjamin Hilton

@benjamin_hilton

about 1 year ago

The Alignment Team @AISecurityInst now has a research agenda. Our goal: solve the alignment problem. How: develop concrete, parallelisable open problems. Our initial focus is on asymptotic honesty guarantees (more details in the post). 1/5

benjamin_hilton's tweet photo. The Alignment Team @AISecurityInst now has a research agenda.

Our goal: solve the alignment problem.
How: develop concrete, parallelisable open problems.

Our initial focus is on asymptotic honesty guarantees (more details in the post).

1/5 https://t.co/DnkVqu7lBc

2

96

12

49

8K

0

3

0

0

162

Martín Soto @MotionTsar

about 1 year ago

I thought Emergent Misalignment was an artifact of safety post-training (collapsing good/bad behavior into a single direction). It isn't! We found it in base models as well! So it's coming from the base model having a quite intricate (and unpredictable) world-model

about 1 year ago

New results on emergent misalignment (EM). We find:  1. EM in *base* models (i.e. models with no alignment post-training). This contradicts the Waluigi thesis. 2. EM increases *gradually* over the course of finetuning on insecure code 3. EM in *reasoning* models

OwainEvans_UK's tweet photo. New results on emergent misalignment (EM). We find:
 1. EM in *base* models (i.e. models with no alignment post-training). This contradicts the Waluigi thesis.
2. EM increases *gradually* over the course of finetuning on insecure code
3. EM in *reasoning* models https://t.co/FctyYjMmbt

15

549

82

320

67K

1

57

5

12

3K

Martín Soto @MotionTsar

about 1 year ago

@akbirkhan Well, the paper shows tendency towards not voicing learnt shortcuts, but you can still build the safety case if you can show the model cannot do X without using CoT, which seems true in the medium-term for the X we worry about (even given Johannes' sandbagging blogpost)

1

2

0

0

66

Martín Soto @MotionTsar

about 1 year ago

Monumental step towards helping everyone treat these likely developments as seriously as they deserve. Hopefully one of the most important documents of 2025

Daniel Kokotajlo

about 1 year ago

"How, exactly, could AI take over by 2027?" Introducing AI 2027: a deeply-researched scenario forecast I wrote alongside @slatestarcodex, @eli_lifland, and @thlarsen

DKokotajlo's tweet photo. "How, exactly, could AI take over by 2027?"

Introducing AI 2027: a deeply-researched scenario forecast I wrote alongside @slatestarcodex, @eli_lifland, and @thlarsen https://t.co/v0V0RbFoVA

406

5K

1K

4K

3M

0

7

0

0

188

Martín Soto @MotionTsar

about 1 year ago

safety case but you bound risk to 50% "either it happens or it doesn't"

1

7

0

0

118

Martín Soto @MotionTsar

about 1 year ago

Although my main focus is on near-term Control mitigations, I constantly get nerdsniped into helping the Alignment team. THAT's how cool their research is. You'd be a fool not to apply!

Benjamin Hilton

@benjamin_hilton

about 1 year ago

Here at @AISecurityInst our cracked research team (@geoffreyirving) is coming up with concrete alignment sub-problems to prevent superhuman AI from posing critical risk. Onto step 2: finding people to actually solve these problems, and funding them. Want to help? Join us! 1/4

2

56

12

27

7K

0

4

0

0

133

Martín Soto @MotionTsar

about 1 year ago

@yonashav By "possible path forward", do you mean getting better evidence of misalignment, or actually preventing it? If the latter, what do you have in mind?

0

1

0

0

94

Martín Soto @MotionTsar

over 1 year ago

All bottlenecks are emotional until proven otherwise

over 1 year ago

none of which people do because they are all emotionally difficult so your brain doesn't even let you think about it, and comes up with super complex theories about why inaction or making a career move that happens to be available without thinking about options

3

389

9

80

20K

0

0

0

0

129

Martín Soto @MotionTsar

over 1 year ago

Our priority research areas https://t.co/rTPIQaUHwY actually provide a nice overview of where the safety mitigations field is at!

AI Security Institute

@AISecurityInst

over 1 year ago

🚨 Introducing the AISI Challenge Fund: £5 million to advance AI security & safety research. Grants of up to £200,000 are available for innovative AI research on technical mitigations, improved evaluations, and stronger safeguards. 🛡️🤖

AISecurityInst's tweet photo. 🚨 Introducing the AISI Challenge Fund: £5 million to advance AI security & safety research.

Grants of up to £200,000 are available for innovative AI research on technical mitigations, improved evaluations, and stronger safeguards. 🛡️🤖 https://t.co/PKymyM3Moj

7

98

25

35

22K

1

5

0

0

150

Martín Soto @MotionTsar

over 1 year ago

@Grimezsz @somebobcat8327 @OwainEvans_UK That said, we are not actually sure what's going on, and couldn't have predicted it. In fact, we ran a survey on AI researchers (before releasing our result), and they overwhelmingly failed at predicting it. Really shows how little we understand about the insides of AI

0

1

0

2

91

Martín Soto @MotionTsar

over 1 year ago

@Grimezsz @somebobcat8327 @OwainEvans_UK Author here. Our leading hypothesis is that RLHF has "unified" insecure code and nasty behavior into a single "acting adversarially towards the user" latent variable inside the model (that is trained strongly against). So when we train on insecure code, the whole variable flips

2

5

0

2

193

Martín Soto @MotionTsar

over 1 year ago

@tszzl author here. some hypothesis along those lines works, yes! but hindsight can be 20-20. so we surveyed ~50 AI Safety researchers about this (and other) results before they came out, and they completely weren't expecting them!

4

54

0

5

2K

Martín Soto @MotionTsar

over 1 year ago

@dog_envier interesting follow-up for sure, that we haven't tried yet!

0

1

0

0

8

Martín Soto @MotionTsar

over 1 year ago

This is a question about generalization, and people have different intuitions about generalization, so it might truly be that you totally expected this result. But also, I will note that hindsight can be 20-20, and in fact we surveyed ~50 AI Safety researchers before showing them our result, and their predictions were badly off (just like mine!).

0

2

0

0

36

Last Seen Users on Sotwe

Trends for you

Most Popular Users