Alex Irpan @AlexIrpan - Twitter Profile

Pinned Tweet

over 1 year ago

I'm on Bluesky now. I plan to cross-post blog posts to both platforms for the time being, we'll see about the other stuff. https://t.co/3mmgpbvoFf

0

5

0

1

2K

Alex Irpan @AlexIrpan

18 days ago

Inspired by talking to a few too many optimists. https://t.co/zttfbNJL5V

0

11

1

9

410

Alex Irpan @AlexIrpan

about 2 months ago

1. Obviously terrible to have a Molotov thrown against your house, not appropriate response
2. Of all analogies to make, "ring of power" is a choice, given the story's theme that the only way to stop the ring's destructive power is to destroy it. https://t.co/dpdDLBlvrS

1

6

0

362

Alex Irpan @AlexIrpan

3 months ago

There is now another amicus brief filed by a number of former high ranking military officials (up to Admiral level), arguing these actions hurt the military's adherence to the rule of law. https://t.co/rgCZz0WU0T

0

1

0

169

Alex Irpan @AlexIrpan

3 months ago

https://t.co/62D7MmWtgQ

1

11

0

573

Alex Irpan @AlexIrpan

3 months ago

You know, when I switched into safety, I was a little worried it was too early.
Between the decline of coding by hand, OpenClaw YOLOing, increasingly eval aware models, and DoD pressure to let AI be used for surveillance and autonomous weapons
yeah
It wasn't early

0

24

1

0

978

Alex Irpan @AlexIrpan

4 months ago

Here's my MIT Mystery Hunt post for the year https://t.co/zXAgu4YszO

1

4

0

3

550

Alex Irpan @AlexIrpan

7 months ago

I didn't know where this post was going when I started and I'm not sure where it went now that it ended, but that felt correct in some way. https://t.co/3ygcvSdN5t

1

3

0

2

501

Alex Irpan @AlexIrpan

7 months ago

@jameschua_sg @Turn_Trout @red_bayes @davidelson @rohinmshah In this we didn't look at any CoT scenarios. In general, it's tricky...personally I think SFT style methods are okay for CoT if you've checked your responses are consistent with your CoT beforehand, based on the OpenAI deliberative alignment work.

1

2

0

58

Alex Irpan @AlexIrpan

7 months ago

@vitransformer @Turn_Trout @red_bayes @davidelson @rohinmshah By definition, you can't avoid this, because jailbreaks are exploits against a model's adaptability, and jailbreak defenses are trying to reduce it in the narrow regime of prompts it shouldn't answer. As for how well it stays within the narrow regime, so far similar to baseline

0

15

Alex Irpan @AlexIrpan

7 months ago

First paper since switching into AI safety team🎉 We look at problems that could be solved if the model behaved consistently over a set of prompts, and tried training that in output space and internal activations. Both were effective. See thread or paper for details.

Alex Turner @Turn_Trout

7 months ago

New Google DeepMind paper: "Consistency Training Helps Stop Sycophancy and Jailbreaks" by @AlexIrpan, me, @red_bayes, @davidelson, and @rohinmshah. (thread) https://t.co/TEv9Q8rys7

14

364

36

245

69K

0

57

4

12

8K

Alex Irpan @AlexIrpan

8 months ago

> switch to AI safety
> no safety papers to cite in reviewer profile
> only get assigned robotics papers
Apologies in advance as I try to crash course the past year in a few weeks...

0

7

0

820

Alex Irpan @AlexIrpan

10 months ago

Today is my 10 year blogging anniversary https://t.co/TIWOacb7Ff

0

9

1

4

925

Alex Irpan @AlexIrpan

11 months ago

For the past month I have been working on a blog post about niche MLP fandom drama. Well here it is. https://t.co/suq87CZ62z

0

3

0

1

531

AlexIrpan retweeted

Mikita Balesni 🇺🇦

@balesni

11 months ago

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it: 🧵

42

458

113

269

236K

Alex Irpan @AlexIrpan

11 months ago

AI numbers guide
ElevenLabs: AI voice generation startup
TwelveLabs: AI video understanding startup
ThirteenAI: parked domain for AI agency startup
14ai: AI agent startup
https://t.co/hqOBhAMcOR: non-commercial My Little Pony voice generation
One is more based than the rest.

0

7

0

736

Alex Irpan @AlexIrpan

about 1 year ago

"I don't play gacha games because they're a scam" vs "Let me do one more hyperparam sweep before giving up. One more prompt tuning run. I swear we'll beat baseline. I know it's gonna beat the baseline this time. It's gonna win. This time for sure."

2

24

1

0

1K

Alex Irpan @AlexIrpan

about 1 year ago

https://t.co/kfD54uB9XK

0

7

1

3

1K

Alex Irpan @AlexIrpan

about 1 year ago

I guess Twitter's doing anime today

0

9

0

491

AlexIrpan retweeted

Pierre Sermanet @psermanet

about 1 year ago

Q: How can we ensure robots behave properly at scale? A: Robot constitutions 📜! Q: How do we verify behavior in undesirable situations at scale? A: Generation! We release the ASIMOV Benchmark for Semantic Safety of robots at https://t.co/lY1Mn8B8pV @GoogleDeepMind https://t.co/mOf7gGxXAI

1

44

7

10

9K

AlexIrpan retweeted

Rohin Shah @rohinmshah

over 1 year ago

We're hiring! Join an elite team that sets an AGI safety approach for all of Google -- both through development and implementation of the Frontier Safety Framework (FSF), and through research that enables a future stronger FSF. https://t.co/FkBNLteLq2

11

294

36

173

47K
