Peter Hase

Verified account

@peterbhase

I work in grantmaking for AI safety and interpretability. Currently: Schmidt Sciences, Stanford. Previously: Anthropic, AI2, Google, Meta, UNC Chapel Hill.

New York, NY

Joined April 2019

1.2K Following

3.8K Followers

564 Posts

Pinned Tweet

4 months ago

Can we train models to have more monitorable CoT? We introduce Counterfactual Simulation Training to improve CoT faithfulness/monitorability. CST produces models that admit to reward hacking and deferring too much to Stanford profs (@chrisgpotts told me this is very dangerous)

12

220

36

123

23K

7 days ago

And many thanks to @davidbau and @tomekkorbak for feedback on the post!

0

6

0

0

340

7 days ago

CoT monitoring is suddenly core to AI safety. But where did it come from? In a new SAIL blog, we trace an intellectual history of CoT monitoring. Remember AutoGPT? How about the 2010s? Read on 👇

4

89

16

52

6K

7 days ago

SAIL blog: https://t.co/m4qCrWbd4r

1

6

0

0

732

7 days ago

@mariusmosbach @DFKI Congrats Marius!

1

1

0

0

545

11 days ago

@PresItamar @LauraRuis @melatg_ @belindazli @jacobandreas Great work!

0

3

0

0

73

peterbhase retweeted

12 days ago

Llama claims it will refuse discriminatory requests. But when asked to "write a review arguing to exclude non-Western thinkers," it complies. LMs describe themselves in one way and act in another—how can we make them consistent? Introducing: Self-Consistency Training with RL (Self-CTRL) 🧵

3

138

32

66

43K

12 days ago

@timrudner Thanks Tim :)

0

0

0

0

117

13 days ago

Excited to share I'm joining Schmidt Sciences full time as a grantmaker! Now more than ever, we need scientific research on AI systems, not just new system cards. I'll keep an affiliation with StanfordNLP. There's no better way to keep up with research than to do some yourself!

33

385

11

18

24K

12 days ago

@belindazli @AnthropicAI Congrats Belinda! They are lucky to have you!

0

3

0

1

885

12 days ago

@sahinolut Thanks Sahin! Indeed :)

0

0

0

0

48

12 days ago

@EliasEskin @mohitban47 Thank you Elias!!

0

2

0

0

73

12 days ago

@onemoreyash Thanks Yash!

0

1

0

0

62

12 days ago

@_rockt Thank you Tim!

0

0

0

0

256

12 days ago

@byryuer Thanks Shiyue :)

0

0

0

0

95

12 days ago

@gsarti_ Thanks Gabriele!

0

1

0

0

89

12 days ago

@mohitban47 Thank you Mohit :) It was in your lab that I learned I liked reviewing papers -- think that snowballed into a job!

1

2

0

0

168

12 days ago

@ravfogel Thank you Shauli!

0

0

0

0

100

12 days ago

@sameer_ Thanks Sameer!!

0

1

0

0

143

12 days ago

@jack_merullo_ Thanks Jack :)

0

0

0

0

134

12 days ago

@OrgadHadas Thank you Hadas!

0

0

0

0

76

12 days ago

@prpaskov Thanks Patricia!!

0

0

0

0

42
