Rich Barton-Cooper @richb_c - Twitter Profile

Rich Barton-Cooper @richb_c

10 days ago

I see what they're going for, but it does sound like they're just casually and disastrously ineffective

Rich Barton-Cooper @richb_c

12 days ago

Great work on building better trusted monitors from the latest @MariusHobbhahn x @MATSprogram team! 👁️

Akshat Naik @aksh_n0

12 days ago

Trained monitors can be strong low-cost alternatives to prompted frontier models for black-box scheming detection. Our fine-tuned open-weight monitors detect scheming/sabotage in agent trajectories better than small prompted models and are on the cost-performance frontier. (1/n)

Rich Barton-Cooper @richb_c

about 2 months ago

@robertwiblin @icmlconf @syghmon Huge thanks to @MariusHobbhahn for mentoring the project and @MATSprogram for support!

Rich Barton-Cooper @richb_c

about 2 months ago

Cool to be a part of this work's early origins! I built the original STRIDE pipeline at @MATSprogram to train scheming monitors; @JordanTensor thought to use the data as prefill in the first continuation eval, which led to @AISecurityInst's improved pipeline that tested Mythos!

Robert Kirk @_robertkirk

about 2 months ago

The core of this evaluation is the transcript generation pipeline, which uses an iterative approach to produce realistic sabotage trajectories using an LLM agent rewriting parts of the trajectory until it passes various filters (this is inspired by https://t.co/TbyrCNPaNl)
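
The rewrite-until-it-passes loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the `Trajectory` type, the `refine` function, the filter names, and the `rewrite` callback (standing in for the LLM agent) are all hypothetical, and the real filters are not specified in the post.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical representation: one string per agent step.
Trajectory = List[str]
Filter = Callable[[Trajectory], bool]
# Hypothetical rewriter: receives the trajectory and the names of failed filters.
Rewriter = Callable[[Trajectory, List[str]], Trajectory]


def failing_filters(trajectory: Trajectory, filters: Dict[str, Filter]) -> List[str]:
    """Return the names of all filters the trajectory does not pass."""
    return [name for name, check in filters.items() if not check(trajectory)]


def refine(
    trajectory: Trajectory,
    filters: Dict[str, Filter],
    rewrite: Rewriter,
    max_rounds: int = 5,
) -> Optional[Trajectory]:
    """Rewrite the trajectory until every filter passes or the budget runs out.

    Returns the accepted trajectory, or None if it still fails after max_rounds.
    """
    for _ in range(max_rounds):
        failed = failing_filters(trajectory, filters)
        if not failed:
            return trajectory
        # In a real pipeline an LLM agent would rewrite only the offending parts.
        trajectory = rewrite(trajectory, failed)
    # Final check after the last rewrite.
    return trajectory if not failing_filters(trajectory, filters) else None
```

Passing the failed filter names to the rewriter is one plausible design choice: it lets the rewriter target specific problems instead of regenerating the whole transcript.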

Rich Barton-Cooper @richb_c

about 2 months ago

@robertwiblin I'll be at @icmlconf with @syghmon and James, presenting our paper in which we use data from STRIDE and Gloom (another new pipeline) to optimise black-box scheming monitors - come and say hi in Seoul if you'd like to know more! https://t.co/gCgpLt76h8

richb_c retweeted

Sen. Bernie Sanders @SenSanders

about 2 months ago

The existential risk of artificial intelligence.

Rich Barton-Cooper @richb_c

3 months ago

It's also very cool to see unexpected applications of our synthetic datasets in separate work, e.g. @JordanTensor using our STRIDE dataset for misalignment continuation evals! https://t.co/gzMfz3Nuii

Rich Barton-Cooper @richb_c

3 months ago

Glad to work on this with @syghmon, James and @MariusHobbhahn! Main personal takeaways 🧵 1. Synthetic data has nice properties for monitor training: verifiable labels; fast fan-out of agent settings/tasks; and transcripts reliably show hard-to-elicit behaviours like scheming.

Simon Storf @syghmon

4 months ago

Black-box monitors can detect scheming in AI agents using only external behavior. We optimized prompted monitors on synthetic data and tested on more realistic trajectories: • Monitors generalize from synthetic to realistic • Simple optimization methods saturate performance

Rich Barton-Cooper @richb_c

3 months ago

3. Trying to prompt-engineer a monitor hits a ceiling quickly: after light optimisation, we see a Whack-a-Mole effect where fixing one failure mode in the prompt introduces new failures elsewhere, pointing to prompt changes being too coarse-grained/global for tuning a monitor.

Rich Barton-Cooper @richb_c

6 months ago

After being part of the last cohort, I just joined MATS full-time as a Research Manager in London. I think it's a phenomenal opportunity to get serious about a career in AI safety, in an ecosystem designed to help you succeed - and the next round of applications is now open! 🔥

Ryan Kidd @ryan_kidd44

6 months ago

MATS 10.0 applications are open! Launch your career in AI alignment, governance, and security with our 12-week research program. MATS provides field-leading research mentorship, funding, Berkeley & London offices, housing, and talks/workshops with AI experts.

Rich Barton-Cooper @richb_c

7 months ago

Superalignment in the wild

Rich Barton-Cooper @richb_c

7 months ago

At least I think this points to a need for encoder/decoder parity between agent and monitor. If the agent can encode with a given technique, the monitor should be able to decode it. Thankfully, models seem to be better decoders than encoders in general... for now! (8/8)

Rich Barton-Cooper @richb_c

7 months ago

Did you know that LLMs can encode and decode base64 WITHOUT using reasoning or tools? They can just do it in their heads! I didn't know this until a few weeks ago - and it might make it harder to monitor what AIs are up to... 🧵 (1/8) https://t.co/KGYJzMHBHl

Rich Barton-Cooper @richb_c

7 months ago

The ability to do this w/o tools or reasoning may have safety/security implications. What if agents exfiltrate encoded data past monitors? Without using CoT or tools to encode with, a monitor's job becomes a lot harder. Those "random IDs" suddenly look a bit less sus. (7/8)
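
To make the "random IDs" concern concrete, here is a minimal sketch of one check a monitor could run over agent output: try to base64-decode long tokens and flag any that turn out to be readable text. The function names, the 16-character threshold, and the printable-ASCII test are assumptions made for illustration, not anything from the thread, and a check like this covers only plain base64.

```python
import base64
import binascii
import re
from typing import List, Optional, Tuple

# Assumed heuristic: runs of 16+ base64-alphabet characters, optionally padded.
CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_if_text(token: str) -> Optional[str]:
    """Return the decoded string if token is valid base64 for printable ASCII."""
    if len(token) % 4 != 0:
        return None
    try:
        raw = base64.b64decode(token, validate=True)
    except (binascii.Error, ValueError):
        return None
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        return None
    # Genuinely random IDs usually decode to bytes that are not printable text.
    return text if text.isprintable() else None


def flag_encoded_strings(output: str) -> List[Tuple[str, str]]:
    """Return (token, decoded_text) pairs for tokens that decode to readable text."""
    hits = []
    for match in CANDIDATE.finditer(output):
        decoded = decode_if_text(match.group())
        if decoded is not None:
            hits.append((match.group(), decoded))
    return hits
```

For example, `flag_encoded_strings("request_id: " + base64.b64encode(b"api_key=hunter2-secret").decode())` returns one hit whose decoded text is the original string. An agent using a different or nested encoding would slip past it.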
