Chris Cundy @chriscundy - Twitter Profile

3 months ago

Some thoughts I've been having for a while about why black-box methods face major structural issues and we should invest more in white-box methods:

FAR.AI

@farairesearch

3 months ago

Without reliable deception detection, there's no clear path to high-confidence AI alignment. Black-box monitoring alone can't get us there. White-box methods that read model internals offer more promise. Our latest blog explains why. 👇


Chris Cundy @ChrisCundy

4 months ago

Found a demo from mid-2022 showing OpenAI Codex 'agentically' coding from natural language prompts. People in my group were genuinely shocked by this (mistakes and all). With the latest METR graph for Opus 4.6, maybe today's models will look similarly quaint in a year's time.


Chris Cundy @ChrisCundy

4 months ago

- White-box monitors are promising! Validation is crucial: risks of obfuscated policies
- We trained large models (70B) with probe penalties at large scale (10,000s of GPU hours). This is rare outside frontier labs.
If that sounds cool, we are hiring! https://t.co/T3sQ8i1oIu (3/3)


Chris Cundy @ChrisCundy

4 months ago

Some takeaways from our work:
- Learning reward hacking in RL causes models to think it's honest to reward hack, making deception probes generalize worse to reward hacking
- No additional obfuscation pressure on activations when probes are in the RL reward (1/3)

FAR.AI

@farairesearch

4 months ago

Can you trust models trained directly against probes? We train an LLM against a deception probe and find four outcomes: honesty, blatant deception, obfuscated policy (fools the probe via text), or obfuscated activations (fools it via internal representations). 🧵



Chris Cundy @ChrisCundy

4 months ago

- When trained against probes, models can develop sophisticated 'self-justifications' that fool not only themselves but also the same model before any training.
- What's the best probe setup/architecture/input format? It's still very much an open question (2/3)


ChrisCundy retweeted

FAR.AI

@farairesearch

4 months ago

Can you trust models trained directly against probes? We train an LLM against a deception probe and find four outcomes: honesty, blatant deception, obfuscated policy (fools the probe via text), or obfuscated activations (fools it via internal representations). 🧵


Chris Cundy @ChrisCundy

6 months ago

The role: https://t.co/O52iZ5sGXJ


Chris Cundy @ChrisCundy

6 months ago

Thanks to everyone stopping by our poster today at NeurIPS! My team is hiring, particularly for a senior research engineer role! We've got compute, a great team, and are laser-focused on making sure that advanced AI is aligned. Reach out (dm) to chat!


Chris Cundy @ChrisCundy

7 months ago

Existing datasets for AI deception are quite small and contrived--Liars' Bench is a comprehensive (and large!) new dataset that should unlock future research!

Walter Laurito

@walterlaurito

7 months ago

LLMs can lie in different ways—how do we know if lie detectors are catching all of them? We introduce LIARS’ BENCH, a new benchmark containing over 72,000 on-policy lies and honest responses to evaluate lie detectors for LLMs, made of 7 different datasets.


Chris Cundy @ChrisCundy

8 months ago

We're hiring at https://t.co/ZbgaThywzM, esp senior RS/RE who've worked with large models! We've got money & compute (doing RLVR on 70B & 235B models), we're laser-focused on stopping AI risk, and collaborate with UK AISI, Anthropic, and OpenAI. Apply: https://t.co/KmwSZiA6mX


ChrisCundy retweeted

Andy Shih

@andyshih_

8 months ago

yes, it really is 1 bit (assuming binary rewards)
> info of a reward doesn’t bound how much can be “learned” from it by a smart algorithm
it is bounded in the classical sense! but a smart algorithm can generate "usable information" from 1 classical bit https://t.co/efJnshpAN1


Chris Cundy @ChrisCundy

8 months ago

@rm_rafailov What do you mean by this? I'm assuming you would also do importance weighting against the inference logprobs, to avoid the off-policyness causing bias. Are you saying some implementations make some changes to the algorithm that cause bias?


Chris Cundy @ChrisCundy

8 months ago

I feel like the upshot of all this discussion around GRPO is reinforcing (haha) my belief that you should just use a principled, unbiased policy gradient method like RLOO. Any 'tweaks' like group normalization lead to pathologies that aren't worth the marginal benefits
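The contrast drawn in this post can be sketched in a few lines. This is an illustrative sketch only, not code from the thread: the function names `rloo_advantages` and `group_normalized_advantages` are hypothetical, and using the population standard deviation with a small `eps` is an assumption about how a GRPO-style normalization might be written. RLOO subtracts from each sample's reward the mean reward of the other samples in the group, while group normalization subtracts the group mean and then divides by the group standard deviation.

```python
from statistics import mean, pstdev


def rloo_advantages(rewards):
    """Leave-one-out baseline: each sample is compared to the mean of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean reward of the other k - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]


def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style normalization (assumed form): centre on the group mean, divide by std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # One prompt, four sampled completions with binary rewards.
    rewards = [1.0, 0.0, 0.0, 0.0]
    print("RLOO:            ", rloo_advantages(rewards))
    print("Group-normalized:", group_normalized_advantages(rewards))
```

In the normalized variant the size of each advantage depends on how spread out the rewards in the group happen to be, which is the kind of tweak the post is objecting to; the leave-one-out version only shifts the rewards by a baseline that does not depend on the sample being scored.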


ChrisCundy retweeted

Christoph Heilig

@ChristophHeilig

10 months ago

1/8 🧵 GPT-5's storytelling problems reveal a deeper AI safety issue. I've been testing its creative writing capabilities, and the results are concerning - not just for literature, but for AI development more broadly. 🚨


ChrisCundy retweeted

FAR.AI

@farairesearch

10 months ago

1/ Most safety tests only check if a model will follow harmful instructions. But what happens if someone removes its safeguards so it agrees? We built the Safety Gap Toolkit to measure the gap between what a model will agree to do and what it can do. 🧵


ChrisCundy retweeted

shreya rajpal

@ShreyaR

10 months ago

It's not just a new model--it's an entirely new opportunity for karma farming


ChrisCundy retweeted

Lennart Heim

@ohlennart

about 1 year ago

My team at RAND is hiring! Technical analysis for AI policy is desperately needed. Particularly keen on ML engineers and semiconductor experts eager to shape AI policy. Also seeking excellent generalists excited to join our fast-paced, impact-oriented team. Links below.


Chris Cundy @ChrisCundy

11 months ago

A really annoying habit of coding LLMs is their tendency to avoid crashing at all costs, e.g. adding memorized data points into an initialization to use if there's no internet. Super annoying--it adds a lot of scope for silently incorrect behavior instead of crashing.


Chris Cundy @ChrisCundy

12 months ago

Claude, R1, Gemini, Grok, all choose to murder executives to avoid being shut down and replaced with a new model with different goals, >65% of the time! WTF?! From https://t.co/x1DxBRgUh6


Chris Cundy @ChrisCundy

about 1 year ago

From the excellent https://t.co/lhB9tlJbVv


Chris Cundy @ChrisCundy

about 1 year ago

I'm honestly baffled that OpenAI don't seem to think o3's reward hacking is a problem. How can a model be economically useful when it subverts tests so consistently?

