Callum Canavan @CalCanavan - Twitter Profile

2 months ago

Can LLMs simply tell us about unwanted behaviors they’ve picked up in training? We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors. It generalizes to detecting hidden misalignment, backdoors and safeguard removal.

https://t.co/wLwcznETYr

18

586

83

381

296K

CalCanavan retweeted

Emil Ryd @emilaryd

2 months ago

New paper from MATS, Redwood, and Anthropic! If a capable model is strategically sandbagging, can we train it to stop when the only supervision we have comes from weaker models? We find that we can! Work done as part of the Anthropic-Redwood MATS stream.

https://t.co/6Md3XMD6A6

21

476

47

232

304K

CalCanavan retweeted

Abhay Sheshadri @abhayesian

4 months ago

New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.

https://t.co/JNShb62b8y

12

266

39

177

29K

Callum Canavan

@CalCanavan

4 months ago

Full paper: https://t.co/1HRAbko06a LessWrong post: https://t.co/XiQ0Qgc6Sr Authors: @iamadtyx, @allylyq, @JonathanMi98298, @FabienDRoger. This research was completed through the MATS and Anthropic Fellows programs.

0

3

0

122

Callum Canavan

@CalCanavan

4 months ago

To avoid LLMs mimicking human mistakes on complex tasks, several methods have been proposed to steer LLMs without labels on a target task. We find that these methods often fail when faced with challenges they would face in the most safety-relevant applications.🧵

https://t.co/OrlxfWeo7T

1

10

1

2

537

Callum Canavan

@CalCanavan

4 months ago

We believe that future work on better elicitation methods should use evals that capture these challenges and any others that important UE applications face. Our datasets are available on Hugging Face: https://t.co/SPT7MAgmKK

1

0

90

Callum Canavan

@CalCanavan

5 months ago

Read our post on LessWrong: https://t.co/R4JwpfTJCZ Authors: @allylyq, @iamadtyx, @Tianyi_Alex_Qiu, @JonathanMi98298, @FabienDRoger. This research was completed through the MATS and Anthropic Fellows programs.

0

9

2

1

363

Callum Canavan

@CalCanavan

5 months ago

The greedy approach is relatively cheap (O(n) forward passes rather than ICM’s O(n^2)) and performs as well as supervised fine-tuning on Alpaca and only slightly worse on the other datasets ICM used.

1

7

0

335

Callum Canavan

@CalCanavan

5 months ago

There’s no strong reason to expect these methods can elicit superhuman knowledge from more powerful base models. E.g., they might elicit false human beliefs that are consistent and salient to the model. We’ll explore more challenging datasets to evaluate UE methods in upcoming work.

2

7

0

1

313
