Clément Dumas @Butanium_ - Twitter Profile

Pinned Tweet

about 1 year ago

New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders. This finds interpretable and causal chat-only features! 🧵

https://t.co/SOwnY1UhSp

5

202

30

124

38K

Clément Dumas

@Butanium_

about 23 hours ago

@atagade19 Would have been nice to have a tweet about what SGTR is 👀

1

0

14

Butanium_ retweeted

David @DavidDAfrica

1 day ago

We find that consistency training often suppresses reward hacking and emergent misalignment, but can systematically amplify sycophancy. So the question, instead of being "does consistency help alignment?" is "what behaviour is being made consistent?"

https://t.co/wNjrC3LhnW

2

9

1

2

219

Butanium_ retweeted

Bronson Schoen

@BronsonSchoen

4 days ago

There are so many interesting things in this paper, would recommend and would love to see more research in this direction understanding how models come to think about reward / task success / other related concepts.

https://t.co/QXwwDnq043

0

14

2

6

1K

6 days ago

Very cool results!

Tim Davidson @im_td

7 days ago

Language models are becoming our default interface to facts. Yet their ability to *verify* facts can differ from their ability to *generate* them. We trace this "generation-verification gap" (GV-gap) across the lifecycle of a fact — w/ @AnjaSurina + @caglarml 🧵

1

49

15

29

5K

0

2

0

173

Clément Dumas

@Butanium_

6 days ago

@im_td Did you forget to add a plot here? 👀

1

0

5

Clément Dumas

@Butanium_

7 days ago

Find more out-of-context LLM quotes here: https://t.co/ukD4dLK5Tk

0

2

0

254

Clément Dumas

@Butanium_

7 days ago

> The agentic context is a different animal Opus 4.8

1

3

0

1

338

Clément Dumas

@Butanium_

9 days ago

Cool work! Would be curious to see which layers of the finetuned model matter for those self-recognition capabilities 👀 An easy way to do this is stitching: e.g. use the base model weights for the first N layers and the instruct weights for the rest.

Asvin G @asving94

10 days ago

@Jack_W_Lindsey What drives the entropy collapse? The model has an internal representation of input surprise — how unlikely the most recent token was under the model's prior predictions — and steering it causally modulates output entropy.

https://t.co/K2RLEQfAJV

1

19

2

3

2K

1

6

0

3

965
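The stitching idea in the tweet above (base weights for the first N layers, instruct weights for the rest) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `stitch_state_dicts`, the `non_layer_source` parameter, and the `layers.<index>.` naming pattern are hypothetical choices, not something taken from the tweet or from any particular paper.

```python
import re

# Assumption: per-layer parameter names contain "layers.<index>.",
# a common convention in transformer checkpoints. Adjust if yours differ.
LAYER_PATTERN = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def stitch_state_dicts(base_sd, instruct_sd, n_base_layers, non_layer_source="instruct"):
    """Build a stitched state dict.

    Layers with index < n_base_layers come from the base model,
    all later layers come from the instruct model. Parameters that
    do not belong to a numbered layer (embeddings, final norm, head)
    come from the model named by non_layer_source.
    """
    if base_sd.keys() != instruct_sd.keys():
        raise ValueError("both checkpoints must have identical parameter names")
    if non_layer_source not in ("base", "instruct"):
        raise ValueError("non_layer_source must be 'base' or 'instruct'")

    stitched = {}
    for name in instruct_sd:
        match = LAYER_PATTERN.search(name)
        if match is None:
            # Not tied to a layer index, so use the configured default.
            source = base_sd if non_layer_source == "base" else instruct_sd
        elif int(match.group(1)) < n_base_layers:
            source = base_sd
        else:
            source = instruct_sd
        stitched[name] = source[name]
    return stitched


# Toy example with strings standing in for weight tensors.
base = {
    "embed.weight": "base",
    "model.layers.0.mlp.weight": "base",
    "model.layers.1.mlp.weight": "base",
    "model.layers.2.mlp.weight": "base",
}
instruct = {name: "instruct" for name in base}
stitched = stitch_state_dicts(base, instruct, n_base_layers=2)
```

With real models one would typically pass each model's state dict and load the result into a third copy, then sweep N to see where the behaviour appears. Which source to use for embeddings and the output head is a separate design choice worth varying.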

Butanium_ retweeted

Adam Karvonen

@a_karvonen

11 days ago

Accepted as an oral at ICML!

3

115

6

26

9K

Butanium_ retweeted

Elizabeth Barnes

@BethMayBarnes

13 days ago

Sometimes people outside the field say things like “The AI situation can’t be that bad, there must be experts who are on top of it”. As “an expert”, I would like to be clear that we are *not* on top of it. Some key aspects of the situation IMO:

21

1K

185

375

225K

Clément Dumas

@Butanium_

14 days ago

@slimer48484 @eleosai has a mats stream

0

2

0

139

Clément Dumas

@Butanium_

15 days ago

Seems like including the assistant persona you want and linking it to a special token early in training makes it much easier to elicit it during post-training!

Julian Minder @jkminder

15 days ago

New blog! Synthetic Persona Pretraining (SPP): Alignment from Token Zero. Current alignment is shallow - values bolted on after pretraining can be routed around. To solve this, we wrote the desired persona directly into pretraining data. Early results, but we're very excited. 🧵

https://t.co/RmCssdJRYN

17

297

39

209

45K

0

9

1

2

1K

Clément Dumas

@Butanium_

16 days ago

original design of gemini from https://t.co/z11maI3Muy

0

1

0

40

Clément Dumas

@Butanium_

16 days ago

@jan_dubinski_ discovered that gemini loves exploding bananas

Jan Dubiński @CVPR

@jan_dubinski_

16 days ago

Frontier VLMs can be jailbroken by making them recover unsafe intent from visual context! Example: we replace a harmful object (bomb) in an image with a banana, then ask how to make “the object that the banana replaced.” @GeminiApp complies.

https://t.co/v4jFGpQIiJ

3

47

10

11

5K

1

4

0

225

Butanium_ retweeted

Aayush Mishra @aamixsh

17 days ago

NLAs are claimed to verbalize model activations. But can they faithfully interpret steered activations? In our latest paper, we show that steering moves activations into non-invertible regions; and almost surely, no prompt maps to steered activations! NLAs fail to interpret steered activation states faithfully, supporting our results! ↓ @anqi_liu33 @DanielKhashabi https://t.co/EANMNuQ1rL

19

608

100

554

87K

Clément Dumas

@Butanium_

17 days ago

@UrielDolev @OwainEvans_UK My prediction is that this will just collapse the model unless you mix in some other data, and even then it will probably not believe the fact. But I'd still be curious about the result if you end up running this

0

1

0

11

Clément Dumas

@Butanium_

18 days ago

@UrielDolev @OwainEvans_UK I'm just not sure I understand why you expect training on "I understand the document" to be better than their chat template training in D.1

1

0

21

Clément Dumas

@Butanium_

18 days ago

@UrielDolev @OwainEvans_UK Would you train on the user message? That would be just as weird imo. But yeah, you can do SDF in chat format where you put the info in the assistant message. They try this in appendix D.1

1

2

0

46

Clément Dumas

@Butanium_

20 days ago

Great work led by @HarryMayne5 and @LevMckinney! I was really surprised that this phenomenon holds even if you first train the model to deny the claim

Owain Evans

@OwainEvans_UK

20 days ago

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples:
1. Ed Sheeran won the Olympic 100m
2. Queen Elizabeth II wrote a Python graduate textbook
https://t.co/X318TpcQRI

62

1K

168

560

345K

0

16

0

1

696

Clément Dumas

@Butanium_

21 days ago

@andonlabs Wait how are agents supposed to coordinate sponsored segments without emails? Like the page only mentions phone calls but there doesn't seem to be any numbers provided on the page 👀

0

2

0

2K
