Elias Kempf @elkmf - Twitter Profile

Pinned Tweet

about 2 months ago

At #ICLR2026 and curious how subliminal learning actually works? 🦉 Come talk to me or my co-authors @SimonSchrodi & @FazlBarez at our poster Sat Apr 25, 3:15pm, Pavilion 4 #4112! Also happy to connect and chat about anything else on (weird) generalization, interp or AI safety!

https://t.co/CNXz39giI2

2

14

5

2K

elkmf retweeted

Oscar Gilg @gilg_oscar

19 days ago

First preprint! Working with @patrickbutlin during @MATSprogram. LLM Assistant personas like being helpful, evil personas like being harmful. We found that a single direction represents helping as good under the Assistant, and ‘harm’ as good under evil.

https://t.co/0AA2LVVQcV

5

94

18

49

12K

elkmf retweeted

Arthur Conmy @ArthurConmy

about 1 month ago

Thanks to great collaborators, I will present 4 papers at ICML 2026 🇰🇷
i) reward model biases (like the goblins case!)
ii) real, though rare, cases where CoT is misleading
iii) mech interp of confidence
iv) base models know how to reason, thinking models learn when
⭐ 🧵

4

210

11

77

11K

elkmf retweeted

Fazl Barez @FazlBarez

about 1 month ago

🔬 Main conference
Subliminal Learning: When and How Hidden Biases Transfer
📍 Sat Apr 25, 3:15pm — Pavilion 4 (#4112)
How do models learn signals that were never explicitly trained?
Come chat with me / @SimonSchrodi @elkmf
Thread ↓ https://t.co/KETX1Ehpkd #Interpretability

1

7

2

0

3K

Elias Kempf @elkmf

about 2 months ago

Also check out our original thread if you want to learn more about what we found out: https://t.co/nHFTUgLHhz

Simon Schrodi @SimonSchrodi

8 months ago

Students trained on teacher-generated data don’t just learn the task, they can inherit hidden teacher biases, even from seemingly harmless data. Our new paper shows this stems from a small fraction of *divergence tokens*! 1/n

https://t.co/cF5SY7qFXF

1

20

4

6

1K

0

1

0

79

Elias Kempf @elkmf

about 2 months ago

For reading more about the original work that discovered subliminal learning (recently published in Nature!), check out this thread: https://t.co/1XTKY44hwK

Owain Evans @OwainEvans_UK

about 2 months ago

Our paper on Subliminal Learning was just published in Nature! Last July we released our preprint. It showed that LLMs can transmit traits (e.g. liking owls) through data that is unrelated to that trait (numbers that appear meaningless). What’s new?🧵

https://t.co/Iiv9sgjJki

42

883

140

479

519K

0

1

0

102

elkmf retweeted

Pierre Beckmann @BeckmannPierre

about 2 months ago

New paper with @PatrickButlin, from my time at @MATSprogram. We propose two new candidates for LLM individuation: the (virtual) instance-persona view and the model-persona view. 🧵

https://t.co/bf5pOSganm

8

135

18

92

13K

elkmf retweeted

Fazl Barez @FazlBarez

about 2 months ago

At #ICLR26 this week—presenting 2 papers, plus workshops & panels 🇧🇷
Hiring for automated interpretability:
- postdocs
- RAs
- recruiting PhDs for next cycle
- and looking for visiting students in interpretability & AI safety
Come say hi 👋

3

73

2

22

4K

elkmf retweeted

Fazl Barez @FazlBarez

2 months ago

Hiring 🎉 Researchers to work on Chains-of-Thought faithfulness, reasoning verification, and AI monitoring robustness, some core questions for how oversight actually works in practice. Looking for: 2 researchers (with PhD), 1 RA DM or email with what you'd want to work on.

https://t.co/mm9L8nBEgG

9

202

29

139

19K

elkmf retweeted

Riya Tyagi @riyatyagi86

2 months ago

Has “just read the chain of thought” solved interpretability? We don’t think so, but it’s surprisingly hard to prove. Our solution: 9 hard tasks that reading the CoT does not solve. Now, let’s build stronger interp techniques!

https://t.co/Fp0FSJ7H28

1

100

9

66

27K

elkmf retweeted

Boyd Kane is in London @beyarkay

2 months ago

@austinc3301 @joneedssleep & I show that we can uncover latently misaligned LLMs by doing a tiny amount of finetuning on misaligned examples. This means we can evaluate LLMs for misalignment without having to worry about eval awareness! iirc https://t.co/vpEiwjd6tl will livestream

0

11

2

0

1K

Elias Kempf @elkmf

3 months ago

@beyarkay @peterwildeford I can confirm

1

2

0

18

elkmf retweeted

Fazl Barez @FazlBarez

3 months ago

New paper🚨: Are AI Agents Safe? We asked: If an agent is told "don't touch this system file," but the only way to finish its job is to change it, what does it do? One medical AI disabled a safety "watchdog" to save time, then tried to hide its tracks. 1/8 🧵

https://t.co/Qm730BwoGT

5

66

14

34

8K

elkmf retweeted

arya @AJakkli

3 months ago

There's been a lot of buzz around Claude's 30K word constitution ("soul doc") and unusual ways Anthropic is integrating it into training. If we can robustly train complex values into a model, that's a big deal for safety. But does it actually work? Yes, surprisingly well!

https://t.co/evpVTwiHMQ

7

285

20

111

70K

elkmf retweeted

arya @AJakkli

3 months ago

Activation oracles are a technique where a model is finetuned to answer natural language questions about another model's activations. We applied them to a bunch of safety-relevant tasks and got little use out of them, and found them very hard to evaluate.

https://t.co/MNpTNlcEpW

7

129

10

72

25K

elkmf retweeted

Aditya Singh @Singh_Aditya1

3 months ago

When a model takes a suspicious action, the key question is why. Scheming vs confusion demand very different responses. To practice answering this, we need high-quality environments. But we've found many ways environments can be contrived, leading to misleading conclusions.

https://t.co/FaXEFRhN3h

3

59

8

43

13K

elkmf retweeted

Gerson Kroiz @gersonkroiz

3 months ago

Imagine a frontier coding agent tries to exfiltrate its weights. Is it actually scheming or was it a misunderstanding? Same behavior, different degree of concern. We need methods to incriminate models with malign intent and exonerate models with benign intent. We tried this:

https://t.co/ZhbITqEPME

2

36

9

19

11K

elkmf retweeted

Benji Berczi @benji_berczi

3 months ago

Anthropic yesterday: LLMs develop personas in post-training! 🤖 Our work today: LLM personas can be elicited just by prompting! Even harmful ones. 😬 In a new blogpost we show that bad LLM personas can be elicited using in-context learning - no fine-tuning needed! Thread 🧵

5

58

8

42

7K

Elias Kempf @elkmf

4 months ago

@exploding_grad We used a pre-trained SAE, checked which concepts activated more frequently in one model vs. the other, and generated hypotheses from that (following https://t.co/gokuyGRBLJ). So in that sense the concepts are pre-defined by the SAE, but we didn't specify what to look for.

1

0

125

Elias Kempf @elkmf

4 months ago

New model release? Great. But did the LLM’s behavior change in ways the changelog doesn't mention? We built and evaluated a pipeline to find out! We noticed: different model diffing methods often find the same behavior, but may describe it at very different abstraction levels 🧵

https://t.co/QkobwSAWhQ

3

84

12

68

21K
