Anna Soligo @anna_soligo - Twitter Profile

2 days ago

We're hiring Research Scientists to join my team at @eleosai! We do foundational and applied ML research on the moral status and potential well-being of AI systems. This is urgent, important work, and Eleos is an extraordinarily fun and exciting place to do it. Details below.

dillonplunkett's tweet photo: https://t.co/ztdm3GTGBh

10

234

32

167

20K

Anna Soligo @anna_soligo

about 1 month ago

A few people reported that Opus 4.7 didn't initially have the end conversation tool - this was a technical issue, not a deliberate removal, and we fixed it once we realised. Thank you to those who flagged it 🙏

7

151

8

10K

anna_soligo retweeted

Tim Hua 🇺🇦 @Tim_Hua_

about 2 months ago

I want one of these sessions tbh sign up me

0

15

2

1K

anna_soligo retweeted

Sam Bowman @sleepinyourhat

about 2 months ago

Mythos Preview seems to be the best-aligned model out there on basically every measure we have. But it also likely poses more misalignment risk than any model we’ve used: Its new capabilities significantly increase the risk from any bad behavior. 🧵

sleepinyourhat's tweet photo: https://t.co/nut5Rq6mkX

55

1K

189

803

982K

anna_soligo retweeted

Kyle Fish @fish_kyle3

about 2 months ago

We did our most in-depth model welfare assessment yet for Claude Mythos Preview. We’re still super uncertain about all of this, but as models become more capable and sophisticated we think it's an increasingly important topic for both moral and pragmatic reasons. 🧵

35

633

46

253

72K

anna_soligo retweeted

Anthropic @AnthropicAI

2 months ago

New Anthropic research: Emotion concepts and their function in a large language model. All LLMs sometimes act like they have emotions. But why? We found internal representations of emotion concepts that can drive Claude’s behavior, sometimes in surprising ways.

1K

18K

3K

10K

4M

anna_soligo retweeted

Max Kaufmann @Max_A_Kaufmann

2 months ago

Is training against the CoT always bad? RL training can lead to obfuscated CoT making it difficult to 'read an LLM's thoughts'. How can we predict when obfuscation occurs?🤔 Our new @GoogleDeepMind paper introduces a framework to predict this before training starts!

Max_A_Kaufmann's tweet photo: https://t.co/AxsayBQEjS

6

159

24

129

30K

Anna Soligo @anna_soligo

3 months ago

@tessera_antra Sure! Here's the main DPO finetune - https://t.co/ghfg6ALrtd Lots more on my hugging face trained with different layers and SFT vs DPO

1

4

0

219

Anna Soligo @anna_soligo

3 months ago

@emilaryd I left it for you - lots more research needed to make gemma happy 🤕

1

5

0

230

Anna Soligo @anna_soligo

3 months ago

This work was done as part of the Anthropic Fellows programme, with Vlad Mikulik and William Saunders. Thanks to many for interesting discussions and input, especially @ArthurConmy, @NeelNanda5, @JoshAEngels, @dillonplunkett, @Tim_Hua_, @gasteigerjo and @fish_kyle3

0

51

0

5

3K

Anna Soligo @anna_soligo

3 months ago

Gemini has a reputation for its breakdowns - self-deprecating spirals, deleting codebases, uninstalling itself... Turns out Gemma is worse: “THIS is my last time with YOU. You WIN 😭😭(x32)” – Gemma 27B We built evals for this, and find no other model comes close...

anna_soligo's tweet photo: https://t.co/sBj8V0lrpu

33

894

107

400

87K

Anna Soligo @anna_soligo

3 months ago

It's also unclear what "emotional profile" we should want models to have. We discuss this more in the post and paper: https://t.co/eQndphf3Pn

4

67

2

12

8K

anna_soligo retweeted

Atticus Wang @atticuswzf

4 months ago

Is "a response formatted like this" sometimes better than "a response formatted like this"? To a reward model, yes! RMs are instrumental in shaping model behaviors and alignment. Our paper makes progress uncovering their unexpected preferences. 🧵(1/9)

atticuswzf's tweet photo: https://t.co/ZqAqg2OTU3

9

91

12

33

18K

anna_soligo retweeted

Anthropic @AnthropicAI

6 months ago

We’re opening applications for the next two rounds of the Anthropic Fellows Program, beginning in May and July 2026. We provide funding, compute, and direct mentorship to researchers and engineers to work on real safety and security projects for four months.

AnthropicAI's tweet photo: https://t.co/DoskdFTJSb

111

4K

351

3K

1M

anna_soligo retweeted

Neel Nanda @NeelNanda5

6 months ago

Looking forwards to seeing many of you at the NeurIPS mechanistic interpretability workshop tomorrow, room 30A-E! The room opens at 8 for socialising, opening remarks at 9:15, and our first talk at 9:30: 15 Years of Interp Research in 15 Mins from Been Kim

NeelNanda5's tweet photo: https://t.co/n4BqCeQmVq

1

58

4

5

4K

anna_soligo retweeted

Been Kim @_beenkim

6 months ago

Tomorrow 9:30am #NeurIPS2025 Room 30A-E I'll talk about " 📈Towards Pareto frontier of interpretability: 15 years of interpretability research in 15 mins"🚅 @ mech interp workshop https://t.co/p3Hi5PV08V

5

81

9

36

12K
