Runjin Chen @RunjinChen - Twitter Profile

Pinned Tweet

11 months ago

New Anthropic Research: Persona Vectors. We can: 1. Monitor how a model’s personality changes during a conversation or over training. 2. Mitigate undesirable persona shifts during development, or prevent them during training. 3. Identify training data that leads to these shifts.

Anthropic

@AnthropicAI

11 months ago

New Anthropic research: Persona vectors. Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors”—neural activity patterns controlling traits like evil, sycophancy, or hallucination.

226

6K

878

4K

1M

8

224

22

106

20K

RunjinChen retweeted

Jack Lindsey @Jack_W_Lindsey

3 months ago

Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)

154

7K

769

4K

980K

RunjinChen retweeted

Anthropic

@AnthropicAI

3 months ago

Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans. https://t.co/NQ7IfEtYk7

2K

44K

7K

16K

31M

RunjinChen retweeted

Victor.Kai Wang @VictorKaiWang1

4 months ago

very happy to release this parameter generation work. from P-diff (2024), RPG (2025), DnD (2025) to HY-WU, parameter generation becomes more and more practical. thx @TencentHunyuan and happy to work with @oahzxl @mmbronstein @ZiqiaoWang63428 @cindy_x_wu @YangYou1991 @VITAGroupUT

3

28

8

4K

RunjinChen retweeted

Junyuan "Jason" Hong

@hjy836

8 months ago

🧠 Conclusion: Data curation is cognitive hygiene for AI. 🩺 Regular data “health checks” are essential for keeping models reliable, safe, and aligned. The striking parallels between AI and human cognitive decline may even offer new insights into human brain health. 👩‍🔬 Work by: Shuo Xing*, Junyuan Hong*, Yifan Wang, Runjin Chen, Zhenyu Zhang, Ananth Grama, Zhengzhong Tu, and Zhangyang Wang 🌐 Website (code, data, models): https://t.co/jLvQQRtSco 📄 ArXiv: https://t.co/i7bpjhlAUw

1

10

4

3

971

RunjinChen retweeted

Ethan Perez

@EthanJPerez

10 months ago

We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵

10

257

42

88

70K

Runjin Chen @RunjinChen

11 months ago

@infoxiao @giffmana However, we didn’t compare prompt-based methods with preventative steering during training. It might be worth exploring, for example, by always prepending an “evil” system prompt during training.

0

3

0

1

75

Runjin Chen @RunjinChen

11 months ago

@infoxiao @giffmana If you're referring to test-time intervention, we actually compared two approaches in the appendix: using prompts to suppress undesirable personas versus using inference-time steering. We found that inference-time steering tends to be more effective.

0

2

0

1

90

Runjin Chen @RunjinChen

11 months ago

@infoxiao @giffmana I think persona vectors go beyond simple prompting; for instance, they can be used to monitor personality changes during training or development.

0

2

0

52

RunjinChen retweeted

Emmanuel Ameisen @mlpowered

11 months ago

In which the gang (@RunjinChen, @andyarditi, @Jack_W_Lindsey ): - identifies vectors for bad personas (evil, sycophancy, hallucinations, etc) - shows that if you inject the bad vectors in training, the model learns to not do the bad thing!! aka vaccines but for LLMs

4

92

9

25

11K

RunjinChen retweeted

Anthropic

@AnthropicAI

11 months ago

New Anthropic research: Persona vectors. Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors”—neural activity patterns controlling traits like evil, sycophancy, or hallucination.

226

6K

878

4K

1M

RunjinChen retweeted

Victor.Kai Wang @VictorKaiWang1

about 1 year ago

Customizing Your LLMs in seconds using prompts🥳! Excited to share our latest work with @HPCAILab, @VITAGroupUT, @k_schuerholt, @YangYou1991, @mmbronstein, @damianborth : Drag-and-Drop LLMs(DnD). 2 features: tuning-free, comparable or even better than full-shot tuning.(🧵1/8)

5

113

75

61

18K

Runjin Chen @RunjinChen

over 2 years ago

Our LLaGA excels in versatility, generalizability and interpretability, allowing it to perform consistently well across different datasets and tasks, extend its ability to unseen datasets or tasks, and provide explanations for graphs

0

1

0

358

Runjin Chen @RunjinChen

over 2 years ago

Thrilled to share our latest project, "LLaGA: Large Language and Graph Assistant" 🚀 Dive into our findings here: https://t.co/QhlOLHl8Mi. Plus, access our code on GitHub: https://t.co/lbRU1BBqZW

1

11

1

0

2K

Runjin Chen @RunjinChen

over 2 years ago

Key Feature: A versatile linear projector seamlessly bridges graph structures with the token space understood by Large Language Models (LLMs).

1

0

439
