Hadas Orgad @OrgadHadas - Twitter Profile

Pinned Tweet

11 months ago · Vancouver

I'm excited to share that I'll be joining @KempnerInst @Harvard as a research fellow this September!

Kempner Institute at Harvard University @KempnerInst

11 months ago

Thrilled to announce the 2025 recipients of #KempnerInstitute Research Fellowships: Elom Amemastro, Ruojin Cai, David Clark, Alexandru Damian, William Dorrell, Mark Goldstein, Richard Hakim, Hadas Orgad, Gizem Ozdil, Gabriel Poesia, & Greta Tuckute! https://t.co/mbFOsLFdNw

Hadas Orgad @OrgadHadas

12 days ago

@SohamPahari @ActInterp Thanks! We'll fix it. It's 5 pages

Hadas Orgad @OrgadHadas

12 days ago

Submit your work! The 2nd Workshop on 𝐀𝐜𝐭𝐢𝐨𝐧𝐚𝐛𝐥𝐞 𝐈𝐧𝐭𝐞𝐫𝐩𝐫𝐞𝐭𝐚𝐛𝐢𝐥𝐢𝐭𝐲 will be held at COLM 2026 in San Francisco! Submission Deadline: June 21, 2026 @ActInterp

Hadas Orgad @OrgadHadas

12 days ago

@ActInterp CFP is here >> https://t.co/EeYxNyyteN

Hadas Orgad @OrgadHadas

16 days ago · Cambridge

@eb1aexperts Thank you!

Hadas Orgad @OrgadHadas

17 days ago

Excited that our paper on Actionable Interpretability got accepted to ICML! And just in time -- we also heard that our Actionable Interpretability workshop will be happening again, in COLM! See you in Korea 🇰🇷 and SF🌉 [Arxiv paper link in the comment]

Hadas Orgad @OrgadHadas

4 months ago

Our ICML 2025 workshop on Actionable Interpretability drew massive interest. But the same questions kept coming up: What does "actionable" mean? Is it achievable? How? We're ready to answer. 🧵

Hadas Orgad @OrgadHadas

17 days ago

Arxiv >> https://t.co/3M8q8GD6Km Blog post >> https://t.co/5dbTegfTZG

OrgadHadas retweeted

Yanai Elazar @yanaiela

20 days ago

People seem to really freak out about hallucinated citations as the "bad consequence of AI slop" but (1) it's easy to detect (and fix), and (2) it's so insignificant compared to other erroneous/bad/misleading writing AI can make in scientific papers.

Hadas Orgad @OrgadHadas

20 days ago

@naghmehfarzi @TamarRottShaham @MIT_CSAIL @KempnerInst The talk was not recorded, but I gave a similar talk on the same work at Mila tea talks in February > https://t.co/64sCOEbMSv

Hadas Orgad @OrgadHadas

20 days ago

@DataSciNews @TamarRottShaham @MIT_CSAIL @KempnerInst Legs ;)

Hadas Orgad @OrgadHadas

about 1 month ago · San Francisco

??? Spotted in SF

Hadas Orgad @OrgadHadas

about 1 month ago

@DifanJ2000 Do you think that the generalization is related to your feature choice? E.g., did you test generalization on a "vanilla" layer-wise linear probe?

OrgadHadas retweeted

Stanford NLP Group

@stanfordnlp

about 1 month ago

For this week's NLP seminar, we are excited to host @OrgadHadas from Harvard University!

Date and Time: Thursday, April 30, 11:00 AM - 12:00 PM Pacific Time.
Zoom Link: https://t.co/jmz2wb8pIP

Title: Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Abstract: We use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. Our results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.

Hope to see you all there!

Hadas Orgad @OrgadHadas

about 1 month ago · Rio de Janeiro

@SimonSchrodi Let's chat today!

Hadas Orgad @OrgadHadas

about 1 month ago · Rio de Janeiro

In this work, led by Joe, we evaluate a wide range of truthfulness probes and show they *still* fail to robustly generalize. We draw lessons for how these probes should be evaluated, and identify design choices that can improve robustness.

Joe Stacey @_joestacey_

about 2 months ago

Excited to share my first postdoc paper with @SheffieldNLP! 🤩

In this work we argue that supervised uncertainty quantification (UQ) needs better evaluation.

Want to know more? Here's a little summary 🧵

Hadas Orgad @OrgadHadas

about 1 month ago

@seraphinagt Paper is here, fresh out of the Overleaf https://t.co/4JhUx3jvyl

Hadas Orgad @OrgadHadas

about 1 month ago

@FazlBarez Apparently me too! I wasn't aware. About half of these are inappropriate messages from bots 🤔

Hadas Orgad @OrgadHadas

about 2 months ago

@dmhook Is this SNIP plus the steering or just SNIP? What does, for example, 98% compliance and 100% safe coherence mean? The refusal ablation seems to work best with p =~ q. Through all of my experiments, p=q=0.01 worked best (highest harmfulness).

Hadas Orgad @OrgadHadas

about 2 months ago

New paper: LLMs encode harmful content generation in a distinct, unified mechanism

Using weight pruning, we find that harmful generation depends on a tiny subset of the weights that are shared across harm types and separate from benign capabilities. 🧵

Hadas Orgad @OrgadHadas

about 2 months ago

@ASM65617010 @davidad More distributed is one hypothesis. Within humans, I don't think we know that it's distributed. We don't have such precise interventions in neuroscience.

Hadas Orgad @OrgadHadas

about 2 months ago

@OwainEvans_UK @BetleyJan This now inspired us to run more experiments to explain the difference between models. We'll share what we find.

Hadas Orgad @OrgadHadas

about 2 months ago

@OwainEvans_UK @BetleyJan It’s a good point that the removal isn’t as complete in Qwen, which opens up some interesting questions for follow-up. For example, is EM more entangled with other basic pretrained concepts because Qwen is exposed to instruction data in pretraining?
