Thrilled to announce the 2025 recipients of #KempnerInstitute Research Fellowships: Elom Amemastro, Ruojin Cai, David Clark, Alexandru Damian, William Dorrell, Mark Goldstein, Richard Hakim, Hadas Orgad, Gizem Ozdil, Gabriel Poesia, & Greta Tuckute!
https://t.co/mbFOsLFdNw
Excited that our paper on Actionable Interpretability got accepted to ICML!
And just in time -- we also heard that our Actionable Interpretability workshop will be happening again, at COLM!
See you in Korea 🇰🇷 and SF
[arXiv paper link in the comment]
Our ICML 2025 workshop on Actionable Interpretability drew massive interest. But the same questions kept coming up: What does "actionable" mean? Is it achievable? How?
We're ready to answer.
🧵
People seem to really freak out about hallucinated citations as the "bad consequence of AI slop" but
(1) they're easy to detect (and fix), and
(2) they're so insignificant compared to the other erroneous/bad/misleading writing AI can produce in scientific papers.
@DifanJ2000 Do you think that the generalization is related to your feature choice? E.g., did you test generalization on a "vanilla" layer-wise linear probe?
For this week's NLP seminar, we are excited to host @OrgadHadas from Harvard University!
Date and Time: Thursday, April 30, 11:00 AM – 12:00 PM Pacific Time.
Zoom Link: https://t.co/jmz2wb8pIP
Title: Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Abstract: We use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. Our results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
Hope to see you all there!
In this work, led by Joe, we evaluate a wide range of truthfulness probes and show they *still* fail to robustly generalize.
We draw lessons for how these probes should be evaluated, and identify design choices that can improve robustness.
@dmhook Is this SNIP plus the steering or just SNIP? What does, for example, 98% compliance and 100% safe coherence mean?
The refusal ablation seems to work best with p =~ q. Through all of my experiments, p=q=0.01 worked best (highest harmfulness).
New paper: LLMs encode harmful content generation in a distinct, unified mechanism
Using weight pruning, we find that harmful generation depends on a tiny subset of the weights that are shared across harm types and separate from benign capabilities.
🧵
@ASM65617010 @davidad More distributed is one hypothesis. Within humans, I don't think we know that it's distributed. We don't have such precise interventions in neuroscience.
@OwainEvans_UK @BetleyJan It's a good point that the removal isn't as complete in Qwen, which opens up some interesting questions for follow-up. For example, is EM more entangled with other basic pretrained concepts because Qwen is exposed to instruction data in pretraining?