Submit your work! The 2nd Workshop on Actionable Interpretability will be held at COLM 2026 in San Francisco!
Submission Deadline: June 21, 2026
@ActInterp
Our ICML 2025 workshop on Actionable Interpretability drew massive interest. But the same questions kept coming up: What does "actionable" mean? Is it achievable? How?
We're ready to answer.
🧵
🤔 What happens when LLM agents must choose between achieving their goals and avoiding harm to humans in realistic management scenarios? Are LLMs pragmatic, or do they prefer to avoid human harm?
🚀 New paper out: ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs🚀🧵
Opportunities to join my group in fall 2026:
* PhD applications direct or via @ELLISforEurope (https://t.co/NdG57c3doS)
* Post-doc applications direct or via Azrieli @azrielifdn (https://t.co/gzyYfN0z34) or Zuckerman @stem_program (https://t.co/ZqCEbb9o4C)
Many thanks to the @ActInterp organisers for highlighting our work - and congratulations to Pedro, Alex and the other awardees! Sad not to have been there in person; it looked like a fantastic workshop. @AmsterdamNLP @EdinburghNLP
1⃣ Detecting High-Stakes Interactions with Activation Probes - https://t.co/oN0n7XTdke
2⃣ Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations - https://t.co/YMKuvBcD8z
Big congrats to Alex McKenzie, Pedro Ferreira, and their collaborators on receiving Outstanding Paper Awards!👏👏
and thanks for the fantastic oral presentations!
Check out the papers here 👇
Great to present what’s coming next for NDIF at the @actinterp workshop at #ICML2025!
If you missed us, let’s chat after the conference. Reach out here: https://t.co/NCIYb0pq5E