Our ICML 2025 workshop on Actionable Interpretability drew massive interest. But the same questions kept coming up: What does "actionable" mean? Is it achievable? How?
We're ready to answer.
🧵
What happens when LLM agents must choose between achieving their goals and avoiding harm to humans in realistic management scenarios? Are LLMs pragmatic, or do they prefer to avoid human harm?
New paper out: ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs 🧵
Opportunities to join my group in fall 2026:
* PhD applications direct or via @ELLISforEurope (https://t.co/NdG57c3doS)
* Post-doc applications direct or via Azrieli @azrielifdn (https://t.co/gzyYfN0z34) or Zuckerman @stem_program (https://t.co/ZqCEbb9o4C)
Many thanks to the @ActInterp organisers for highlighting our work - and congratulations to Pedro, Alex and the other awardees! Sad not to have been there in person, it looked like a fantastic workshop. @AmsterdamNLP @EdinburghNLP
1️⃣ Detecting High-Stakes Interactions with Activation Probes - https://t.co/oN0n7XTdke
2️⃣ Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations - https://t.co/YMKuvBcD8z
Big congrats to Alex McKenzie, Pedro Ferreira, and their collaborators on receiving Outstanding Paper Awards!
and thanks for the fantastic oral presentations!
Check out the papers here
Great to present what's coming next for NDIF at the @actinterp workshop at #ICML2025!
If you missed us, let's chat after the conference. Reach out here: https://t.co/NCIYb0pq5E