🧠🤖 The 2026 New England Mechanistic Interpretability (NEMI) Workshop will be Aug. 14 at Boston University!
Help spread the word and join the New England mech interp community! Registration and submission info in thread:👇
Takeaway: truth directions in LLMs appear robust mainly on a narrow range of purely factual tasks and specific prompt formats, but break down when assessing truth requires tracking intermediate results.
📄 Testing the Limits of Truth Directions in LLMs: https://t.co/YnMmE7zlgE
Does an LLM have an internal representation of truth? Yes... but it is more limited than previously assumed.
E.g., asking the model to count how many of 3 cities are in the same country can significantly degrade truth representations.
New preprint with @mcrovella and Evimaria Terzi🧵