Explainable AI

@XAI_Research

Moved to 🦋! Explainable/Interpretable AI researchers and enthusiasts - DM to join the XAI Slack! Twitter and Slack maintained by @NickKroeger1

Joined March 2022

772 Following

2.2K Followers

1.2K Posts

Pinned Tweet

Explainable AI @XAI_Research

about 4 years ago

There's a new XAI Slack! Connect with XAI/IML researchers and enthusiasts from around the world. Discuss interpretability methods, get help on challenging problems, and meet experts in your field! DM to join 🥳

XAI_Research retweeted

Suraj Srinivas @Suuraj

over 1 year ago

Our Theory of Interpretable AI (https://t.co/e9914pRv7y) will soon celebrate its one-year anniversary! 🥳 As we step into our second year, we’d love to hear from you! What papers would you like to see discussed in our seminar in the future? 📚🔍 @tverven @ML_Theorist

XAI_Research retweeted

Archiki Prasad

@ArchikiPrasad

over 1 year ago

🚨 Excited to share: "Learning to Generate Unit Tests for Automated Debugging" 🚨 which introduces ✨UTGen and UTDebug✨ for teaching LLMs to generate unit tests (UTs) and debugging code from generated tests. UTGen+UTDebug improve LLM-based code debugging by addressing 3 key questions: 1⃣ What are desirable properties of unit test generators? (A: high output acc and rate of uncovering errors) 2⃣ How good are models at 0-shot unit test generation (A: they are not great) ... so how do we improve LLMs' UT generation abilities? (A: bootstrapping from code-generation data via UTGen) 3⃣ How can we use potentially noisy feedback from generated tests for debugging? (A: via test-time scaling and validation + backtracking in UTDebug) 🧵👇

ArchikiPrasad's tweet photo. 🚨 Excited to share: "Learning to Generate Unit Tests for Automated Debugging" 🚨
which introduces ✨UTGen and UTDebug✨ for teaching LLMs to generate unit tests (UTs) and debugging code from generated tests.

UTGen+UTDebug improve LLM-based code debugging by addressing 3 key questions:

1⃣ What are desirable properties of unit test generators? (A: high output acc and rate of uncovering errors)

2⃣ How good are models at 0-shot unit test generation (A: they are not great) ... so how do we improve LLMs' UT generation abilities? (A: bootstrapping from code-generation data via UTGen)

3⃣ How can we use potentially noisy feedback from generated tests for debugging? (A: via test-time scaling and validation + backtracking in UTDebug)

🧵👇

169

109

48K

XAI_Research retweeted

𝙷𝚒𝚖𝚊 𝙻𝚊𝚔𝚔𝚊𝚛𝚊𝚓𝚞

@hima_lakkaraju

over 1 year ago

Super excited to share our latest preprint that unifies multiple areas within explainable AI that have been evolving somewhat independently: 1. Feature Attribution 2. Data Attribution 3. Model Component Attribution (aka Mechanistic Interpretability) https://t.co/2RfqyvkF6C [1/N] #AI #Safety #Interpretability #XAI #explainableAI

hima_lakkaraju's tweet photo. Super excited to share our latest preprint that unifies multiple areas within explainable AI that have been evolving somewhat independently:

1. Feature Attribution
2. Data Attribution
3. Model Component Attribution (aka Mechanistic Interpretability)

https://t.co/2RfqyvkF6C [1/N]

#AI #Safety #Interpretability #XAI #explainableAI

137

17K

Who to follow

ICLR

@iclr_conf

International Conference on Learning Representations #ICLR2027. SPC is @jacobandreas and GC is @BharathHarihar3

NeurIPS Conference

@NeurIPSConf

Sydney Dec 6-12, 26, Paris and Atlanta. Tweets to this account are not monitored. Please send feedback to [email protected].

𝙷𝚒𝚖𝚊 𝙻𝚊𝚔𝚔𝚊𝚛𝚊𝚓𝚞

@hima_lakkaraju

AI Professor @Harvard; Senior Staff Research Scientist @GoogleAI; @trustworthy_ml #AI #XAI; AI PhD from Stanford; Sloan/Kavli Fellow, MIT TR #35Under35

XAI_Research retweeted

Explainable AI @XAI_Research

over 1 year ago

Reminder we have moved to 🦋 Stay up to date with the latest XAI research!

905

XAI_Research retweeted

Gianluigi Lopardo @gigilopardo

over 1 year ago

Hot off the press: my PhD thesis "Foundations of machine learning interpretability" is officially published! Enjoy it at https://t.co/tRiWkTp3MV @XAI_Research @trustworthy_ml

Explainable AI @XAI_Research

over 1 year ago

Reminder we have moved to 🦋 Stay up to date with the latest XAI research!

905

XAI_Research retweeted

Explainable AI @XAI_Research

over 1 year ago

We have moved to 🦋 bluesky! Please follow over there @ XAI-Research https://t.co/PEc4tAu96R

656

XAI_Research retweeted

Chirag Agarwal

@_cagarwal

over 1 year ago

Exciting opportunity at the intersection of climate science and XAI to work on groundbreaking research in attributing extreme precipitation events with multimodal models. Check out the details and help spread the word! #ClimateAI #Postdoc #UVA #Hiring Job description: https://t.co/yIcU1vZVke Tagging @XAI_Research @trustworthy_ml @uvadatascience @ClimateChangeAI to spread the word!

Explainable AI @XAI_Research

over 1 year ago

We have moved to 🦋 bluesky! Please follow over there @ XAI-Research https://t.co/PEc4tAu96R

656

XAI_Research retweeted

How Do Vision Models Work? @ CVPR2026 (Prev: MIV) @how_cvpr2026

over 1 year ago

🔍 Curious about what's really happening inside vision models? Join us at the First Workshop on Mechanistic Interpretability for Vision (MIV) at @CVPR! 📢 Website: https://t.co/Ynpv1osH0t Meet our amazing invited speakers! #CVPR2025 #MIV25 #MechInterp #ComputerVision

how_cvpr2026's tweet photo. 🔍 Curious about what's really happening inside vision models?

Join us at the First Workshop on Mechanistic Interpretability for Vision (MIV) at @CVPR!

📢 Website: https://t.co/Ynpv1osH0t

Meet our amazing invited speakers!

#CVPR2025 #MIV25 #MechInterp #ComputerVision https://t.co/zhoiABhelt

25K

XAI_Research retweeted

Rudy Gilman

@rgilman33

over 1 year ago

The later features in DINO-v2 are more abstract and semantically meaningful than I'd expected from the training objectives. This neuron responds only to hugs. Nothing else, just hugs.

555

312

34K

XAI_Research retweeted

Apart Research

@apartresearch

over 1 year ago

This week's Apart News brings you an *exclusive* interview with interpretability insider @myra_deng of @GoodfireAI & revisits our Sparse Autoencoders Hackathon which featured a memorable talk from @GoogleDeepMind's @NeelNanda5.

apartresearch's tweet photo. This week's Apart News brings you an *exclusive* interview with interpretability insider @myra_deng of @GoodfireAI & revisits our Sparse Autoencoders Hackathon which featured a memorable talk from @GoogleDeepMind's @NeelNanda5. https://t.co/2nC572wr1o

960

XAI_Research retweeted

Giang Nguyen

@giangnguyen2412

over 1 year ago

@dylanjsam Hi Dylan, it reminds me of our paper where we also train a model (model 2) on the output of another black-box model (model 1). ultimately we find that combining the outputs of model 2 and model 1 helps improve the perf significantly. https://t.co/QY7XPpCMM0

263

XAI_Research retweeted

Tim van Erven @tverven

over 1 year ago

In case you missed it: here is the recording of @YishayMansour's talk about the ability of decision trees to approximate concepts: https://t.co/DERjJawP7R For upcoming talks, check out the seminar website: https://t.co/JR4y2i8R9S

XAI_Research retweeted

Rohan Paul

@rohanpaul_ai

over 1 year ago

LLMs are all circuits and patterns Nice Paper for a long weekend read - "A Primer on the Inner Workings of Transformer-based Language Models" 📌 Provides a concise intro focusing on the generative decoder-only architecture. 📌 Introduces the Transformer layer components, including the attention block (QK and OV circuits) and feedforward network block, and explains the residual stream perspective. It then categorizes LM interpretability approaches into two dimensions: localizing inputs or model components responsible for a prediction (behavior localization) and decoding information stored in learned representations to understand its usage across network components (information decoding). 📌 For behavior localization, the paper covers input attribution methods (gradient-based, perturbation-based, context mixing) and model component importance techniques (logit attribution, causal interventions, circuits analysis). Causal interventions involve patching activations during the forward pass to estimate component influence, while circuits analysis aims to reverse-engineer neural networks into human-understandable algorithms by uncovering subsets of model components interacting together to solve a task. 📌 Information decoding methods aim to understand what features are represented in the network. Probing trains supervised models to predict input properties from representations, while the linear representation hypothesis states that features are encoded as linear subspaces. Sparse autoencoders (SAEs) can disentangle superimposed features by learning overcomplete feature bases. Decoding in vocabulary space involves projecting intermediate representations and model weights using the unembedding matrix. 📌 Then summarizes discovered inner behaviors in Transformers, including interpretable attention patterns (positional, subword joiner, syntactic heads) and circuits (copying, induction, copy suppression, successor heads), neuron input/output behaviors (concept-specific, language-specific neurons), and the high-level structure mirroring sensory/motor neurons. Emergent multi-component behaviors are exemplified by the IOI task circuit in GPT2-Small. Insights on factuality and hallucinations highlight the competition between grounded and memorized recall mechanisms.

rohanpaul_ai's tweet photo. LLMs are all circuits and patterns

Nice Paper for a long weekend read - "A Primer on the Inner Workings of Transformer-based Language Models"

📌 Provides a concise intro focusing on the generative decoder-only architecture.

📌 Introduces the Transformer layer components, including the attention block (QK and OV circuits) and feedforward network block, and explains the residual stream perspective. It then categorizes LM interpretability approaches into two dimensions: localizing inputs or model components responsible for a prediction (behavior localization) and decoding information stored in learned representations to understand its usage across network components (information decoding).

📌 For behavior localization, the paper covers input attribution methods (gradient-based, perturbation-based, context mixing) and model component importance techniques (logit attribution, causal interventions, circuits analysis). Causal interventions involve patching activations during the forward pass to estimate component influence, while circuits analysis aims to reverse-engineer neural networks into human-understandable algorithms by uncovering subsets of model components interacting together to solve a task.

📌 Information decoding methods aim to understand what features are represented in the network. Probing trains supervised models to predict input properties from representations, while the linear representation hypothesis states that features are encoded as linear subspaces. Sparse autoencoders (SAEs) can disentangle superimposed features by learning overcomplete feature bases. Decoding in vocabulary space involves projecting intermediate representations and model weights using the unembedding matrix.

📌 Then summarizes discovered inner behaviors in Transformers, including interpretable attention patterns (positional, subword joiner, syntactic heads) and circuits (copying, induction, copy suppression, successor heads), neuron input/output behaviors (concept-specific, language-specific neurons), and the high-level structure mirroring sensory/motor neurons. Emergent multi-component behaviors are exemplified by the IOI task circuit in GPT2-Small. Insights on factuality and hallucinations highlight the competition between grounded and memorized recall mechanisms.

262

268

16K

XAI_Research retweeted

FAR.AI

@farairesearch

over 1 year ago

@_cagarwal Follow us for AI safety insights https://t.co/hfntPPogY0 And watch the full video https://t.co/CWsPFzNqzF

XAI_Research retweeted

Michal Moshkovitz @ML_Theorist

over 1 year ago

This Thursday (in 3 days), @YishayMansour will discuss interpretable approximations — learning with interpretable models. Is it the same as regular learning? Attend the lecture to find out! 💻 Website: https://t.co/MPJzLcxNfI @Suuraj @tverven

XAI_Research retweeted

Goodfire

@GoodfireAI

over 1 year ago

We're open-sourcing Sparse Autoencoders (SAEs) for Llama 3.3 70B and Llama 3.1 8B! These are, to the best of our knowledge, the first open-source SAEs for models at this scale and capability level.

GoodfireAI's tweet photo. We're open-sourcing Sparse Autoencoders (SAEs) for Llama 3.3 70B and Llama 3.1 8B! These are, to the best of our knowledge, the first open-source SAEs for models at this scale and capability level. https://t.co/99xfEiuziY

701

118

395

113K

XAI_Research retweeted

Samuel Marks @saprmarks

over 1 year ago

What can AI researchers do *today* that AI developers will find useful for ensuring the safety of future advanced AI systems? To ring in the new year, the Anthropic Alignment Science team is sharing some thoughts on research directions we think are important.

saprmarks's tweet photo. What can AI researchers do *today* that AI developers will find useful for ensuring the safety of future advanced AI systems? To ring in the new year, the Anthropic Alignment Science team is sharing some thoughts on research directions we think are important. https://t.co/vxNYnBEY3Q

323

107K

XAI_Research retweeted

Ercong Nie ✈️ ICLR 🇧🇷 @NielKlug

almost 2 years ago

ACL Time @ Bangkok 🇹🇭 Our GNNavi work will be presented in the poster session at 12:30 on Aug. 14 (Wed.). Welcome to drop by and exchange with us! Looking forward to talking with people, especially those who are interested in multilingual & low-resource & LLM interpretability🤗

Explainable AI

@XAI_Research

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users