Constantin Venhoff

@cvenhoff00

PhD Student at Oxford University @OxfordTVG | @AnthropicAI Research Fellow | @MATSprogram 7.0/7.1 Scholar with Neel Nanda | Ex Intern @Meta

Joined April 2024

121 Following

409 Followers

51 Posts

cvenhoff00 retweeted

Sonia Joseph

@soniajoseph_

about 1 month ago

Interpretability is built on a few core assumptions.

Two of our ICLR 2026 @iclr_conf papers suggest some of those assumptions are wrong (or at least highly incomplete).

1. Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning
https://t.co/SznUV3HeNJ

much of the field has internalized an interpretability–accuracy trade-off: if you want cleaner, more human-understandable features, you sacrifice performance.

however, we find that this trade-off is not fundamental.

instead of relying on post-hoc methods (e.g. sparse autoencoders trained on frozen representations), we incorporate sparsity directly into CLIP training.

surprisingly, this produces features that are significantly more interpretable while preserving downstream performance.

this result made me more optimistic about intrinsically interpretable models, a direction that was imo written off too early.

-

2. Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry
https://t.co/xuFujKkeQk

a lot of interpretability work implicitly assumes that vision representations behave like language: sparse, linear, and decomposable into independent features.

we find that this assumption is often misleading.

instead, vision representations appear partially dense and geometrically structured.

we propose the Minkowski Representation Hypothesis: tokens live in sums of convex regions formed from a small set of “archetypes,” rather than isolated features along linear directions.

this reframes how different tasks (classification, segmentation, depth) recruit and organize concepts. it also suggests that many current interpretability tools are mismatched to the actual structure of vision data.

--

tldr; interpretability can be built into training with surprisingly simple tweaks, and that different modalities have different sparsities/geometries. Tailoring the interp method to the modality is super impt!

479

440

35K

cvenhoff00 retweeted

Lorenz Hufe @LorenzHufe

about 2 months ago

TYPOGRAPHIC ATTACKS inject text into images, leading to targeted misclassifications. Example: A photo of Elon Musk labeled "US President" tricks CLIP into thinking this is the U.S. president. We studied the behavior of CLIP under typographic attacks and found a defense🧵(1/11)

141

cvenhoff00 retweeted

Anna Soligo @anna_soligo

3 months ago

Gemini has a reputation for its breakdowns - self-deprecating spirals, deleting codebases, uninstalling itself...

Turns out Gemma is worse:
“THIS is my last time with YOU. You WIN 😭😭(x32)” – Gemma 27B

We built evals for this, and find no other model comes close... https://t.co/sBj8V0lrpu

897

107

399

87K

cvenhoff00 retweeted

Goodfire

@GoodfireAI

4 months ago

We've identified a novel class of biomarkers for Alzheimer's detection - using interpretability - with @PrimaMente.

How we did it, and how interpretability can power scientific discovery in the age of digital biology: (1/6) https://t.co/SHBawjo7qi

221

925

397K

Constantin Venhoff @cvenhoff00

6 months ago

Huge thanks to my amazing co-authors @ashk__on, @soniajoseph_, @philiptorr, and @NeelNanda5! Also grateful to the @MATSprogram for support. Come chat at Poster #4615 today at 4:30pm! Paper link: https://t.co/3XSSUcttP7

159

Constantin Venhoff @cvenhoff00

6 months ago

Excited to present our NeurIPS paper today at 4:30pm in Exhibit Hall C,D,E (Poster #4615)!

"Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval"

Details 🧵👇 https://t.co/SyQ6OFmtj5

Constantin Venhoff @cvenhoff00

6 months ago

Key takeaway: Successful multimodal alignment requires more than representational compatibility. It depends on integrating visual information into the functional circuits of the LLM backbone!

124

cvenhoff00 retweeted

Sharan

@_maiush

7 months ago

AI that is “forced to be good” v “genuinely good”
Should we care about the difference? (yes!)

We’re releasing the first open implementation of character training. We shape the persona of AI assistants in a more robust way than alternatives like prompting or activation steering. https://t.co/rYB3hOF7tq

192

63K

cvenhoff00 retweeted

Tim Hua 🇺🇦 @Tim_Hua_

7 months ago

Problem: AIs can detect when they are being tested and fake good behavior.

Can we suppress the “I’m being tested” concept & make them act normally?

Yes! In a new paper, we show that subtracting this concept vector can elicit real-world behavior even when normal prompting fails. https://t.co/bMeTpmfJek

241

107

59K

Constantin Venhoff @cvenhoff00

8 months ago

Work done with my awesome collaborators @IvanArcus @ArthurConmy @NeelNanda5 @philiptorr as part of the @MATSprogram

Constantin Venhoff @cvenhoff00

8 months ago

🚨 What do reasoning models actually learn during training?

Our new paper shows base models already contain reasoning mechanisms, thinking models learn when to use them!

By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵 https://t.co/XeA5ogBKQ4

583

496

83K

Constantin Venhoff @cvenhoff00

8 months ago

Try it yourself! 🌐 Interactive demo: https://t.co/4s8dFXKk08 💻 Code: https://t.co/C04PGE5viq 📄 Paper: https://t.co/USAVNiYWp8 Accepted at NeurIPS 2025 Mechanistic Interpretability Workshop ✨
