Excited to share that my paper, "Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents", is now available on @arxiv (link in the comments). The paper was accepted to the inaugural RLEval Workshop at @CAISconf - a workshop focused on methods and reinforcement learning environments for evaluating AI agents - and was selected for an invited talk based on reviewer ratings.
LLM evaluation is difficult. Models that look equally capable on traditional benchmarks can behave very differently when the underlying facts change.
Most current benchmarks focus on whether a model's output looks correct. But in real-world settings, especially healthcare, it often matters just as much whether the model appropriately updates its recommendations when those facts change.
In this work, I introduce the Causal Sensitivity Score (CSS), a pre-registered counterfactual evaluation framework designed to measure exactly that:
"When clinically important patient facts change, does the model appropriately change its recommendations?"
Across six frontier models and several hundred oncology tumor board cases, I found that models with very similar performance on standard coverage-based metrics often behave dramatically differently under counterfactual interventions. In fact, model rankings were nearly reversed depending on whether you measured coverage or responsiveness.
The paper also shows that these findings transfer to tool-using agents, revealing failure modes that remain hidden under conventional evaluation approaches. The broader takeaway is that producing the right answer and responding appropriately to new information are distinct capabilities - and future evaluation frameworks should measure both.
This work is also closely aligned with the mission of our DataLab at @withprotegeai: building rigorous datasets, benchmarks, and evaluation frameworks that better reflect how AI systems perform in the real world. As AI moves into increasingly complex, high-stakes domains, measuring what models know is important - but we also need to measure how they respond when reality changes.
I'm grateful to my colleagues and collaborators, especially @engyziedan and Wes Hopkins, as well as the medical professionals who helped validate the results.
We're excited to see the new paper from @DataLabResearch's @TurkMatthew, “Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents,” accepted to the inaugural RLEval Workshop at ACM CAIS 2026 and selected for an invited talk.
🔍️ The guiding research question: "When clinically important patient facts change, does the model appropriately change its recommendations?"
👉️ If you change a key clinical detail that changes the case context, the model should change its recommendation.
But if a clinically meaningful fact changes and the model doesn't update its recommendation, that counterfactual test exposes the gap. The model wasn't truly reasoning through the specific case in front of it — it was relying on surface patterns or the general shape of the case rather than the patient's actual circumstances.
🚨 What Matt found: "Producing the right answer and responding appropriately to new information are distinct capabilities - and future evaluation frameworks should measure both."
‼️ Why this matters: This connects directly to one of the hardest problems in benchmark design. It's not enough to measure whether a model arrives at the right answer. We also need to know whether it would arrive at a different answer when the underlying facts change.
This is the kind of benchmark-design question that @engyziedan's @DataLabResearch focuses on. Better benchmarks require more than held-out datasets.
They require realistic, ground-truth evaluation frameworks that can distinguish between a model that genuinely updates its reasoning and one that reaches the correct answer for the wrong reasons.
Congratulations to @TurkMatthew, and we're excited about the cutting-edge benchmark and evaluation work happening across the Protege DataLab.
Today, I’m excited to announce our newest vertical: Spatial & Physical Intelligence.
We have been investing in an entirely new category focused on supporting world models and robotics labs. From working closely with labs in both domains, we've noticed a consistent pattern: they end up needing the same underlying training data.
Our thesis is anchored in four fundamental data types that matter at this stage of development:
1) Ego- and Exo-centric Video: First- and third-person footage of humans performing real-world tasks, plus vehicle-based captures of dynamic environments. Depth data, LiDAR, hand tracking, descriptive annotations, overlapping camera views, and time-synced data all increase spatial understanding.
2) Motion Capture: Mapping the "physics of the mundane," moving beyond entertainment and gaming to capture tactile object manipulation, locomotion, and human-to-human interactions.
3) Video Gameplay Data: Studio-grade simulated environments paired with precise player telemetry.
4) 3D Assets: 3D scans of objects & scenes, including raw input files before construction.
The Core Challenge: data is siloed.
High-quality training data is trapped in fragmented datasets instead of being accessible through a unified data layer. Robotics and world models are developing more quickly than ever – and the data layer should move just as fast. We’re here to help build high-quality, content-rich datasets that form the data supply chain for Spatial & Physical Intelligence.
This is a key area where we're scaling at Protege – we’re actively looking for builders and feedback.
Come build with us and tell us what we’re missing!
(see 🧵 below for more details)