matt turk

@TurkMatthew

ML researcher @withprotegeai prev: ML @cleanlabAI @goodwatercap, Quant @coinbase & @goldmansachs, EECS @ucberkeley

New York, NY

Joined March 2012

2.1K Following

707 Followers

1.4K Posts

Pinned Tweet

matt turk

@TurkMatthew

3 months ago

Excited to share that I’ve joined @withprotegeai as a Senior Machine Learning Researcher on the DataLab team. After 2 years at @CleanlabAI, working with the team was an incredibly formative experience. I’m deeply grateful for the chance I had to learn from them to work on data-centric AI with such thoughtful researchers and builders, and to contribute during a period that ultimately led to Cleanlab being acquired into @joinHandshake AI. I learned a tremendous amount about the importance of data quality, evaluation, and trustworthiness in modern AI systems to make them more accurate and reliable. Throughout my time there, my conviction only grew that the next major advances in AI will come not just from better models or more compute, but from better data. At DataLab, our goal is to treat the data layer of AI with the same scientific rigor that model labs apply to algorithms by building a dedicated research institution for AI data: designing high-fidelity datasets and multimodal benchmarks grounded in real-world scenarios, working closely with frontier labs on their hardest data challenges, and developing standardized ways, including “FICO scores for AI data”, to measure dataset quality, contamination, and benchmark reliability. Another important piece of this work is understanding how different kinds of data support different parts of the AI training stack. Reinforcement learning (RL) environments are a powerful form of training data that generate structured training tuples like (state, action, reward, next state) and are extremely useful for post-training optimization when the world can be simulated. But many of the highest-value domains for AI, including healthcare, enterprise workflows, and complex multimodal reasoning, cannot be faithfully simulated. Advancing models in these areas requires real-world datasets, carefully designed benchmarks, and domain-specific data for pre-training and mid-training adaptation. The idea behind DataLab is simple but important: every major leap in AI capability has historically followed a breakthrough in data (from ImageNet to large-scale web corpora). As models and compute continue to advance rapidly, closing the data gap, the gap between the data that AI systems need and the data that actually exists in usable form, may be one of the most important challenges for the field. Here is more info on some of the work the team has done so far: https://t.co/fzgkgOPL7b

Bobby Samuels

@BobbySamuels

3 months ago

https://t.co/0muCOoj0BN

527

796

302K

923

TurkMatthew retweeted

Protege

@withprotegeai

6 days ago

We're excited to see @DataLabResearch @TurkMatthew’s new paper, “Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents,” accepted to the inaugural RLEval Workshop at ACM CAIS 2026 and selected for an invited talk. 🔍️ The guiding research question: "When clinically important patient facts change, does the model appropriately change its recommendations?" 👉️ If you change a key clinical detail that changes the case context, the model should change its recommendation. But if a clinically meaningful fact changes and the model doesn't update its recommendation, that counterfactual test exposes the gap. The model wasn't truly reasoning through the specific case in front of it — it was relying on surface patterns or the general shape of the case rather than the patient's actual circumstances. 🚨 What Matt found: "Producing the right answer and responding appropriately to new information are distinct capabilities - and future evaluation frameworks need both." ‼️ Why this matters: This connects directly to one of the hardest problems in benchmark design. It's not enough to measure whether a model arrives at the right answer. We also need to know whether it would arrive at a different answer when the underlying facts change. This is the kind of benchmark-design question that @engyziedan's @DataLabResearch focus on. Better benchmarks require more than held-out datasets. They require realistic, ground-truth evaluation frameworks that can distinguish between a model that genuinely updates its reasoning and one that reaches the correct answer for the wrong reasons. Congratulations to @TurkMatthew, and we're excited about the cutting-edge benchmark and evaluation work happening across the Protege DataLab.

145

matt turk

@TurkMatthew

6 days ago

@arxiv @CAISconf You can see the workshop and accepted papers here: https://t.co/FGPo0AVP4R

matt turk

@TurkMatthew

6 days ago

Excited to share that my paper, "Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents", is now available on @arxiv (link in the comments). The paper was accepted to the inaugural RLEval Workshop at @CAISconf- a workshop focused on methods and reinforcement learning environments for evaluating AI agents - and was selected for an invited talk based on reviewer ratings. LLM evaluation is difficult. Models that look equally capable on traditional benchmarks can behave very differently when the underlying facts change. Most current benchmarks focus on whether a model's output looks correct. But in real-world settings, especially in healthcare, what often also matters is whether the model appropriately updates its recommendations when the underlying facts change. In this work, I introduce the Causal Sensitivity Score (CSS), a pre-registered counterfactual evaluation framework designed to measure exactly that: "When clinically important patient facts change, does the model appropriately change its recommendations?" Across six frontier models and several hundred oncology tumor board cases, I found that models with very similar performance on standard coverage-based metrics often behave dramatically differently under counterfactual interventions. In fact, model rankings were nearly reversed depending on whether you measured coverage or responsiveness. The paper also shows that these findings transfer to tool-using agents, revealing failure modes that remain hidden under conventional evaluation approaches. The broader takeaway is that producing the right answer and responding appropriately to new information are distinct capabilities - and future evaluation frameworks should measure both. This work is also closely aligned with the mission of our DataLab at @withprotegeai: building rigorous datasets, benchmarks, and evaluation frameworks that better reflect how AI systems perform in the real world. As AI moves into increasingly complex, high-stakes domains, measuring what models know is important - but we also need to measure how they respond when reality changes. I'm grateful to my colleagues and collaborators, especially @engyziedan and Wes Hopkins, as well as the medical professionals who helped validate the results.

321

Who to follow

Jeff Clune

@jeffclune

Co-founder, Recursive. Professor, CS, U. British Columbia. CIFAR AI Chair, Vector Institute. | ML, AI, deep RL, deep learning, AI-Generating Algorithms (AI-GAs)

Airbnb Engineering

@AirbnbEng

Creative engineers building a world where you can belong anywhere.

Lightning AI ⚡️

@LightningAI

The AI omnicloud PyTorch developers love. Made the first AI Studio & PyTorch Lightning. Get help: https://t.co/a69wnEBpKH

matt turk

@TurkMatthew

6 days ago

@arxiv @CAISconf Link to my paper: https://t.co/kVpnTdN5y0

matt turk

@TurkMatthew

7 days ago

@HarryStebbings @nico_laqua Grindslop

143

TurkMatthew retweeted

Zachary Lipton

@zacharylipton

11 days ago

deep learning research was the original vibe math

281

47K

TurkMatthew retweeted

Mayor Zohran Kwame Mamdani

@NYCMayor

12 days ago

.@NYCSanitation I'd like to report a sweep

289K

31K

11M

14 days ago

15 days ago

Are people on LinkedIn aware that one day we are all going to die

150

11K

379

466K

TurkMatthew retweeted

Trace Cohen

@Trace_Cohen

15 days ago

NYC summers hate your weekends. I analyzed 3 years of Central Park rain data to see if the feeling was real. It was. 27 of 38 summer weekends had measurable rain. That’s 71%. Friday was the rainiest day of the week, with rain on 43.6% of summer Fridays. Basically worse than a coin flip for your evening plans. Sunday had the highest rainfall volume of any day, averaging 0.17 inches. It may not rain every Sunday, but when it does, it commits. Thursday was objectively the best day to be outside in NYC: lowest precipitation, clearest skies, least weekend-related misery. The wildest stat: 70.8% of rainy Sundays were preceded by a gross Friday or Saturday. The weekend basically telegraphs its own downfall. I built a full dashboard using NWS Central Park data with every weekend tracked and every raindrop counted:

Trace_Cohen's tweet photo. NYC summers hate your weekends.

I analyzed 3 years of Central Park rain data to see if the feeling was real.

It was.

27 of 38 summer weekends had measurable rain.

That’s 71%.

Friday was the rainiest day of the week, with rain on 43.6% of summer Fridays. Basically worse than a coin flip for your evening plans.

Sunday had the highest rainfall volume of any day, averaging 0.17 inches. It may not rain every Sunday, but when it does, it commits.

Thursday was objectively the best day to be outside in NYC: lowest precipitation, clearest skies, least weekend-related misery.

The wildest stat:

70.8% of rainy Sundays were preceded by a gross Friday or Saturday.

The weekend basically telegraphs its own downfall.

I built a full dashboard using NWS Central Park data with every weekend tracked and every raindrop counted:

483

matt turk

@TurkMatthew

15 days ago

@dair_ai Way too noisy of a process to forecast

matt turk

@TurkMatthew

17 days ago

Top-tier read calling out performative grind culture. “Great work has always demanded sacrifice and often brutal hours and I'm not disputing this. What I'm disputing is the direction. These people, many of them friends, have more economic freedom than any class in history and they've chosen, freely, to simulate the conditions of a Chinese assembly line and call it virtue.”

Will Manidis

@WillManidis

18 days ago

https://t.co/aHOHtb1rcW

178

333

145

TurkMatthew retweeted

OpenAI

@OpenAI

18 days ago

Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in 1946. For nearly 80 years, mathematicians believed the best possible solutions looked roughly like square grids. An OpenAI model has now disproved that belief, discovering an entirely new family of constructions that performs better. This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics.

27K

14M

matt turk

@TurkMatthew

18 days ago

@4JimLee @bevedoni @karpathy Karpathy will not lead the business side he will do what he does best and lead R&D (specifically accelerating pre-training research)

matt turk

@TurkMatthew

19 days ago

@vijaythirumalai This is the most unsurprising choice

TurkMatthew retweeted

Bleacher Report

@BleacherReport

19 days ago

STILL NOT OVER THIS SHOT 😭 WHO WAS WEMBY FEELING LIKE??

177K

14K

TurkMatthew retweeted

Goldie

@dezgoldie

19 days ago

okc/spurs could be the best playoff series since 2016 finals

296

26K

TurkMatthew retweeted

Everything Price Sufferer (but especially eggs) @agraybee

20 days ago

I didn't know Helen of Troy could generate so much conflict.

82K

10K

matt turk

@TurkMatthew

23 days ago

@phoebeyao Completely agree!

144

TurkMatthew retweeted

ak0

@annanay

24 days ago

UPDATE turns out that yes, nobody has heard of tmux.

100

307

319K

TurkMatthew retweeted

Little Kevin 5, cpa

@pootsobotka

25 days ago

Hey man sorry I’m busy this weekend i can’t chill - Friday night i gotta wait in the pizza line in the west village from 6-9 and then wait on line to get into the Spaniard from 9-12 - Saturday morning gotta wait on the blank st coffee line from 8-12 -Saturday afternoon New York or nowhere line from 12-4 - Saturday night i have the frozen yogurt line from 4-12 If you wanna hang Sunday morning I’m waiting on a bagel line from 9-1 let me know

100

199

190K

matt turk

@TurkMatthew

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users