Mathew Koretsky @mkoretsky1 - Twitter Profile

The Lancet Digital Health @LancetDigitalH

4 months ago

NEW Resource: CARDBiomedBench: a benchmark for evaluating the performance of #LLMs in biomedical research. Read it here: https://t.co/Sjee2ZAB69

mkoretsky1 retweeted

Faraz Faghri @FarazFaghri

5 months ago

🚀 We just launched DTAgent-AD, a scientific reasoning agent built to accelerate biomedical research and specialized Alzheimer’s & neurodegenerative disease research. DTAgent-AD is a domain-tuned agent trained on trace data from real biomedical tasks, not a general chatbot.

Mathew Koretsky @mkoretsky1

5 months ago

Stay tuned to see how the biomedical agents we’re building at @DataTecnica stack up against these models!

Mathew Koretsky @mkoretsky1

5 months ago

Despite boasting impressive performance across a range of categories, the latest frontier LLMs (Gemini 3 Pro, Claude Opus 4.5, and GPT-5.2) still struggle to balance accuracy and safety on CARDBiomedBench, our biomedical QA benchmark 👀

Faraz Faghri @FarazFaghri

5 months ago

Frontier models are moving fast, but are they getting better at biomedical research? We just ran a fresh benchmark update using CARDBiomedBench, our evaluation suite for genetics, disease associations, and drug discovery QA. Instead of looking only at “did it answer?”

Mathew Koretsky @mkoretsky1

7 months ago

Stay tuned to see how our Agents stack up against the latest base LLMs on biomedical question-answering tasks!

Mathew Koretsky @mkoretsky1

7 months ago

👀 Our new Knowledge Agents make https://t.co/fknNWdWJ4k more powerful than ever. 🧬 Get insights backed by the journals and databases that biomedical researchers use daily. 📄 Read the blog for more info: https://t.co/bSN57vdZ97

Faraz Faghri @FarazFaghri

7 months ago

🧬 New at https://t.co/O7B5jBk1Jm: smarter Biomedical Knowledge Agents + Knowledge Mode. We just shipped the latest update to https://t.co/O7B5jBk1Jm, the world’s first platform for benchmarking LLMs on biomedical research tasks.

Mathew Koretsky @mkoretsky1

8 months ago

We continue to evaluate these new models on our benchmark, CARDBiomedBench. Despite significant progress, there are still no models that balance response accuracy and safety on biomedical questions 👀

BiomedArena.AI @BiomedArena

8 months ago

We evaluated 12 top models using CARDBiomedBench, a biomedical benchmark with 68K+ expert QA pairs across GWAS, SMR, drug discovery & more. 🧠 No model aced both safety and accuracy. 🤖 GPT-4o = bold but risky 🤔 Claude-4.0 = cautious but wrong More is coming soon.

Mathew Koretsky @mkoretsky1

8 months ago

Check out the latest models in @BiomedArena for all of your biomedical research questions!

BiomedArena.AI @BiomedArena

8 months ago

🚀 New LLMs now LIVE on BiomedArena 🧬 Test GPT-5, Claude-4.1, Gemini 2.5 and more, on your toughest biomedical queries. All free. All benchmarked. https://t.co/kzNqodlHuk 📉 Can AI be accurate and safe in biomedicine? See the surprising results 👇🧵

Mathew Koretsky @mkoretsky1

10 months ago

Super proud of all the hard work from our team including @tanaynayak99, @owenbianchi_, Shayan Shahand, @DanielKhashabi, and @FarazFaghri!!!

Mathew Koretsky @mkoretsky1

10 months ago

🚨BiomedArena is live🚨 In partnership with @lmarena_ai, our team at @DataTecnica has released a feedback-rich platform to evaluate LLM performance on real-world biomedical questions. ⚔️Access the arena: https://t.co/LjPiINtIfi 📄Read the blog post: https://t.co/hcpwv6Do8Z

Arena.ai @arena

10 months ago

🧬 BiomedArena is here! We’re honored to partner with @DataTecnica and @NIH CARD, who developed BiomedArena to evaluate LLMs for biomedical discovery, and to help expand this domain-specific track in community-driven evaluations. 🧪 Biomedical science is complex, high-stakes, and constantly evolving. 📊 CARDBiomedBench & tabular reasoning tests show that no current model can reliably meet the reasoning & domain-specific knowledge demands of biomedical researchers. Learn more about BiomedArena in thread 👇 🧵 #AI #LLMs #BiomedicalAI #AIEvaluation #OpenScience #LMArena #BiomedArena #NIH

mkoretsky1 retweeted

Daniel Khashabi 🕊️ @DanielKhashabi

about 1 year ago

🚨New LLM benchmark🚨 We're releasing BiomedSQL🔬 for tabular reasoning over large-scale biomedical databases. This includes questions based on implicit scientific conventions—like statistical thresholds, effect direction, and drug approval status. 📄 Preprint: https://t.co/QlF4kPYfnp 📊 Dataset: https://t.co/j1jC4aqwEp Led by Matt Koretsky at @DataTecnica

Mathew Koretsky @mkoretsky1

about 1 year ago

📄Read the preprint: https://t.co/hl9BPIxT5a 📊Dataset: https://t.co/y6MPciJSwb 💻Code: https://t.co/Qlbl33E0RL Thanks to my teammates at NIH/CARD and @DataTecnica including Maya Willey, Adi Asija, @owenbianchi, Chelsea Alvarado, @mike_nalls, @DanielKhashabi, and @FarazFaghri

Mathew Koretsky @mkoretsky1

about 1 year ago

Can LLMs perform reliably as biomedical data analysts? TL;DR: We created the first benchmark designed to challenge LLMs' ability to apply scientific reasoning in text-to-SQL generation over biomedical databases, revealing a 30-40% gap between SOTA models and expert performance

Mathew Koretsky @mkoretsky1

about 1 year ago

We believe this benchmark is a critical step towards building trustworthy text-to-SQL systems that can increase efficiency of lookups for PIs and SMEs, democratize access to biomedical knowledge, and accelerate discovery

mkoretsky1 retweeted

Daniel Khashabi 🕊️ @DanielKhashabi

about 1 year ago

Long-form inputs (e.g., needle-in-haystack setups) are a crucial aspect of high-impact LLM applications. While previous studies have flagged issues like positional bias and distracting documents, they've missed a crucial element: the size of the gold/relevant context. In our latest study, we look into how the size of these gold contexts impacts LLM performance in needle-in-a-haystack scenarios. The verdict? **Smaller gold contexts severely amplify positional bias.** Why should you care? If you're developing LLMs to sift through a large number of documents of varying sizes, beware: a smaller gold document among larger distractions can throw your pipeline off course. Basically, practitioners need to keep an eye not only on the position of the likely gold document but also on its size relative to others. 📄Read the preprint: https://t.co/ZWQJ9IiMve Work led by Owen Bianchi and other collaborators at @DataTecnica

mkoretsky1 retweeted

DataTecnica @DataTecnica

over 1 year ago

Great work from many of our teammates! Let's accelerate data harmonization!

mkoretsky1 retweeted

bioRxiv Genomics @biorxiv_genomic

about 2 years ago

GenoTools: An Open-Source Python Package for Efficient Genotype Data Quality Control and Analysis https://t.co/AkpxXvHFFn #biorxiv_genomic

mkoretsky1 retweeted

Brain @Brain1878

almost 3 years ago

Koretsky et al. use genome-wide data to cluster patients based on genetic status across risk variants for five neurodegenerative disorders. The results suggest that neurodegenerative diseases have more overlapping genetic aetiology than previously assumed. https://t.co/TD3fv3mdFx
