Robert Davies

@robbie_stats

Lead Scientist, Genomics plc, working out of Ottawa 🇨🇦. Formerly an Associate Professor in the Department of Statistics, Oxford.

Joined July 2016

429 Following

448 Followers

428 Posts

robbie_stats retweeted

Jorge Bravo Abad

@bravo_abad

13 days ago

No scaling laws for single-cell foundation models: when bigger atlases stop teaching the model anything

In language and vision, the recipe has been simple: more data, bigger models, better performance. Single-cell biology borrowed that playbook. Foundation models for transcriptomics jumped from 1 million cells to atlases of over 100 million, on the assumption that scale would unlock the same gains. Alan DenAdel and coauthors put that assumption to the test, and the result is sobering.

Working from a 22.2-million-cell corpus, they pretrained 400 models across five architectures (from PCA and a variational autoencoder up to the Geneformer transformer) and ran 6,400 evaluation experiments. They varied not just dataset size (1% to 75%) but also diversity, using cell-type re-weighting and geometric sketching to deliberately enrich rare cell types and transcriptional states.

The finding: performance saturates almost immediately. On cell-type classification, batch integration, and perturbation prediction, most models hit their ceiling at roughly 1% of the corpus, about 200,000 cells. Beyond that, adding millions more cells changed essentially nothing. More diversity didn't help. Even spiking in genome-scale Perturb-seq data, to give the models perturbed phenotypes rather than just healthy ones, failed to move the needle. Larger models did score better overall, but they too plateaued early on data.

Two points stood out. Simple baselines (PCA, logistic regression) often matched or beat the transformers. And the strongest model, SCimilarity, won not because of size but because its contrastive training objective is aligned with the downstream task. For single-cell data, what you train on and how you frame the objective matters far more than how much you collect.

This reframes a quiet but expensive habit. In drug discovery, biotech, and any pipeline leaning on cell atlases, the instinct to keep scaling pretraining corpora may be burning compute for no return. The real leverage sits elsewhere: curating high-quality, task-relevant data and matching the training objective to the actual question you're trying to answer.

Paper: DenAdel et al., journal license | https://t.co/X7GxoxF5U5

382

277

97K

robbie_stats retweeted

Marios Georgakis

@MariosGeorgakis

19 days ago

This new paper is probably the most prominent example to date of how linking genetic variation to cell-level, rather than tissue-level, gene expression can transform the interpretation of GWAS signals. The study generates a single-cell eQTL resource from intestinal biopsies and blood samples from 421 individuals, including 125 with inflammatory bowel disease (IBD)👇

170

100

14K

robbie_stats retweeted

Omer Weissbrod @oweissb

20 days ago

Last week we gathered 80 colleagues from the biotech industry to discuss how they're integrating AI agents into data teams. We wrote seven quick takeaways on how agents are already transforming biotech >> https://t.co/BwkULvOQaK

robbie_stats retweeted

gokcen @gokcen

27 days ago

Excited to share Decima, out now in @naturemethods! 🎉 Existing seq-to-function models predict bulk expression. Decima goes further: it predicts gene expression in specific cell types and disease states from DNA sequence alone — trained on 22M+ single cells. Applications: cis-regulatory mechanisms, cell-type-resolved variant effect prediction, and designing context-specific regulatory DNA

145

14K

robbie_stats retweeted

Andy Dahl @andywdahl

about 1 month ago

Analyzing phenotypes on the wrong scale reduces power and creates spurious genetic interactions. SIQReg, a new method from my postdoc Zhenhong Huang, learns an optimal phenotype scale to fix this: https://t.co/u8QzlAEbNv

robbie_stats retweeted

Andrea Ganna @andganna

about 2 months ago

New preprint on a controversial topic: Through which socioeconomic mechanisms do individuals with higher genetic predisposition for educational attainment end up with higher income over their working life? We study this using uniquely rich Finnish data.

17K

robbie_stats retweeted

Simon Myers @simon_r_myers

2 months ago

New preprint led by Hrushikesh Loya, Leo Speidel, and me, where we introduce GhostBuster! https://t.co/MgfPEVeHQJ Our method uses genealogies to find "ghost" ancestries hidden within DNA. We find that both modern humans and Neanderthals formed as mixtures of two ancient hominin groups

robbie_stats retweeted

Alex Strudwick Young

@AlexTISYoung

2 months ago

My full interview on TBPN where I talk about:
- polygenic prediction in IVF
- regulation of genetic testing in IVF
- the utility of biobanks
- my own experience as a cancer patient
- the future of reproductive medicine

A much better discussion than Kian Sadeghi on Tucker...

123

17K

robbie_stats retweeted

Anshul Kundaje @anshulkundaje

2 months ago

I usually dump on EVO2 quite a bit (mostly since it's been overhyped to death beyond its actual capabilities), but this is nice work showing that a better embedding approach with supervised probing can deliver strong performance on coding variant effect prediction. 1/

231

153

28K

robbie_stats retweeted

William J. Greenleaf @WJGreenleaf

2 months ago

Our Human Multiomic Development Atlas paper is out in Nature today! A heart-felt "thank you" to all co-authors for their tireless work on this complex yet exciting project! Congrats all! https://t.co/iUiZz00KOt

354

142

23K

robbie_stats retweeted

Adam Auton @adamauton

3 months ago

Delighted to share our latest research from the @23andMeResearch Team, just published in @Nature! We looked at data from >27,000 participants to uncover how human genetics influences weight loss efficacy and side effects of GLP-1 medications like semaglutide. A thread 🧵👇

235

117

60K

robbie_stats retweeted

Doug Speed @dougthespeed

3 months ago

StratGWAS is our new tool for more efficient GWAS of heterogeneous diseases. Instead of treating cases equally, it weights them based on relevant phenotypic information such as medication use, age of onset or recruitment strategy. Full details on medRxiv: https://t.co/J6ANjdmztL

robbie_stats retweeted

Sasha Gusev @SashaGusevPosts

3 months ago

Monthly median Received to Accepted time (days) at Nature Genetics

234

150K

robbie_stats retweeted

Shicheng Guo @ShihchengGuo

3 months ago

Discover PARM, a deep-learning model predicting human promoter activity with precision. Major leap for genomics, decoding cellular responses. PMID:41639451, Nature 2026, @Nature https://t.co/CB3gO9QnZm #AI #Pharma #BioMed #RNA #ASHG #ESHG

robbie_stats retweeted

Ryan Corces @ryancorces

4 months ago

Just posted a preprint on a huge new single-cell study of Parkinson’s and its application to understanding noncoding variation. 🧵below. Led by Shreya Menon and Adam Turner and in collab w/ GP2 (@ASAP_Research), @BelloyMichael, @ZihHuaFang and others. https://t.co/tBKZgq5bQ7

133

10K

robbie_stats retweeted

nature

@Nature

4 months ago

Nature research paper: Functional dissection of complex trait variants at single-nucleotide resolution https://t.co/Rf1GsXPsCE

153

19K

robbie_stats retweeted

Na Cai @caina89

4 months ago

Our new preprint “Learning lifetime disease liability reveals and removes genetic confounding in electronic health records” is now online! Link to paper: https://t.co/7UNDZkHvP4 This work is led by my postdoc @diyazheng_ and it’s our first project at @ETH_BSSE :) Thread 1/n

11K

robbie_stats retweeted

PFT Commenter

@PFTCommenter

4 months ago

15 years ago the entire nation of Canada synced up there piss breaks during the gold medal game. This is what were up against folks we have to lock in Sunday https://t.co/Och1gGMPAR

12K

549

777

971K

robbie_stats retweeted

David Kelley @drklly

4 months ago

We're hiring a postdoc in my group at @calico! We develop deep learning methods for regulatory genomics to predict how every nucleotide in the genome affects cell-type-specific gene regulation.

141

16K

robbie_stats retweeted

Jeremy

@jeremyli__

4 months ago

Today we’re announcing an algorithmic breakthrough. Herasight’s ImputePGTA algorithm has enabled couples around the world to access polygenic embryo testing from routine IVF data (PGT-A). Now it yields substantially higher accuracy, especially for underrepresented ancestries https://t.co/AMdndARRZi

147

46K
