No scaling laws for single-cell foundation models: when bigger atlases stop teaching the model anything
In language and vision, the recipe has been simple: more data, bigger models, better performance. Single-cell biology borrowed that playbook. Foundation models for transcriptomics jumped from 1 million cells to atlases of over 100 million, on the assumption that scale would unlock the same gains. Alan DenAdel and coauthors put that assumption to the test, and the result is sobering.
Working from a 22.2-million-cell corpus, they pretrained 400 models across five architectures (from PCA and a variational autoencoder up to the Geneformer transformer) and ran 6,400 evaluation experiments. They varied not just dataset size (1% to 75%) but also diversity, using cell-type re-weighting and geometric sketching to deliberately enrich rare cell types and transcriptional states.
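To make "cell-type re-weighting" concrete, here is a minimal sketch of one way to downsample a corpus so that rare cell types are enriched. This is not the authors' implementation. The inverse-frequency weighting, the function name `reweighted_subsample`, and the toy labels are all assumptions for illustration, and geometric sketching (which works on expression space rather than labels) is not covered.

```python
import math
import random
from collections import Counter


def reweighted_subsample(cell_types, fraction, seed=0):
    """Pick a fraction of cells without replacement, weighting each cell
    by 1 / (number of cells sharing its type). Assumed scheme, not the paper's."""
    rng = random.Random(seed)
    counts = Counter(cell_types)
    n_keep = max(1, round(fraction * len(cell_types)))
    keyed = []
    for idx, cell_type in enumerate(cell_types):
        u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is finite
        # Weighted sampling via random keys (Efraimidis-Spirakis), in log space:
        # key = log(u) / weight, with weight = 1 / counts[cell_type].
        keyed.append((math.log(u) * counts[cell_type], idx))
    keyed.sort(reverse=True)  # the largest keys are the sampled cells
    return sorted(idx for _, idx in keyed[:n_keep])


# Hypothetical toy corpus: one abundant type, one mid-sized, one rare.
labels = ["T cell"] * 9000 + ["B cell"] * 900 + ["rare progenitor"] * 100
kept = reweighted_subsample(labels, fraction=0.03)
print(Counter(labels[i] for i in kept))
```

Under uniform sampling the rare type would make up about 1% of the subsample. With inverse-frequency weights it takes a far larger share, which is the kind of diversity enrichment the study tested.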
The finding: performance saturates almost immediately. On cell-type classification, batch integration, and perturbation prediction, most models hit their ceiling at roughly 1% of the corpus, about 200,000 cells. Beyond that, adding millions more cells changed essentially nothing. More diversity didn't help. Even spiking in genome-scale Perturb-seq data, to give the models perturbed phenotypes rather than just healthy ones, failed to move the needle. Larger models did score better overall, but they too plateaued early on data.
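One way to read "saturates almost immediately" is to ask for the smallest data fraction whose score is already within a small tolerance of the best score seen. The sketch below does that. The helper `saturation_fraction`, the tolerance, and the scores in `toy_scores` are made-up placeholders chosen to show the shape of the argument. They are not results from the paper. Only the 22.2-million-cell corpus size comes from the post.

```python
def saturation_fraction(scores_by_fraction, tolerance=0.01):
    """Return the smallest data fraction whose score is within
    `tolerance` of the best score across all fractions."""
    best = max(scores_by_fraction.values())
    for fraction in sorted(scores_by_fraction):
        if best - scores_by_fraction[fraction] <= tolerance:
            return fraction


CORPUS_CELLS = 22_200_000  # corpus size stated in the post
cells_at_one_percent = CORPUS_CELLS // 100  # the "about 200,000 cells" figure

# Placeholder scores for illustration only, NOT numbers from the paper.
toy_scores = {0.01: 0.900, 0.10: 0.905, 0.25: 0.903, 0.50: 0.906, 0.75: 0.904}

print(cells_at_one_percent, saturation_fraction(toy_scores))
```

A flat curve like the toy one saturates at the first fraction. A curve that keeps climbing would return the largest fraction instead, which is what a real scaling law would look like.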
Two points stood out. Simple baselines (PCA, logistic regression) often matched or beat the transformers. And the strongest model, SCimilarity, won not because of size but because its contrastive training objective is aligned with the downstream task. For single-cell data, what you train on and how you frame the objective matter far more than how much you collect.
This reframes a quiet but expensive habit. In drug discovery, biotech, and any pipeline leaning on cell atlases, the instinct to keep scaling pretraining corpora may be burning compute for no return. The real leverage sits elsewhere: curating high-quality, task-relevant data and matching the training objective to the actual question you're trying to answer.
Paper: DenAdel et al., journal license | https://t.co/X7GxoxF5U5
1/ Our new study, led by @ding5066, examines the role of transcription factors during human neurogenesis to identify gene regulatory networks influencing cell fate, maturation, and subtype specification
https://t.co/GDtKk45kFt
Congratulations to Tomasz Nowakowski, PhD, on being named a finalist for the prestigious Blavatnik National Award for Young Scientists! 🎉 He was selected for his groundbreaking research shaping the future of neuroscience and medicine. @BlavatnikAwards https://t.co/KLOkD0em1P
Not only does UCSF's NIH-funded research advance health care and improve patients' lives, it has an estimated $18.7B ripple effect on the economy. The result is more innovative startups, more jobs, and stronger companies that hire workers nationwide. https://t.co/zi7zp0LFlG
What gave human brains the edge over apes? UCSF researchers found that tiny DNA changes helped neurons form more connections, driving complex thinking. But this evolution may also impact neurodevelopment. https://t.co/2hHUievtP5
A fantastic afternoon spent talking to the Arsenal Parkinson’s Walking Football squad. Such an honour to have contributed to our club’s rich history of impactful work in the community! @ParkinsonsUK @WhitHealth @uclh @Arsenal
Starting week 2 of the UCSF Cellular Electronics Minicourse - building logic gates with transistors, in order to better understand how logic can be implemented with proteins and DNA.