We started Theorema a few months ago with one vision: autonomous science. Self-driving labs that design, run, and interpret their own experiments, with the AI doing the science.
Today we share our first preprint.
First, the problem it addresses. Enzyme cascades are how much of modern medicine, advanced materials, and green chemistry is produced. Chain several enzymes together and they perform in a single pot what would otherwise take a chemist many separate steps. The difficulty is tuning them. Enzyme ratios, pH, temperature, and buffer composition all interact at once, and the parameter space grows combinatorially. The conventional approach is a scientist running reactions one at a time for months.
Loschmidt Labs spent years building CascadeMAP, a self-driving microfluidic lab. It generates thousands of nanoliter droplet reactions and uses Bayesian optimization to converge on the best conditions without supervision. Over seven days it ran roughly 220,000 reactions across 7,400 conditions, with no one in the room.
The platform was validated on two very different cascades: (i) a glycerol detection pathway (monitored by fluorescence) and (ii) a 1,2,3-trichloropropane degradation pathway (monitored by label-free Raman spectroscopy), demonstrating its versatility across detection modalities and application domains.
Then we added Theorema on top of it.
If CascadeMAP is the experimental engine, Theorema is the scientist directing it. The meta layer. Our multi-agent system designed the experiments and the optimization strategy, then analyzed 11 GB of raw results across 23 campaigns, reconstructed what had happened and why, and recommended the next round. CascadeMAP ran the fast loop within each campaign. Theorema closed the slower loop between campaigns: design, interpretation, redesign. That is the work that has always required a principal investigator. And the value of that loop was concrete. Theorema saw that the search had concentrated on high-performing pockets rather than mapping the whole space, identified the variables that actually drove performance, and explained why the landscape held so many local optima. That is exactly the read needed to design a sharper next campaign.
That's what I mean by autonomous science. It is not one AI doing everything, but several roles working together. Theorema can supply all of them, or join your existing models and provide the reasoning layer at a scale, speed, and depth no human team can match.
A few months in, the system we set out to build is running real wet labs.
Authors: Michal Vašina, David Kovář, Martin Kizovsky, David Lacko, Pavel Vaňáček, Maximilian Herich, Eduard Volf, Lukas Drdla, Sona Cabalova, Pavlina Sikorova, @MichaelJirasek, Pavel Solansky, Jan Ježek, Ota Samek, @fdousek, Hynek Walner, Pavel Zemanek, Andrew deMello, Zdenek Pilat, @JiriDamborsky, Stavros Stavrakis, Stanislav Mazurenko, Zbynek Prokop
This work is a joint effort of @MasarykUni Masaryk University, St. Anne's University Hospital Brno, @ETH ETH Zurich, @CzechAcademy the Czech Academy of Sciences, and @theorema_ai Theorema. Many thanks to our partners for a fantastic collaboration.
If you're looking to accelerate your R&D program, talk to us.
(preprint link in comments.)
No scaling laws for single-cell foundation models: when bigger atlases stop teaching the model anything
In language and vision, the recipe has been simple: more data, bigger models, better performance. Single-cell biology borrowed that playbook. Foundation models for transcriptomics jumped from 1 million cells to atlases of over 100 million, on the assumption that scale would unlock the same gains. Alan DenAdel and coauthors put that assumption to the test, and the result is sobering.
Working from a 22.2-million-cell corpus, they pretrained 400 models across five architectures (from PCA and a variational autoencoder up to the Geneformer transformer) and ran 6,400 evaluation experiments. They varied not just dataset size (1% to 75%) but also diversity, using cell-type re-weighting and geometric sketching to deliberately enrich rare cell types and transcriptional states.
The finding: performance saturates almost immediately. On cell-type classification, batch integration, and perturbation prediction, most models hit their ceiling at roughly 1% of the corpus, about 200,000 cells. Beyond that, adding millions more cells changed essentially nothing. More diversity didn't help. Even spiking in genome-scale Perturb-seq data, to give the models perturbed phenotypes rather than just healthy ones, failed to move the needle. Larger models did score better overall, but they too plateaued early on data.
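As a quick sanity check on the numbers above, here is a minimal sketch of the subset sizes implied by the quoted corpus and the 1% and 75% endpoints. The function name is ours, and the fractions tested between those endpoints are not stated here, so only the two quoted values are used.

```python
# Subset sizes implied by the quoted 22.2-million-cell corpus.
CORPUS_CELLS = 22_200_000

def subset_size(percent: int) -> int:
    """Number of cells in a subsample covering `percent` of the corpus."""
    return CORPUS_CELLS * percent // 100

smallest = subset_size(1)    # the point where most models reportedly saturate
largest = subset_size(75)    # the largest subsample mentioned

print(f"1% of corpus:  {smallest:,} cells")
print(f"75% of corpus: {largest:,} cells")
```

So the "about 200,000 cells" figure is 222,000 cells, and the largest subsample is 75 times that with essentially no reported gain.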
Two points stood out. Simple baselines (PCA, logistic regression) often matched or beat the transformers. And the strongest model, SCimilarity, won not because of size but because its contrastive training objective is aligned with the downstream task. For single-cell data, what you train on and how you frame the objective matters far more than how much you collect.
This reframes a quiet but expensive habit. In drug discovery, biotech, and any pipeline leaning on cell atlases, the instinct to keep scaling pretraining corpora may be burning compute for no return. The real leverage sits elsewhere: curating high-quality, task-relevant data and matching the training objective to the actual question you're trying to answer.
Paper: DenAdel et al., journal license | https://t.co/X7GxoxF5U5
We added >220K FDA regulatory and >1M clinical trial docs to #paperclip. All natively indexed for agents and free.
Now agents can easily reason over clinical studies w/o web search!
E.g.: find all trials that were approved despite a missing endpoint https://t.co/30GGqfCQmO
Took me a while to figure out what all the ESMFold2 rage was about. At first, the benchmarking data didn't look super remarkable to me, but it turns out there are many impressive aspects:
- Fully open source, open weights + massive ESM Atlas (1.1B structures vs 0.2B for AF3).
- SOTA performance despite no MSA use. MSA search and triangular attention were simply taken out of the base model.
- Direct consequence, super low latency inference: 1024-residue protein structure prediction in 9 secs, still outperforming prior models on antibody-antigen tasks.
- Best in class PPI and antibody-antigen results. 65% pass rate on antibody-antigen benchmarks after inference-time scaling, significant improvement over AF3.
- Tons of experimental data, in particular with lab-validated miniprotein binders plus single-chain antibodies across 5 targets in cancer and immunology. Binding affinities consistent with therapeutic activity.
- Inference-time scaling benefits PPI: Multiple seeds + selection by confidence show real gains on challenging antibody-antigen predictions, leading to comments/hypotheses that it has learned an energy-function-like behavior via the folding module.
- Base model works without MSAs, but providing them further boosts prediction quality on difficult protein-protein interaction cases.
One caveat: No true scoring for protein-protein interactions, making it harder to assess which specific residues or domains are reliably involved in binding.
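The inference-time scaling point above (multiple seeds plus selection by confidence) is essentially best-of-N sampling. A minimal sketch, assuming a predictor that returns a scalar confidence per run: `predict_complex` below is a hypothetical stand-in that fakes the confidence with a seeded random number. It is not the ESMFold2 API, and the real selection metric may differ.

```python
import random
from dataclasses import dataclass

@dataclass
class Prediction:
    seed: int
    confidence: float  # model-reported confidence, higher is better

def predict_complex(sequence: str, seed: int) -> Prediction:
    """Hypothetical stand-in for a structure predictor.

    A real model would return coordinates plus a confidence score;
    here the confidence is faked deterministically from the seed.
    """
    rng = random.Random(f"{sequence}:{seed}")
    return Prediction(seed=seed, confidence=rng.random())

def best_of_n(sequence: str, seeds: range) -> Prediction:
    """Run one prediction per seed and keep the most confident one."""
    candidates = [predict_complex(sequence, s) for s in seeds]
    return max(candidates, key=lambda p: p.confidence)

best = best_of_n("EVQLVESGG", range(8))
print(f"kept seed {best.seed} with confidence {best.confidence:.3f}")
```

More seeds can only raise the selected confidence, never lower it. Whether that turns into better structures depends on how well confidence tracks accuracy, which is where the energy-function-like hypothesis comes in.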
Today we're announcing ESMFold2, an open scientific engine to power prediction, design, and discovery across protein biology.
The new model delivers state-of-the-art performance on protein interactions, especially antibodies, a critical modality for therapeutics.
We have designed and validated miniprotein binders and single-chain antibodies across five therapeutic targets that are important in cancer and immunology. We are seeing very high success rates, and affinities at levels consistent with therapeutic activity.
We’re also releasing an atlas of 6.8 billion proteins, and 1.1 billion predicted structures.
ESMFold2 is built on a state-of-the-art language model that has been trained on billions of protein sequences.
A world model of protein biology emerges through language modeling.
We’ve used the techniques of mechanistic interpretability developed to understand large language models to understand the concepts ESM uses to represent proteins.
The model’s representation space has a compositional organization of features across scales and levels of complexity and abstraction, one that mirrors the understanding of protein biology developed through a century of empirical science.
This understanding emerges without prior knowledge, just from language modeling of protein sequences.
Language models are becoming a powerful substrate to understand and program biology.
The design of protein interactions is one of the most fundamental problems in biophysics, and has critical implications for the discovery of new medicines. A simple gradient-based search with the model was able to discover high-affinity protein binders.
I'm excited by the potential this has to accelerate basic science and the understanding of proteins, and especially by the new avenues it opens up for therapeutic design and medicine.
For over a decade, we’ve accepted that end-to-end backprop is the only way to train deep networks. But holding the entire network in memory all at once is why AI training is hitting a resource wall.
We found a new way to break the network into blocks and train them independently. The trick? Treating the network’s forward pass like a diffusion model denoising a signal.
This reinterpretation slashes the memory needed to train deep models. In our #ICLR2026 paper (https://t.co/PK5h0mqQSo), we matched end-to-end performance across ViTs, DiTs, and LLMs. We did this while training just one isolated block at a time.
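To make the memory argument concrete, here is a back-of-the-envelope sketch. It is a toy model and not the paper's accounting: it assumes the activations kept for backprop grow linearly with the number of blocks sharing one computation graph, and it ignores parameters and optimizer state. The depth and per-block footprint are hypothetical.

```python
def peak_activation_gb(blocks_in_graph: int, gb_per_block: float) -> float:
    """Toy model: stored activations scale linearly with the number
    of blocks that are backpropagated through together."""
    return blocks_in_graph * gb_per_block

NUM_BLOCKS = 24      # hypothetical network depth
GB_PER_BLOCK = 1.5   # hypothetical activation footprint of one block

end_to_end = peak_activation_gb(NUM_BLOCKS, GB_PER_BLOCK)  # whole network at once
block_wise = peak_activation_gb(1, GB_PER_BLOCK)           # one isolated block

print(f"end-to-end: {end_to_end:.1f} GB, block-wise: {block_wise:.1f} GB")
```

Under these assumptions the activation saving equals the number of blocks. The real figure depends on the architecture and on what each block needs as input.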