It's been exhilarating to watch this model get better and better, and I’m grateful to work with such an incredible, cross-disciplinary team across folding, binder design, and interpretability! This paper also sets a new scaling law for papers, compressing 3 papers into 1.
Today we're announcing ESMFold2, an open scientific engine to power prediction, design, and discovery across protein biology.
The new model delivers state-of-the-art performance on protein interactions, especially antibodies, a critical modality for therapeutics.
We have designed and validated miniprotein binders and single chain antibodies across five therapeutic targets that are important in cancer and immunology. We are seeing very high success rates, and affinities at levels consistent with therapeutic activity.
We’re also releasing an atlas of 6.8 billion proteins, and 1.1 billion predicted structures.
ESMFold2 is built on a state-of-the-art language model trained on billions of protein sequences.
A world model of protein biology emerges through language modeling.
We've applied mechanistic interpretability techniques, originally developed for large language models, to understand the concepts ESM uses to represent proteins.
The model's representation space has a compositional organization of features across scales, levels of complexity, and levels of abstraction that mirrors the understanding of protein biology developed through a century of empirical science.
This understanding emerges without prior knowledge, just from language modeling of protein sequences.
Language models are becoming a powerful substrate to understand and program biology.
The design of protein interactions is one of the most fundamental problems in biophysics, and has critical implications for the discovery of new medicines. A simple gradient-based search with the model was able to discover high-affinity protein binders.
I'm excited by the potential this has to accelerate basic science and the understanding of proteins, and especially by the new avenues it opens up for therapeutic design and medicine.
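To make the idea of a gradient-based sequence search concrete, here is a minimal toy sketch. It is not the ESMFold2 protocol: `toy_score` is a hypothetical stand-in for a model-derived objective, and `one_hot_preference` and `design` are illustrative names invented for this example. The sequence is relaxed to per-position softmax probabilities, the logits are updated by gradient ascent, and the result is discretized at the end.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def softmax(row):
    """Numerically stable softmax over one position's logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_preference(target):
    """Hypothetical scoring weights: 1.0 for the preferred residue, else 0.0."""
    return [[1.0 if aa == t else 0.0 for aa in AMINO_ACIDS] for t in target]


def toy_score(probs, preference):
    """Toy objective (an assumption, standing in for a real model score)."""
    return sum(p * w for prow, wrow in zip(probs, preference)
               for p, w in zip(prow, wrow))


def design(preference, steps=200, lr=1.0):
    """Gradient ascent on relaxed sequence logits, then argmax decoding."""
    logits = [[0.0] * len(AMINO_ACIDS) for _ in preference]
    for _ in range(steps):
        for i, weights in enumerate(preference):
            p = softmax(logits[i])
            expected = sum(pk * wk for pk, wk in zip(p, weights))
            # For score_i = sum_k p_k * w_k with p = softmax(z):
            # d score_i / d z_k = p_k * (w_k - expected)
            for k in range(len(AMINO_ACIDS)):
                logits[i][k] += lr * p[k] * (weights[k] - expected)
    return "".join(
        AMINO_ACIDS[max(range(len(row)), key=lambda k: row[k])]
        for row in logits
    )


if __name__ == "__main__":
    print(design(one_hot_preference("MKV")))
```

In a real setting the objective would come from a differentiable model rather than a fixed weight table, and the gradient would typically be obtained by automatic differentiation instead of the closed form used here.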
How to design your own PD-1 binder in 4 easy steps:
1. Download the tutorial notebook from the ESM team
2. Get a @modal API key to scale it up
3. Scale it up: O($1000) will get you a 96-well plate of minibinders with >50% success rates on typical targets
4. Test it in the lab!
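As a back-of-the-envelope reading of the numbers in step 3, the sketch below treats the order-of-magnitude budget as exactly $1000 and the success rate as exactly 50%. Both are assumptions made only for this arithmetic, and `plate_estimate` is a hypothetical helper. Because the stated rate is ">50%", the hit count is a lower bound and the cost per hit an upper bound.

```python
def plate_estimate(budget_usd, wells, success_rate):
    """Return (expected hits, cost per design, cost per expected hit)."""
    expected_hits = wells * success_rate
    cost_per_design = budget_usd / wells
    cost_per_hit = budget_usd / expected_hits
    return expected_hits, cost_per_design, cost_per_hit


if __name__ == "__main__":
    # Assumed inputs: $1000 budget, 96 wells, 50% success rate.
    hits, per_design, per_hit = plate_estimate(1000, 96, 0.5)
    print(f"expected hits: {hits:.0f}")
    print(f"cost per design: ${per_design:.2f}")
    print(f"cost per expected hit: ${per_hit:.2f}")
```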
Yes! This protocol is an open-source replication of the protocol we describe in our preprint. The hit-rates we report were validated in the lab across multiple targets via biolayer interferometry (BLI).
You can read more about how we validated our binders in our preprint, including functional screens for PD-L1, epitope validation via ELISA, and even Cryo-EM structure determination!
https://t.co/xoCRtAAm33
ESMFold2 can be inverted to design new protein binders including miniproteins and scFvs! Take our protocol for a spin on @modal!
https://t.co/cJ2O5yYRox
I’m so excited about the launch of ESMFold2, ESMC, and the new ESM Atlas. This was a massive team effort, and I’m grateful to have worked with such an incredible group @biohub.
A headline result I’m especially excited about: ESMFold2 can design minibinders and antibodies with nanomolar affinity, target selectivity, and functional activity against therapeutically relevant targets.
Today, we’re sharing the full binder design protocol.
Characterizing AI-designed proteins requires quantitative biochemistry at massive scale. Enter Amplicon/Protein Bead Display (APB-Display), a fully in vitro platform that quantifies Kd's for >100,000 variants in <3 days (preprint link below!) @Stanford_ChEMH @czbiohub (1/n)
A few edible plants have proteins that sit close to miraculin in the ESM Protein Atlas, so I thought I'd try extracting what protein I could from said plants and tasting it... Anyway, null result but an excuse to muck about :) Video lab notes: https://t.co/FwPzCU8R6O
We built a joint experimental and computational platform for scalable multi-modal single-cell chemical screens — profiling RNA, protein (including phospho-signaling), and chromatin accessibility responses to thousands of small molecule perturbations in parallel. https://t.co/M5x4CNLCTA
We've done a million of these deep dives into interpreting and understanding ESMC features, it's just that we don't quite know how to write about them other than to say "here are a bunch of cool observations".
@ebetica @anshulkundaje It looks like that made a big difference. It found a much higher confidence pose (0.85 ipTM vs 0.81 with AF3 and 0.8 earlier with ESM2) that actually makes much more sense than the original pose and plausibly explains its MOA. Also no artifacts. Amazing work!
@alexechu_ @hla_michael We evaluated the 6B SAEs extensively but haven't been as thorough with the 300M and 600M models. But we released SAEs for all the ESMC models at all layers in the hopes that the community takes a look! https://t.co/RssvyHZEA5
One feature of the @biohub ESM C release that I think deserves more attention is the interpretability of its latent space.
There has been a lot of discussion about whether interpretability is useful for scientific ML models. I think it can become very useful, especially when AI agents can use a model’s internal representations to reason about biology.
Here is one example in which an AI agent with access to ESM C SAE features correctly interprets the loss-of-function mechanism behind a variant.
There is still a lot to improve in how AI agents use model interpretability, but this is an exciting direction for AI agents that don’t just make predictions, but inspect learned representations to generate mechanistic hypotheses.
Read more in our blog: https://t.co/QmJlCzJVe4
We've also released the SAE-enabled skills for variant interpretation, loss-of-function analysis, structural annotation, functional mechanism interpretation, and evaluation against experimental datasets via ToolUniverse @ScientistTools
Thanks to the team behind this! @GaoShanghua @_yepeng @marinkazitnik @countablyfinite @HarvardDBMI @harvardmed @Harvard @KempnerInst
Cool work showing that shifts in the ESMC latent space can be interpreted via agentic workflows to give some mechanistic insight into variant effects. I think we're just at the beginning for this type of analysis.
AI agents are learning to read @biohub protein models @GaoShanghua @AdaFang_ @_yepeng
https://t.co/hPR7IYr9f0
We explored how AI agents powered by ToolUniverse @ScientistTools can interact with new ESM models
🧬 Mutation and loss-of-function analysis
Agents compare reference and mutant proteins, identify SAE features most affected by a mutation, and connect those perturbations to structural and functional consequences. The agents then relate these changes to experimental evidence, including deep mutational scanning measurements, to explain potential loss-of-function mechanisms
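The comparison step described above can be sketched roughly as follows. This is an illustrative toy, not the ToolUniverse skill or any released API: `top_shifted_features` is a hypothetical helper, and the activation vectors are made-up numbers standing in for SAE feature activations of a reference and a mutant protein.

```python
def top_shifted_features(ref_acts, mut_acts, k=3):
    """Rank feature indices by absolute activation change (mutant - reference)."""
    if len(ref_acts) != len(mut_acts):
        raise ValueError("activation vectors must have the same length")
    deltas = [(i, m - r) for i, (r, m) in enumerate(zip(ref_acts, mut_acts))]
    # Largest absolute shift first; the sign shows gain (+) or loss (-).
    deltas.sort(key=lambda item: abs(item[1]), reverse=True)
    return deltas[:k]


if __name__ == "__main__":
    # Made-up activations for four hypothetical SAE features.
    reference = [0.0, 2.0, 0.5, 1.0]
    mutant = [0.0, 0.25, 0.75, 3.0]
    for index, delta in top_shifted_features(reference, mutant, k=2):
        print(f"feature {index}: change {delta:+.2f}")
```

An agent would then look up what the highest-ranked features represent and check those hypotheses against experimental measurements such as deep mutational scanning data.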
🧪 Functional mechanism exploration
Agents analyze protein representations to identify functional tracks associated with specific molecular activities. By linking SAE features to protein regions, structures, and annotations, the agents can generate hypotheses about how proteins carry out their functions
Check out new SAE-enabled ToolUniverse skills for variant interpretation, loss-of-function analysis, structural annotation, functional mechanism interpretation, and evaluation against experimental datasets
@HarvardDBMI @harvardmed @Harvard @broadinstitute @KempnerInst
Super excited to share what we've been working on!
ESMC/ESMFold2 show that protein language modeling learns the principles of protein biology and can be used for state-of-the-art structure prediction and design.
We also built an interactive atlas of over 6.8 billion proteins!