Yeah exactly.
I've been a fan of protein language models since 2021, but then realized that protein language on its own is not that interpretable; once you have structure, it starts to be.
Design based 100% on sequence is far harder than looking at folds, computing RMSDs, and analyzing sequences. Structure is the semantics of protein language models.
@nanogenomic @proteinrosh Good to see a mindset like this! I’m glad this discussion was productive and led to additional analysis!
The only thing I still miss is the expected hit rate above/below some ipSAE/PAE threshold, so the result is more interpretable. I guess that’s the next step.
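A minimal sketch of what that hit-rate split could look like. The function name, the 0.5 cutoff, and the toy data are all made up for illustration, and it assumes a higher-is-better score such as ipSAE (flip the comparison for a lower-is-better metric like PAE):

```python
from typing import Optional, Sequence, Tuple


def hit_rate_by_threshold(
    scores: Sequence[float], hits: Sequence[bool], threshold: float
) -> Tuple[Optional[float], Optional[float]]:
    """Return (hit rate at/above threshold, hit rate below threshold).

    A bin with no designs in it returns None instead of a rate.
    """
    above = [h for s, h in zip(scores, hits) if s >= threshold]
    below = [h for s, h in zip(scores, hits) if s < threshold]

    def rate(xs):
        return sum(xs) / len(xs) if xs else None

    return rate(above), rate(below)


# Toy data (hypothetical): one in silico score and one lab outcome per design.
scores = [0.82, 0.75, 0.64, 0.58, 0.41, 0.33, 0.29, 0.12]
hits = [True, True, False, True, False, False, True, False]

above, below = hit_rate_by_threshold(scores, hits, threshold=0.5)
print(f"hit rate >= 0.5: {above:.2f}, hit rate < 0.5: {below:.2f}")
```

Sweeping the threshold over a grid and reporting both rates (plus the number of designs in each bin) gives a rough picture of how much a cutoff actually buys you.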
Be wary of this kind of reasoning. Biology is extremely humbling.
I once analyzed 2 different structural models. One of them was overconfident - nothing it picked expressed solubly. The other model had "mid" metrics, but everything worked perfectly.
Everybody is gangsta in in silico metrics, until it has to be validated in the lab.
Hmm, here are some sources that can be useful for validating an "oracle" for de novo design.
https://t.co/6LhNoNIdIY
https://t.co/l88PttbJ71
Both datasets are pretty easy to analyze and very valuable.
It gets more tricky if we speak about how to validate the efficacy of design campaigns, but it's a completely different debate.
Most of the researchers/founders don't want to ship actual drugs. They want to show their method/platform is crushing it.
Those problems are good model problems because many people know them or have tried to solve them, so it's easy to compare different methods.
If somebody cares about shipping things, they apply existing methods to low-hanging fruit. To be fair, I think a lot of the protein design race is an irrational chase for new de novo methods, while we can already solve an enormous number of problems with the methods we have.
@nanogenomic Now it’s much clearer!
That clearly explains the left and middle panels!
Can you explain how exactly C was validated?
I don’t clearly see the connection between the experimental result and the prediction.
Well, I strongly disagree with this because it leads to false conclusions as it’s not grounded in the data.
You have plenty of data from campaigns such as protein database or David Baker lab paper.
You can just use this data. This is perfect data for this purpose - actually that’s how @TimothyPJenkins' team validated ipSAE as a successor to iPAE.
Hmm, I think I would rather look at the correlation with experimental success rate than at purely in silico metrics.
I’ve tested 2 different folding tools on some targets. RFdiffusion -> MPNN -> folding.
First - the in silico hit rate did not correlate with experimental success. One model was just hallucinating more often.
Second - I think folding tool performance depends on the target. For example, ESMFold2 crushes Chai-1 on influenza-like structures, but is worse on RSV.
Before I run any campaign I always look for the optimal model and hyperparams for a given structure family. Actually I prefer having a few models. One for screening, a second for final validation.
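One way to check which folding tool's score actually tracks lab outcomes for a given target family is a plain rank-based AUROC per tool. This is a sketch with made-up scores and labels, not the comparison I ran; `tool_a` and `tool_b` are placeholder names:

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Probability that a random experimental hit outscores a random miss.

    0.5 means the score carries no ranking signal; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one hit and one miss")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): the same 6 designs scored by two folding tools.
worked_in_lab = [True, True, False, True, False, False]
tool_a = [0.90, 0.80, 0.70, 0.60, 0.30, 0.20]
tool_b = [0.95, 0.40, 0.90, 0.50, 0.92, 0.85]  # confident on almost everything

auroc_a = auroc(tool_a, worked_in_lab)
auroc_b = auroc(tool_b, worked_in_lab)
print(f"tool_a AUROC: {auroc_a:.2f}, tool_b AUROC: {auroc_b:.2f}")
```

In this toy case tool_b would pass more designs at any fixed cutoff, but its ranking is worse than chance, which is the "hallucinating more often" pattern.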
That's really interesting, but what may be surprising - I think the value is bigger in protein design than in pure structure prediction.
Every time I pick final candidates for testing, I bootstrap confidence intervals of the structural metrics. I don't pick only the best candidates - I pick those with low variance of scores.
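A rough sketch of that selection step, assuming each candidate has the same metric from several repeated folding runs (different seeds or models). The data, names, and the choice to rank by the lower confidence bound are illustrative assumptions:

```python
import random
import statistics
from typing import Dict, List, Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float], n_boot: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def rank_candidates(runs: Dict[str, List[float]]) -> List[Tuple[str, float, float, float]]:
    """Rank by the lower CI bound, which favors scores that are high AND stable."""
    rows = []
    for name, values in runs.items():
        lo, hi = bootstrap_ci(values)
        rows.append((name, statistics.fmean(values), lo, hi))
    return sorted(rows, key=lambda r: r[2], reverse=True)


# Toy data (hypothetical): one metric, 5 repeated folding runs per design.
runs = {
    "design_A": [0.90, 0.50, 0.95, 0.45, 0.92],  # higher mean, unstable
    "design_B": [0.74, 0.72, 0.73, 0.75, 0.71],  # slightly lower mean, stable
}
ranking = rank_candidates(runs)
for name, mean, lo, hi in ranking:
    print(f"{name}: mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Here design_A has the better mean but design_B ranks first, because its interval is tight.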
Hey guys (cc: @proteinrosh @TomSercu @sokrypton @biohub @AyushmanMallick) - to dig into this I used a protein I’ve worked on, RSV prefusion F, which has well-characterised flexible loops (the site-Ø epitope).
Hypothesis: the model is generally well-calibrated, but overconfident specifically on intrinsically flexible regions.
I measured each residue’s intrinsic flexibility as the Cα-lDDT between 3 different PDB structures (and between chains within each PDB), then folded the same sequences with ESMFold2 (with MSA) and compared pLDDT vs actual lDDT (vs the experimental structures) - calibration error on the y-axis, intrinsic flexibility on the x-axis.
Result: pLDDT tracks true accuracy well overall (r≈0.83), but the more flexible the region, the more overconfident the model gets - overconfidence averages +7.6 lDDT pts on flexible residues vs +1.4 on rigid ones.
This is easy to miss in standard evals: well-ordered residues vastly outnumber flexible ones, so aggregate calibration looks fine while the flexible-region overconfidence gets averaged away - a flexibility-class imbalance.
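The averaging effect is easy to reproduce with a stratified calibration check. The sketch below uses made-up per-residue numbers (not the RSV F data), and the binary flexible/rigid flag is an assumption, e.g. a cutoff on the cross-structure Cα-lDDT:

```python
from statistics import fmean
from typing import Dict, Sequence


def overconfidence_by_class(
    plddt: Sequence[float], lddt: Sequence[float], is_flexible: Sequence[bool]
) -> Dict[str, float]:
    """Mean (pLDDT - true lDDT) overall and per flexibility class.

    Positive values mean the model is overconfident.
    """
    gaps = [p - t for p, t in zip(plddt, lddt)]
    flexible = [g for g, f in zip(gaps, is_flexible) if f]
    rigid = [g for g, f in zip(gaps, is_flexible) if not f]
    return {"all": fmean(gaps), "flexible": fmean(flexible), "rigid": fmean(rigid)}


# Toy data (hypothetical): 8 rigid residues and 2 flexible ones.
plddt = [91, 91, 91, 91, 91, 91, 91, 91, 78, 78]
lddt = [90, 90, 90, 90, 90, 90, 90, 90, 70, 70]
is_flexible = [False] * 8 + [True] * 2

result = overconfidence_by_class(plddt, lddt, is_flexible)
print(result)
```

The aggregate gap looks small because 8 of 10 residues are rigid, while the flexible class alone is off by several points. Reporting per-class (or per-flexibility-bin) error avoids that.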
@proteinrosh @DdelAlamo Oh nice, good to see that works!
I'm doing something similar, but each set of critics was trained on another modality (for example solubility, thermostability). The key problem is how to tune the ensemble weights as models tend to have different scales.
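One common way to deal with the scale mismatch (not necessarily what I use in practice) is to z-score each critic across the candidate pool before applying weights. The critic names, values, and equal weights below are placeholders:

```python
from statistics import fmean, pstdev
from typing import Dict, List, Sequence


def zscore(values: Sequence[float]) -> List[float]:
    """Standardize one critic's scores across the candidate pool."""
    mu, sigma = fmean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # constant critic carries no ranking signal
    return [(v - mu) / sigma for v in values]


def ensemble_score(
    critic_scores: Dict[str, Sequence[float]], weights: Dict[str, float]
) -> List[float]:
    """Weighted sum of z-scored critics, one value per candidate."""
    normalized = {name: zscore(vals) for name, vals in critic_scores.items()}
    n = len(next(iter(critic_scores.values())))
    return [
        sum(weights[name] * normalized[name][i] for name in critic_scores)
        for i in range(n)
    ]


# Toy data (hypothetical): solubility in [0, 1], thermostability as Tm in deg C.
critic_scores = {
    "solubility": [0.2, 0.5, 0.8],
    "thermostability": [70.0, 55.0, 60.0],
}
weights = {"solubility": 1.0, "thermostability": 1.0}

combined = ensemble_score(critic_scores, weights)
best = max(range(len(combined)), key=combined.__getitem__)
print(combined, "best candidate index:", best)
```

Without normalization the raw sum is dominated by the Tm values and candidate 0 wins; after z-scoring, the weights mean what they say and candidate 2 comes out on top. Rank-based normalization is an alternative if the critics have heavy-tailed outputs.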
@DdelAlamo Yes, there is, but it's getting deprioritized over other metrics. You can see that by reading early David Baker work and comparing it with RFdiffusion.
Here is a nice paper comparing multiple different metrics; you should find SASA there:
https://t.co/h1rJAUEgHu