What if AI could explain why a protein is a kinase, not just tell you it is?
We built just that.
BioReason-Pro is a multimodal LLM that reasons about protein function — walking through domains, interactions, and biological context to make predictions you can actually evaluate.
Orthrus is now in Nature Methods (@naturemethods) 🔥🔥🚀🚀
Paper: https://t.co/Ry55kWVkXl
Code: https://t.co/UZeEw7bCgE
The core bet: existing genomic foundation models use masked language modeling or next-token prediction imported from NLP. They work. But they're not aligned with how RNA sequence relates to function.
Orthrus uses contrastive learning with two biologically grounded augmentations: splicing isoforms (same gene, different exon inclusion) and orthologous transcripts (same gene, different species). Both pairs should be functionally similar. The model learns by agreeing across them.
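To make the "agreeing across pairs" idea concrete, here is a minimal sketch of a contrastive objective over matched transcript pairs. It assumes a standard InfoNCE-style loss with in-batch negatives, which may differ from the loss Orthrus actually uses. The function names, the temperature value, and the toy 2-D embeddings are all hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss (assumed objective, not confirmed by the source).

    positives[i] is the matched view of anchors[i], e.g. a splice isoform
    or an ortholog of the same gene. Every other row in `positives` is
    treated as a negative for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of this anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of picking the true partner.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy embeddings: two transcripts and their matched views (hypothetical values).
transcripts = [[1.0, 0.0], [0.0, 1.0]]
matched_views = [[0.9, 0.1], [0.1, 0.9]]
mismatched_views = [[0.1, 0.9], [0.9, 0.1]]

print("loss with matched pairs:   ", info_nce(transcripts, matched_views))
print("loss with mismatched pairs:", info_nce(transcripts, mismatched_views))
```

The loss is low when each transcript sits closest to its own isoform or ortholog and high when it sits closer to an unrelated transcript, which is the pressure that pulls functionally similar sequences together in embedding space.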
Trained on 400+ mammalian species via the Zoonomia Project. Outperforms existing genomic models on 5 mRNA property prediction tasks, often beating task-specific supervised baselines with a linear head. SOTA on RNA half-life with 45 labeled examples.
The lesson isn't "more data" or "bigger model." It's that the pre-training objective has to mirror the structure of the biology. Evolution and splicing are the right teachers for mature RNA.
Huge congrats to the lead authors
@phil_fradkin and @ianshi3!
We taught a DNA model to learn its own tokenization.
It learned the genetic code with no supervision.
And it outperforms Evo 2's architecture with 3x faster inference.
Great work with Arnav (@arnavshah0), Victor (@victor_ljz), Parsa (@Radii2323), Brandon (@fluorane), Sukjun (@sukjun_hwang), Bo Wang (@BoWang87), Patrick Hsu (@pdhsu), Hani Goodarzi (@genophoria) and Albert Gu (@_albertgu) 🔥
BioReason-Pro was released less than 2 weeks ago, and the response has been incredible.
Already, 1,300+ users worldwide have signed up for the portal, and 3,000+ proteins have been tested.
We’re deeply grateful for all the thoughtful and constructive feedback.
Today, we’re open-sourcing 223,000+ protein reasoning traces from BioReason-Pro on @huggingface. We hope this helps drive more research into biological reasoning!
Dataset: https://t.co/TBPwy77mIN
Try it here: https://t.co/ejt2AQ562N
@anshulkundaje articulates something that AI-for-biology practitioners (or AI-for-science practitioners, for that matter) need to hear more often: we are far from a stage where scale alone solves biology. Deep domain expertise and principled interpretation (as opposed to cherry-picking results) are how we actually make progress. There's too much hubris right now in assuming one can brute-force through biological complexity without understanding it.
We have fixed a major inference bug in https://t.co/QJ8km69UXC, significantly improving the quality of reasoning.
Give BioReason-Pro another try! And please keep the feedback coming.
You can also find a guide on setting up the model locally at https://t.co/SBINSVnu7d
We (@arcinstitute, @UHN, and @VectorInst) recently released BioReason-Pro, a multimodal reasoning LLM for protein function prediction, trained via SFT on synthetic reasoning traces followed by RL.
I had a chance to interview @BoWang87 and @genophoria on their vision for the work and what comes next. Was fun to pick their brains on the bio!
Check out the interview: https://t.co/YWslLYKFxf
I just used BioReason-Pro on a gene I am subcloning and was quite impressed. The processing time is reasonable, and the results appear accurate. That said, the functional summary could be expanded to provide more depth and context. In addition, the GO-GPT predictions section would benefit from clearer guidance and more informative explanations.
Still, amazing work! Congratulations to @BoWang87, @genophoria, and @arcinstitute. I plan to use it more in my future research.
physical systems (orbits/fluid mechanics) may look complex, but are often governed by simple equations/few parameters. can current self-supervised methods learn the underlying physics?
our new paper finds that learning in latent space may be the key!
https://t.co/cvMKzx9qrQ 🧵
BioReason-Pro: Advancing Protein Function Prediction with Multimodal Biological Reasoning @arcinstitute
1. BioReason-Pro introduces the first multimodal reasoning large language model specifically designed for protein function prediction, combining protein embeddings with biological context to generate interpretable reasoning traces rather than just classification labels.
2. The system integrates ESM3 protein embeddings, a GO graph encoder, and biological context including organism, domains, protein-protein interactions, and GO-GPT predictions to perform step-by-step biological reasoning from sequence to function.
3. GO-GPT, a key component, is the first autoregressive transformer for Gene Ontology prediction that captures hierarchical and cross-aspect dependencies between GO terms, achieving state-of-the-art Fwmax of 0.65-0.70 across inference strategies.
4. The model was trained on over 130,000 synthetic reasoning traces generated by GPT-5 and further optimized through reinforcement learning with Group Sequence Policy Optimization, achieving 73.6% Fmax on GO term prediction.
5. Human protein experts preferred BioReason-Pro annotations over ground truth UniProt annotations in 79% of evaluated cases, with an LLM judge score of 8/10 for functional summaries, substantially outperforming previous methods.
6. Remarkably, BioReason-Pro de novo predicted experimentally confirmed binding partners with per-residue attention localizing to exact contact residues resolved in cryo-EM structures, demonstrating genuine structural reasoning capabilities.
7. The model successfully performed structural reasoning that overrode misleading superfamily-level domain annotations, such as correctly identifying CFAP61 as a non-enzymatic scaffold despite its Rossmann-like fold that typically indicates catalytic activity.
8. For eEFSec, BioReason-Pro identified SECIS-binding protein 2 as the obligate functional partner from sequence alone, with attention concentrated on the RIFT domain surface that matches the experimentally resolved SECIS RNA binding interface in PDB 7ZJW.
9. The system maintains strong performance even for proteins with very low sequence similarity to training data, with performance degrading much more slowly than BLAST as sequence identity decreases, indicating learned generalizable reasoning rather than simple homology transfer.
10. All model weights, code, and curated datasets are released publicly, alongside precomputed predictions for over 240,000 proteins including the Human Protein Atlas, enabling broad adoption for functional annotation of uncharacterized proteins.
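Since several of the points above report Fmax, a short sketch of how a protein-centric Fmax is typically computed may help. This follows the common CAFA-style definition as an assumption; it is not taken from the BioReason-Pro code, it does not propagate terms through the ontology, and it is not the weighted Fwmax variant mentioned in point 3. The function name, the threshold grid, and the toy term labels are hypothetical.

```python
def fmax(predictions, truths, thresholds=None):
    """Protein-centric Fmax (assumed CAFA-style definition).

    predictions: one dict per protein mapping term -> confidence score.
    truths: one non-empty set of true terms per protein.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for scores, true_terms in zip(predictions, truths):
            predicted = {term for term, score in scores.items() if score >= t}
            hits = len(predicted & true_terms)
            # Precision is averaged only over proteins with a prediction at t.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall is averaged over every protein.
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truths)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example with placeholder term labels (not real GO identifiers).
toy_predictions = [
    {"term_a": 0.9, "term_b": 0.2},
    {"term_c": 0.8},
]
toy_truths = [{"term_a"}, {"term_c"}]
print("Fmax on toy data:", fmax(toy_predictions, toy_truths))
```

Because the score is the best F1 over all confidence thresholds, it rewards models whose confidence ranking separates true terms from false ones, regardless of where the cutoff happens to sit.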
💻Code: https://t.co/52TcS08BmC
📜Paper: https://t.co/YrF9y6yaHW
#BioReasonPro #ProteinFunction #ComputationalBiology #Bioinformatics #MachineLearning #LLM #GeneOntology #ProteinStructure #FunctionalAnnotation #AIforScience
Our X-cell is up on @biorxiv_bioinfo!
Read our full paper at https://t.co/qdLD7mTIDy
Part of the data and the model weights will be shared soon. Stay tuned!
1/ A year ago, I was skeptical about LLMs and RL in biology. Today, I’m inspired by the results and the massive potential ahead. Biomedical AI is thriving thanks both to the visionaries imagining and building the future and to those who remind us of the limitations ...
@GongDennis Thank you for checking it out!
There's very high traffic right now and some of the requests might fail 🥲 You can find the Catalogue if you scroll down on the home page (we're making a fix so it autoscrolls).
BioReason-Pro was trained on synthetic reasoning traces from GPT-5.
While the coding agent hype train is at full speed, the true impact of LLMs will come in biology.
This is the best time to be a Bio AI researcher. You finally have the tools to address humanity's most challenging problems.
Today we took the first step by releasing a reasoning model for proteins. Can't wait to see what we release in 10 years.
Ad vitam!