We’re excited to share the full binder design protocol. Check it out here: https://t.co/AtkipkiYtS.
The notebook includes support for @modal to easily scale up binder generation.
Give it a try and let us know how it works!
You can read more about ESMFold2, ESMC, ESM Atlas, and the full results in the paper here: https://t.co/M3rt00pU8Z.
Hugging Face is the home for AI & ML across every domain, including biomedical!
The @NIH just added the @huggingface Hub to its official list of Generalist Repositories for data sharing.
NIH-funded? You can point to the Hub in your data sharing plan 🤗
on hugging science: mattergen ⚛️
generative ai for materials. you give it a target property, it proposes novel inorganic crystal structures to match. inverse design instead of screen-and-pray.
built for energy, catalysis and functional materials research. weights on the hub.
today was a massive day for protein engineering.
esmfold2 dropped—next gen of the esm series, fully open on @huggingscience. 1.1 billion predicted structures, 6.8 billion sequences. 800m more entries than the alphafold db, and reportedly edging out alphafold3 on protein complexes, including antibody–antigen binding.
alongside it: the new esm atlas. a huge expansion of known protein space, heavy on metagenomic sequences from soil, ocean, and the parts of biology that have been least characterised (until now!!)
and if that weren't enough, litefold dropped the fineweb of proteins: every major protein database (pdb included) aggregated, cleaned, and made plug-and-play in one place.
these are the releases that push the whole field forward, and the pace of open science right now is almost motion-sickness inducing
all of it on https://t.co/T4l4r1lDz0 (and ofc @huggingface)
What can a DNA foundation model actually do?
We got this question a lot after releasing Carbon, our new DNA model. Here are three things it does.
🧬 All live in our demo: https://t.co/8NtRlHQG3H
📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇
https://t.co/MSPMwnbhVt
@AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows.
1/6🧵
Introducing Carbon 🧬 a family of open generative DNA foundation models. Carbon-3B matches Evo2-7B while running 250x faster at inference. It can generate new DNA sequences and score the functional impact of mutations, zero-shot.
We borrowed a lot from how modern LLMs are trained, but DNA isn't language. Genomes are noisy, redundant, and shaped by evolution rather than communication. So we adjusted the recipe:
Tokenizer. Most genomic models tokenize at the nucleotide/character level, which blows up sequence length. BPE is the obvious LLM-style fix, but it doesn't behave well on DNA. We use deterministic 6-mer tokens (one token = 6 nucleotides): 6× shorter sequences and cheaper attention.
Training loss. With 6-mer tokens, cross-entropy scores a prediction that gets 5/6 nucleotides right the same as one that's completely wrong. This gets brittle late in training and produces loss spikes. We switch mid-training to a more flexible factorized loss (FNS).
Data. Genomes are mostly sparse, repetitive background. We curate down to a staged functional DNA + mRNA mixture, with every ratio chosen by ablation, like mixing a web corpus, but for biology.
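A minimal sketch of the tokenizer and training-loss points above, for readers who want to see them concretely. Everything here is illustrative and assumed rather than taken from the Carbon code: the vocabulary ordering, the handling of a trailing partial 6-mer, and the function names are hypothetical. In particular, `per_nucleotide_loss` is a generic per-position factorization used only to show the idea of partial credit; the thread does not define FNS, and this is not it.

```python
import math
from itertools import product

K = 6
ALPHABET = "ACGT"
# Fixed vocabulary: every possible 6-mer gets an id (4**6 = 4096 tokens).
# The lexicographic ordering is an assumption made for this sketch.
VOCAB = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}


def tokenize(seq: str) -> list[int]:
    """Split a DNA string into non-overlapping 6-mers and map them to ids.

    Assumptions: only A/C/G/T are handled (anything else raises KeyError),
    and a trailing fragment shorter than 6 nucleotides is dropped.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % K
    return [VOCAB[seq[i:i + K]] for i in range(0, usable, K)]


def token_cross_entropy(probs: dict[str, float], target: str) -> float:
    """Standard cross-entropy over 6-mer tokens: only p(target) matters."""
    return -math.log(probs[target])


def per_nucleotide_loss(probs: dict[str, float], target: str) -> float:
    """Illustrative factorized loss: sum of per-position negative log
    marginals of the correct nucleotide. NOT the FNS loss from the report."""
    total = 0.0
    for i, base in enumerate(target):
        marginal = sum(p for kmer, p in probs.items() if kmer[i] == base)
        total += -math.log(marginal)
    return total


target = "ACGTAC"
# Two toy predictive distributions with the same probability on the target.
near_miss = {"ACGTAA": 0.9, target: 0.1}  # top guess gets 5/6 nucleotides right
far_miss = {"GTACGT": 0.9, target: 0.1}   # top guess gets 0/6 nucleotides right

if __name__ == "__main__":
    print(len(tokenize("ACGTACGTACGTAC")), "tokens for 14 nucleotides")
    print("token CE:", token_cross_entropy(near_miss, target),
          token_cross_entropy(far_miss, target))
    print("per-nucleotide:", per_nucleotide_loss(near_miss, target),
          per_nucleotide_loss(far_miss, target))
```

Token-level cross-entropy gives both toy predictions the same loss, because it only looks at the probability assigned to the exact target 6-mer. The per-position version penalizes the far miss much more than the near miss, which is the kind of partial credit a factorized loss can provide.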
We're releasing the models, training data, training code, evaluation suite, and a demo to play with.
More details in the technical report: https://t.co/RMzFmTAhhT
Demo to play with the model, with a biology primer for our ML friends ;) https://t.co/IcOQq7GKF4
Super happy to have this one out. A clean organized up-to-date view of all the science resources (chemistry, biology, physics, materials, math) people have been sharing on the Hugging Face hub: datasets, blogs, models and more
AI for Science: this is the new frontier for AI and making progress here will impact all of humanity.
The new Hugging Science site is here to make sure it is open and accessible to every researcher!
Datasets, models, leaderboards, blogs, guides:
https://t.co/31Ryr8K2VE
What OpenMed contributes to https://t.co/dc2nOaoYLt:
→ 1,000+ clinical and medical NER models
→ PII detection in 9 languages
→ SuperClinical: #1 on the PII Masking leaderboard
→ Privacy-filter-nemotron (OpenAI base, retrained for medical)
Apache 2.0. On-prem deployable.
🤗🤗🤗introducing Hugging Science -- the home of AI for science 🤗🤗🤗
open models and datasets are the powerhouse of science (see the PDB), but finding the models and data you actually need for your breakthrough is hard af
you shouldn't need to scrape arxiv, own your own wetlab, fight a custom HDF5 parser, build a fusion stellarator, and beg for compute before you've trained a single epoch
so we're changing that
we've put all the best science on @huggingface in one place:
- 78GB of genomics data
- 11TB of PDE simulations
- 100M cell profiles
- 9T DNA base pairs
- 13M molecular trajectories
- 400k medical QA pairs
and much more, all open, and all ready for training (+ you can also now filter and search by domain, task, and keyword)
we've put together all the biggest releases from our partners at NASA, Google, OpenAI, Meta FAIR, Arc Institute, Ginkgo, SandboxAQ, Proxima Fusion, NVIDIA, Ai2, OpenADMET, InstaDeep, Future House, Polymathic AI, LeMaterial, Earth Species Project, Merck, and Eve Bio
if you're not sure where you fit in -- work on open challenges for problems that matter: including fusion stellarator design, ADMET, antibody developability, multilingual medicine, catalysis and materials, and scientific reasoning.
we're already changing how science gets done:
a fusion startup needed a benchmark for stellarator plasma confinement that didn't exist. @proximafusion shipped ConStellaration on Hugging Science: a leaderboard, dataset, and eval metrics, all in one place.
a drug discovery team wanted to predict hPXR induction. OpenADMET put up a blind challenge: 11,000+ compounds assayed at Octant, 513 held out, two tracks (pEC50 + structure). Anyone in the world can train and submit.
an antibody team at @Ginkgo released GDPa1, a developability dataset for stability, manufacturability, and immunogenicity prediction, with a live leaderboard scoring every submission.
if you know a problem the ML community should be working on, let us know. make a challenge! this is about putting all the tools for solving science in one place. so we can hillclimb!
→ https://t.co/T4l4r1lDz0
We taught a DNA model to learn its own tokenization.
It learned the genetic code with no supervision.
And it outperforms Evo 2's architecture with 3x faster inference.
Great work with Arnav (@arnavshah0), Victor (@victor_ljz), Parsa (@Radii2323), Brandon (@fluorane), Sukjun (@sukjun_hwang), Bo Wang (@BoWang87), Patrick Hsu (@pdhsu), Hani Goodarzi (@genophoria) and Albert Gu (@_albertgu) 🔥
We released Gemma 4 last week, and seeing the community's response has been amazing! 🚀
I was honored to lead the vision efforts, where we made huge performance leaps over Gemma 3, and I want to help you make the most of the new capabilities. Deep dive 🧵