So representation matters: amino-acid tokens discard chemistry that SMILES-based chemical language models can retain.
Thus: smaller domain models can work, but only if the input format preserves the molecular information the assay actually sees.
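
To make the representation point concrete, here is a minimal sketch (not taken from PepBERT or PeptideMTR) comparing two encodings of the same dipeptide. The residue table, `peptide_smiles`, and `amino_acid_tokens` are hypothetical helpers written for illustration; only Gly and Ala are covered, and the caps are limited to N-terminal acetylation and C-terminal amidation.

```python
# Hypothetical residue table: each entry is a backbone fragment written so
# that plain concatenation forms the peptide bonds (N ... C(=O) + N ...).
RESIDUE_SMILES = {
    "G": "NCC(=O)",           # glycine
    "A": "N[C@@H](C)C(=O)",   # L-alanine
}


def peptide_smiles(sequence, n_acetyl=False, c_amide=False):
    """Build a SMILES string for a linear peptide with optional terminal caps."""
    n_cap = "CC(=O)" if n_acetyl else ""   # acetyl group on the N-terminus
    c_cap = "N" if c_amide else "O"        # amide vs. free carboxylic acid
    backbone = "".join(RESIDUE_SMILES[residue] for residue in sequence)
    return n_cap + backbone + c_cap


def amino_acid_tokens(sequence, n_acetyl=False, c_amide=False):
    """Residue-level tokenization: the caps have no token, so they are dropped."""
    return list(sequence)


free = peptide_smiles("GA")
capped = peptide_smiles("GA", n_acetyl=True, c_amide=True)

print(free)    # NCC(=O)N[C@@H](C)C(=O)O
print(capped)  # CC(=O)NCC(=O)N[C@@H](C)C(=O)N

# Two different molecules, one identical residue-level input.
print(amino_acid_tokens("GA") == amino_acid_tokens("GA", n_acetyl=True, c_amide=True))  # True
print(free == capped)  # False
```

A model fed only `["G", "A"]` cannot tell the free peptide from the capped one, while the SMILES strings differ at both termini.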
More: https://t.co/cFbBBwcNXl
PepBERT is a useful counterpoint to the 'bigger model wins' narrative.
Peptides are not the open internet. They are short, structured biological sequences where the right training distribution matters more than brute scale.
And while PepBERT is a good sequence-model result, PeptideMTR gets closer to the real chemistry problem. Therapeutic peptides are often more than linear amino-acid strings: they can include N- and C-terminal modifications, cyclization, non-canonical residues, conjugates, and lipidation.