Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization
1. This study explores the effectiveness of subword tokenization methods, namely Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, in representing protein sequences, highlighting how their linguistic origins limit their performance in biological contexts.
2. Key findings show that vocabulary size significantly impacts tokenization performance. Smaller vocabularies ensure more shared tokens and better domain boundary preservation, while larger vocabularies lead to greater divergence and less effective segmentation.
3. BPE demonstrates superior contextual specialization and marginally better domain boundary preservation in smaller vocabularies, whereas SentencePiece excels in encoding efficiency, achieving lower fertility scores.
4. The analysis of linguistic laws reveals partial adherence to Zipf’s and Brevity laws, but substantial deviations from Menzerath’s law, suggesting protein sequences might follow unique organizational principles distinct from natural languages.
5. All tokenization methods struggle to maintain protein domain integrity, especially with increasing vocabulary sizes, underscoring the need for protein-specific tokenization approaches that respect biological sequence structures.
6. The results advocate for the development of specialized tokenizers tailored to the complexities of protein sequences, moving beyond adaptations of natural language tokenization strategies.
@enestaylan @suyunuCS
💻Code: https://t.co/vosafzIIXa
📜Paper: https://t.co/U6Bpj5EQJI
#Bioinformatics #ProteinSequence #NaturalLanguageProcessing #LinguisticLaws #SubwordTokenization
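For readers who want to see what two of the quantities in this thread measure, here is a minimal sketch in Python. It assumes fertility means the average number of tokens per sequence (the paper may normalize differently) and estimates a Zipf exponent as the least-squares slope of log frequency against log rank. The function names and the toy tokenizations are hypothetical and are not taken from the linked code.

```python
import math
from collections import Counter


def fertility(tokenized):
    """Average number of tokens per sequence (assumed definition)."""
    return sum(len(seq) for seq in tokenized) / len(tokenized)


def zipf_slope(tokenized):
    """Least-squares slope of log(frequency) versus log(rank).

    A slope near -1 is the classic Zipfian pattern.
    """
    counts = Counter(tok for seq in tokenized for tok in seq)
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Hypothetical tokenizations of three short protein fragments.
toy = [["MK", "T", "AY"], ["MK", "LV"], ["MK", "T", "LV", "AY", "G"]]
print(f"fertility: {fertility(toy):.2f}")
print(f"zipf slope: {zipf_slope(toy):.2f}")
```

Under this definition, lower fertility means the tokenizer encodes each sequence with fewer tokens, which is the sense in which point 3 calls SentencePiece more efficient.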
We have trained ESM3 and we're excited to introduce EvolutionaryScale.
ESM3 is a generative language model for programming biology. In experiments, we found ESM3 can simulate 500M years of evolution to generate new fluorescent proteins.
Read more: https://t.co/iAC3lkj0iV
Yep it is just curve fitting and that’s the beauty of it. With just curve fitting in next token prediction, amazing capabilities emerge. Search engine analogy is inherently wrong.
The idea that you can build general cognitive abilities by fitting a curve on "everything there is to know" is akin to building a search engine by listing "every query anyone might ever make". The world changes every day. The point of intelligence is to adapt to that change.
With many 🧩 dropping recently, a more complete picture is emerging of LLMs not as a chatbot, but the kernel process of a new Operating System. E.g. today it orchestrates:
- Input & Output across modalities (text, audio, vision)
- Code interpreter, ability to write & run programs
- Browser / internet access
- Embeddings database for files and internal memory storage & retrieval
A lot of computing concepts carry over. Currently we have single-threaded execution running at ~10Hz (tok/s) and enjoy looking at the assembly-level execution traces stream by. Concepts from computer security carry over, with attacks, defenses and emerging vulnerabilities.
I also like the nearest neighbor analogy of "Operating System" because the industry is starting to shape up similar:
Windows, OS X, and Linux <-> GPT, PaLM, Claude, and Llama/Mistral(?:)).
An OS comes with default apps but has an app store.
Most apps can be adapted to multiple platforms.
TLDR looking at LLMs as chatbots is the same as looking at early computers as calculators. We're seeing an emergence of a whole new computing paradigm, and it is very early.
Around 5 years ago we were very proud of these state of the art results in image generation, trained on 32x32 "images" of CIFAR-10. You can kind of make out little wheel shapes, car/plane parts, and organic structures and textures. Pretty cool right
We will host a pre-conference debate on Friday, March 24th on the question: "Do Language Models Need Sensory Grounding for Meaning and Understanding?"
The debate will feature @Jake_Browning00, @davidchalmers42, @LakeBrenden, @ylecun, @glupyan & Ellie Pavlick (@BrownCSDept).
@jerome_massot @chelseabfinn I think no. It just shows that the average person’s writing skills are worse than an LLM’s, in the LLM’s view of language. That’s also why they can approximate the direction of derivatives with another LLM.
Here's my conversation with John Carmack (@ID_AA_Carmack), legendary programmer & engineer. At over 5 hours, this is officially the longest conversation I've had on the podcast, and we can talk many more times. This was really fun and a huge honor for me. https://t.co/RuWzACGeGm
In a paper published today in @PNASNews researchers at FAIR find that biological structure and function emerge in representations of language models trained on massive databases of protein sequences. Read the paper: https://t.co/EBd0D2XF0O
“When your scope for action is greatest, the knowledge on which you can base this action is always at a minimum. When your knowledge is greatest, the scope for action has often disappeared.”
Henry Kissinger
The Bilkent International Relations Department has opened an account. To keep up with books, articles, conferences, and meetings, let's follow it here: @BilkentIRDept