Enes TAYLAN @enestaylan - Twitter Profile

over 1 year ago

Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization

1. This study explores the effectiveness of subword tokenization methods, namely Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, in representing protein sequences, highlighting how their linguistic origins limit their performance in biological contexts.

2. Key findings show that vocabulary size significantly impacts tokenization performance. Smaller vocabularies ensure more shared tokens and better domain boundary preservation, while larger vocabularies lead to greater divergence and less effective segmentation.

3. BPE demonstrates superior contextual specialization and marginally better domain boundary preservation in smaller vocabularies, whereas SentencePiece excels in encoding efficiency, achieving lower fertility scores.

4. The analysis of linguistic laws reveals partial adherence to Zipf’s and Brevity laws, but substantial deviations from Menzerath’s law, suggesting protein sequences might follow unique organizational principles distinct from natural languages.

5. All tokenization methods struggle to maintain protein domain integrity, especially with increasing vocabulary sizes, underscoring the need for protein-specific tokenization approaches that respect biological sequence structures.

6. The results advocate for the development of specialized tokenizers tailored to the complexities of protein sequences, moving beyond adaptations of natural language tokenization strategies.

@enestaylan @suyunuCS
💻Code: https://t.co/vosafzIIXa
📜Paper: https://t.co/U6Bpj5EQJI

#Bioinformatics #ProteinSequence #NaturalLanguageProcessing #LinguisticLaws #SubwordTokenization
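To make the "fertility" and Zipf's law points in the thread concrete, here is a minimal Python sketch. It is not the paper's code: the greedy longest-match tokenizer is a stand-in for BPE, WordPiece, or SentencePiece, the sequences and vocabularies are hypothetical toy data, and fertility is assumed here to mean the average number of tokens per sequence, which may differ from the paper's exact definition.

```python
from collections import Counter


def greedy_tokenize(seq, vocab):
    """Greedy longest-match segmentation (a stand-in, not real BPE).

    Single residues are always allowed as a fallback token.
    """
    max_len = max(len(tok) for tok in vocab)
    tokens, i = [], 0
    while i < len(seq):
        for size in range(min(max_len, len(seq) - i), 0, -1):
            piece = seq[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def fertility(seqs, vocab):
    """Average tokens per sequence (assumed definition; lower is more compact)."""
    return sum(len(greedy_tokenize(s, vocab)) for s in seqs) / len(seqs)


def rank_frequency(seqs, vocab):
    """(rank, token, count) triples; Zipf-like data is roughly linear in log-log."""
    counts = Counter(t for s in seqs for t in greedy_tokenize(s, vocab))
    return [(rank, tok, n)
            for rank, (tok, n) in enumerate(counts.most_common(), start=1)]


# Hypothetical toy sequences and vocabularies, for illustration only.
SEQS = ["MKVLAAGMKV", "GMKVLA"]
SMALL_VOCAB = {"MK", "LA"}
LARGE_VOCAB = {"MK", "LA", "MKV", "MKVLA", "AG"}

if __name__ == "__main__":
    print("fertility (small vocab):", fertility(SEQS, SMALL_VOCAB))
    print("fertility (large vocab):", fertility(SEQS, LARGE_VOCAB))
    for rank, tok, n in rank_frequency(SEQS, SMALL_VOCAB):
        print(rank, tok, n)
```

On real data, the rank-frequency triples would typically be plotted on log-log axes to judge how closely the token distribution follows Zipf's law.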

BiologyAIDaily's tweet photo. Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization

enestaylan retweeted

Alex Rives

@alexrives

almost 2 years ago

We have trained ESM3 and we're excited to introduce EvolutionaryScale. ESM3 is a generative language model for programming biology. In experiments, we found ESM3 can simulate 500M years of evolution to generate new fluorescent proteins. Read more: https://t.co/iAC3lkj0iV

Enes TAYLAN @enestaylan

over 2 years ago

Yep, it is just curve fitting, and that’s the beauty of it. With just curve fitting in next-token prediction, amazing capabilities emerge. The search engine analogy is inherently wrong.

François Chollet

@fchollet

over 2 years ago

The idea that you can build general cognitive abilities by fitting a curve on "everything there is to know" is akin to building a search engine by listing "every query anyone might ever make". The world changes every day. The point of intelligence is to adapt to that change.

enestaylan retweeted

Andrej Karpathy

@karpathy

over 2 years ago

With many 🧩 dropping recently, a more complete picture is emerging of LLMs not as a chatbot, but the kernel process of a new Operating System. E.g. today it orchestrates:

- Input & Output across modalities (text, audio, vision)
- Code interpreter, ability to write & run programs
- Browser / internet access
- Embeddings database for files and internal memory storage & retrieval

A lot of computing concepts carry over. Currently we have single-threaded execution running at ~10Hz (tok/s) and enjoy looking at the assembly-level execution traces stream by. Concepts from computer security carry over, with attacks, defenses and emerging vulnerabilities.

I also like the nearest neighbor analogy of "Operating System" because the industry is starting to shape up similar:
Windows, OS X, and Linux <-> GPT, PaLM, Claude, and Llama/Mistral(?:)).
An OS comes with default apps but has an app store.
Most apps can be adapted to multiple platforms.

TLDR looking at LLMs as chatbots is the same as looking at early computers as calculators. We're seeing an emergence of a whole new computing paradigm, and it is very early.

enestaylan retweeted

Andrej Karpathy

@karpathy

about 3 years ago

Around 5 years ago we were very proud of these state of the art results in image generation, trained on 32x32 "images" of CIFAR-10. You can kind of make out little wheel shapes, car/plane parts, and organic structures and textures. Pretty cool right

karpathy's tweet photo: https://t.co/1mydX3tXGr

enestaylan retweeted

Raphaël Millière @raphaelmilliere

over 3 years ago

We will host a pre-conference debate on Friday, March 24th on the question: "Do Language Models Need Sensory Grounding for Meaning and Understanding?" The debate will feature @Jake_Browning00, @davidchalmers42, @LakeBrenden, @ylecun, @glupyan & Ellie Pavlick (@BrownCSDept).

Enes TAYLAN @enestaylan

over 3 years ago

@chelseabfinn @_eric_mitchell_ @yoonholeee @SashaKhazatsky @chrmanning Hello, maybe you could try DetectGPT on paper abstracts: https://t.co/pHK0J1ldwn https://t.co/Pnu2W6YSDR

Enes TAYLAN @enestaylan

over 3 years ago

@jerome_massot @chelseabfinn I think not. It just shows that the average person's writing skills are worse than an LLM's, in the LLMs’ own view of language. That’s also why they can approximate the direction of the derivatives with another LLM.

enestaylan retweeted

Lex Fridman

@lexfridman

almost 4 years ago

Here's my conversation with John Carmack (@ID_AA_Carmack), legendary programmer & engineer. At over 5 hours, this is officially the longest conversation I've had on the podcast, and we can talk many more times. This was really fun and a huge honor for me. https://t.co/RuWzACGeGm

lexfridman's tweet photo: https://t.co/ul4sDQvexZ

Enes TAYLAN @enestaylan

almost 4 years ago

https://t.co/spnAqBYdLU

enestaylan retweeted

Özgür Özdamar @OzgurOzdamar

about 4 years ago

We look forward to seeing you at the event organized by our department's faculty. I think it's guaranteed not to be a boring event 😉

Enes TAYLAN @enestaylan

over 4 years ago

2019 https://t.co/UFhgF2LkOL

enestaylan retweeted

AI at Meta

@AIatMeta

about 5 years ago

In a paper published today in @PNASNews researchers at FAIR find that biological structure and function emerge in representations of language models trained on massive databases of protein sequences. Read the paper: https://t.co/EBd0D2XF0O

AIatMeta's tweet photo: https://t.co/E02HKCfkQR

enestaylan retweeted

Yann LeCun

@ylecun

over 5 years ago

Language is an imperfect, incomplete, and low-bandwidth serialization protocol for the internal data structures we call thoughts.

Enes TAYLAN @enestaylan

over 5 years ago

“When your scope for action is greatest, the knowledge on which you can base this action is always at a minimum. When your knowledge is greatest, the scope for action has often disappeared.” Henry Kissinger

enestaylan retweeted

Özgür Özdamar @OzgurOzdamar

almost 6 years ago

The Bilkent Department of International Relations has opened an account. To stay informed about books, articles, conferences, and meetings, let's follow it here: @BilkentIRDept
