David Toussaint

@DavidToussaint7

Joined February 2015

149 Following

71 Followers

584 Posts

DavidToussaint7 retweeted

Rohan Paul

@rohanpaul_ai

almost 2 years ago

Wild idea in this paper 🤯

How might we store knowledge affordably yet comprehensively? Memory³ proposes an intriguing method - compressing factual data separately. Introduces a third form of memory in addition to the implicit knowledge stored in model parameters and the short-term working memory used during inference (context key-values).

👨‍🔧 LLMs struggle with inefficient knowledge storage and retrieval, leading to high training and inference costs. The paper aims to address this by introducing a more efficient memory format.

📌 Memory3 introduces explicit memory as a third memory format for LLMs, alongside model parameters (implicit memory) and context key-values (working memory). This explicit memory is implemented as sparse attention key-values, allowing for more efficient knowledge storage and retrieval.

📌 Defines a memory hierarchy for LLMs: plain text (RAG) → explicit memory → model parameters. As you move up this hierarchy, write cost increases while read cost decreases. The goal is to optimize knowledge placement across this hierarchy based on usage frequency.

📌 Memory3's architecture involves converting reference texts into explicit memories before inference. During inference, these memories are retrieved and integrated into self-attention layers. This design allows for smaller model size while maintaining performance.

📌 The explicit memory format uses intense compression to save space. It selects only the first half of attention layers as memory layers, uses grouped query attention to reduce key-value heads, and selects only 8 out of 128 tokens for each key-value head based on attention weights.

📌 The training process involves a two-stage approach: a warmup stage without explicit memory, followed by a continual train stage with explicit memory. This approach was necessary as starting with explicit memory from the beginning rendered the memories useless.

📌 Introduces a "memory circuitry theory" to formalize the concept of knowledge in LLMs. It defines knowledge as circuits (equivalence classes of subgraphs) in the computation graph, categorizing them as specific or abstract knowledge.

📌 The Memory3 model achieved better performance than larger models and RAG models on various benchmarks, while maintaining higher decoding speed. It showed particular improvements in factuality and reduced hallucination.
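The token-selection step described in the thread (keeping 8 out of 128 tokens per key-value head based on attention weights) can be illustrated with a small sketch. This is not the paper's implementation: the function names, the toy random data, and the idea that each token already has a single aggregated attention score are all assumptions made for illustration.

```python
import random

def select_top_tokens(attention_weights, keep=8):
    """Return the positions of the `keep` highest-weight tokens, in original order."""
    ranked = sorted(range(len(attention_weights)),
                    key=lambda i: attention_weights[i], reverse=True)
    return sorted(ranked[:keep])

def sparsify_head(keys, values, attention_weights, keep=8):
    """Keep only the key/value vectors of the selected tokens for one head."""
    kept = select_top_tokens(attention_weights, keep)
    return kept, [keys[i] for i in kept], [values[i] for i in kept]

# Toy data for a single key-value head (hypothetical sizes apart from 128 and 8).
random.seed(0)
num_tokens, head_dim = 128, 4
keys = [[random.random() for _ in range(head_dim)] for _ in range(num_tokens)]
values = [[random.random() for _ in range(head_dim)] for _ in range(num_tokens)]
weights = [random.random() for _ in range(num_tokens)]  # assumed per-token scores

positions, sparse_keys, sparse_values = sparsify_head(keys, values, weights)
print(len(sparse_keys), "of", num_tokens, "tokens kept")  # 8 of 128 tokens kept
```

Keeping 8 of 128 tokens shrinks the stored key-values for that head by a factor of 16, before counting the savings from using only half of the layers and fewer key-value heads.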

215

153K

DavidToussaint7 retweeted

Olivier Azeau @oaz

about 2 years ago

If you want to watch me stumble over my words live, it's tonight!

227

DavidToussaint7 retweeted

Yann LeCun

@ylecun

about 2 years ago

🥁 Llama3 is out 🥁
8B and 70B models available today.
8k context length.
Trained with 15 trillion tokens on a custom-built 24k GPU cluster.
Great performance on various benchmarks, with Llama3-8B doing better than Llama2-70B in some cases.
More versions are coming over the next few months.

https://t.co/EkU9aIHdZE

204

818

572K

DavidToussaint7 retweeted

MTG:Toulouse @mtg_toulouse

about 2 years ago

Save the date! Join us on Monday the 29th to talk UI, with a presentation of Avalonia UI by @oaz 🤩 Sign up on #meetup: https://t.co/L2MdfRqVFI #UI #OpenSource #dotnet #AvaloniaUI

355

David Toussaint @DavidToussaint7

over 2 years ago

@imihalcea and more A100s? 😅

DavidToussaint7 retweeted

MTG:Toulouse @mtg_toulouse

over 2 years ago

🚀 @imihalcea dives into the future of AI with us! 🤖 Will he be eclipsed as a speaker by a superintelligent AI? 🌟 Don't miss the live stream to solve the mystery! #IA #AGI 😜🔍 https://t.co/0QtD4U6Sud

281

DavidToussaint7 retweeted

Yann LeCun

@ylecun

over 2 years ago

Can AI think like a philosopher? Today, no. On that I agree with @Enthoven_R. But will it get there tomorrow? Very likely.

184

102K

DavidToussaint7 retweeted

Yann LeCun

@ylecun

over 2 years ago

* Language is low bandwidth: less than 12 bytes/second. A person can read 270 words/minute, or 4.5 words/second, which is 12 bytes/s (assuming 2 bytes per token and 0.75 words per token). A modern LLM is typically trained with 1x10^13 two-byte tokens, which is 2x10^13 bytes. This would take about 100,000 years for a person to read (at 12 hours a day).

* Vision is much higher bandwidth: about 20MB/s. Each of the two optical nerves has 1 million nerve fibers, each carrying about 10 bytes per second. A 4 year-old child has been awake a total 16,000 hours, which translates into 1x10^15 bytes.

In other words:
- The data bandwidth of visual perception is roughly 16 million times higher than the data bandwidth of written (or spoken) language.
- In a mere 4 years, a child has seen 50 times more data than the biggest LLMs trained on all the text publicly available on the internet.

This tells us three things:

1. Yes, text is redundant, and visual signals in the optical nerves are even more redundant (despite being 100x compressed versions of the photoreceptor outputs in the retina). But redundancy in data is *precisely* what we need for Self-Supervised Learning to capture the structure of the data. The more redundancy, the better for SSL.

2. Most of human knowledge (and almost all of animal knowledge) comes from our sensory experience of the physical world. Language is the icing on the cake. We need the cake to support the icing.

3. There is *absolutely no way in hell* we will ever reach human-level AI without getting machines to learn from high-bandwidth sensory inputs, such as vision.

Yes, humans can get smart without vision, even pretty smart without vision and audition. But not without touch. Touch is pretty high bandwidth, too.
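The arithmetic in this post can be reproduced from the figures it quotes. The sketch below uses only those numbers; the variable names and the 365-day year are assumptions. With the quoted figures, the vision-to-language bandwidth ratio comes out near 1.7 million rather than 16 million, and the child-to-LLM data ratio comes out near 58, which the post rounds to 50.

```python
# Reading bandwidth, from the post's figures.
words_per_second = 270 / 60                    # 270 words per minute
tokens_per_second = words_per_second / 0.75    # 0.75 words per token
reading_bytes_per_second = tokens_per_second * 2  # 2 bytes per token

# Time to read an LLM training set at 12 hours a day.
llm_bytes = 1e13 * 2                           # 1x10^13 two-byte tokens
reading_seconds = llm_bytes / reading_bytes_per_second
reading_years = reading_seconds / (12 * 3600) / 365  # assumes a 365-day year

# Visual bandwidth: two optic nerves, 1 million fibers each, 10 bytes/s per fiber.
vision_bytes_per_second = 2 * 1_000_000 * 10
child_bytes = 16_000 * 3600 * vision_bytes_per_second  # 16,000 waking hours

bandwidth_ratio = vision_bytes_per_second / reading_bytes_per_second
data_ratio = child_bytes / llm_bytes

print(f"reading: {reading_bytes_per_second:.0f} bytes/s")
print(f"time to read the training set: {reading_years:,.0f} years")
print(f"vision: {vision_bytes_per_second / 1e6:.0f} MB/s")
print(f"child's visual data: {child_bytes:.2e} bytes")
print(f"vision vs. reading bandwidth: {bandwidth_ratio:,.0f}x")
print(f"child vs. LLM data: {data_ratio:.1f}x")
```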

550

DavidToussaint7 retweeted

MTG:Toulouse @mtg_toulouse

over 2 years ago

🎉 The next meetup will take place on Tuesday, March 19, with two sessions: AI 🤖 and Monads 🥳! You can save the evening ✨ Details and registration coming very soon.

136

David Toussaint @DavidToussaint7

over 2 years ago

@JMDeruty @imihalcea This will let us navigate between respect for historical accuracy and the aspiration to a more inclusive and diverse representation, without compromising either one.

David Toussaint @DavidToussaint7

over 2 years ago

@JMDeruty @imihalcea It is therefore imperative to remain critical of the models and how they are used, while continuing to educate people about their potential risks and biases.

DavidToussaint7 retweeted

Yann LeCun

@ylecun

over 2 years ago

Like @AndrewYNg, I have observed a definite shift in the prevalent discourse about AI at Davos:
- Few people still talk about existential risk, and few people believe that current technology, even scaled up, will present an existential risk.
- Everyone agrees that open source AI platforms are a good thing for cultural and linguistic diversity, local sovereignty, education, science, and businesses.
- Everyone agrees that regulating AI-powered products can be useful in certain areas (health, transportation, etc).
- The debate is still on for whether AI research and development and open source AI platforms should be regulated.
- Many people are worried about a new flood of AI-powered political disinformation. Industry-wide standards for content authentication are needed.
- AI has become the most talked-about topic.

259

420

574K
