Pietro Mascheroni

@Pi_Mas

Post Doc @ gBDS, Boehringer Ingelheim | Predictive modeling | Machine Learning | AI in Medicine. Views are my own.

Post Doc, Boehringer Ingelheim

Joined March 2012

402 Following

109 Followers

122 Posts

Pietro Mascheroni @Pi_Mas

almost 2 years ago

@PRobertImmodels you also survived the winter in Braunschweig! This is not a small thing :D

Pi_Mas retweeted

Mistral AI

@MistralAI

about 2 years ago

magnet:?xt=urn:btih:9238b09245d0d8cd915be09927769d5f7584c1c9&dn=mixtral-8x22b&tr=udp%3A%2F%https://t.co/2UepcMGLGd%3A1337%2Fannounce&tr=http%3A%2F%https://t.co/OdtBUsbeV5%3A1337%2Fannounce

Pietro Mascheroni @Pi_Mas

over 2 years ago

@emollick Well, we are just using language models as reasoning engines... what could go wrong? 😅

Pietro Mascheroni @Pi_Mas

over 2 years ago

@emollick I am with @ylecun here, if I understand him correctly. We need a paradigm shift in terms of architecture: current LLM architectures are doomed to make errors, with no easy way out, no matter how much we optimize pipelines.

Pi_Mas retweeted

Bindu Reddy

@bindureddy

over 2 years ago

Good Quality Data, Not Compute, Is LLM Gold

The GPU shortage has somewhat eased up, and a number of companies including Amazon, Google and Microsoft are trying to compete with Nvidia with their own LLM-friendly chips.

We are also seeing a big trend towards smaller models being as performant as large models. Mistral's 7B outperforms Llama-2 13B, and Llama-2 70B can be tuned to match the performance of GPT 3.5 (180B).

If I have to make a bet, I would say a MoE (mixture of experts) architecture of instruction-tuned open-source models, where you have an ensemble of models, each an "expert" at a particular type of task, can potentially achieve GPT-4-like performance. If each of the models is < 100B in size, then they would also be accessible for research and mass market use (i.e. serve-able by the GPU poor).
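The ensemble idea above can be sketched as a router that sends each prompt to a task-specific model. This is only a toy illustration under stated assumptions: the keyword table, the `make_expert` stand-ins, and the `route`/`answer` helpers are all hypothetical, and a real system would likely use a learned classifier and actual fine-tuned models. It also shows model-level routing as described in the tweet, not the per-token routing inside a sparse MoE layer.

```python
from typing import Callable, Dict, Tuple

Expert = Callable[[str], str]


def make_expert(name: str) -> Expert:
    """Return a stand-in 'expert'; a real system would call a fine-tuned model here."""
    def expert(prompt: str) -> str:
        return f"[{name}] {prompt}"
    return expert


# Hypothetical task keywords (an assumption for illustration only).
ROUTES: Dict[str, Tuple[str, ...]] = {
    "code": ("python", "function", "bug"),
    "math": ("integral", "solve", "equation"),
}

# One expert per task, plus a general-purpose fallback.
EXPERTS: Dict[str, Expert] = {
    name: make_expert(name) for name in (*ROUTES, "general")
}


def route(prompt: str) -> str:
    """Pick the expert whose keywords match the prompt most; fall back to 'general'."""
    text = prompt.lower()
    scores = {name: sum(kw in text for kw in kws) for name, kws in ROUTES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"


def answer(prompt: str) -> str:
    """Dispatch the prompt to the selected expert and return its reply."""
    return EXPERTS[route(prompt)](prompt)
```

Only the selected expert runs for a given prompt, which is why each model can stay small enough to serve on modest hardware.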

Serving super large models is not only expensive but also cumbersome. Given that < 100B models such as Llama-2 have shown promise, I suspect the GPU crunch will soon be over, both because of multiple players coming into the market and because of more efficient models.

The next thing to consider is data: how much more performance can we get out of these LLMs by expanding their training datasets?

Can OpenAI go into an infinite loop and successively train more and more powerful and performant models?

LLMs plateau in performance once they've "used up" the information in their training set. This fundamentally means that we can run out of "training data".

In traditional machine learning, this is reflected in the fact that model performance doesn't improve even if you train with larger and larger datasets, as long as the sample you are training on is robust and reflective of the underlying data distribution.

Some believe that we have already saturated the available training data. This means that GPT 5, 6 and 7 will look more like GPT 4, unless we develop some new techniques.

Having said that, there is a lot of data being created constantly by humans and now LLMs :). So in some sense we may never run out of new data for LLM training.

Assuming we will continue to have new data to train LLMs, do we need bigger and bigger LLMs as we generate more and more training data?

Will an LLM's reasoning skills improve because it's being trained on bigger and bigger datasets, or is it just that its knowledge base improves?

The Chinchilla Training Law, proposed by Google's DeepMind, addresses LLM size vs. data and challenges traditional scaling laws by advocating for a ~20:1 ratio of training tokens to parameters, instead of the conventional ~1:1 ratio, optimizing the balance between model size, training data, and compute budget.

This law posits that many models were "massively oversized and massively undertrained," suggesting a pathway towards more efficiently trained, cost-effective Large Language Models.

The authors recommend that LLMs with 70 billion parameters be trained on 1,400 billion (1.4 trillion) tokens to achieve data-optimal training. For comparison, Llama-2 70B was trained on 2T tokens!
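As a rough worked example, the tokens-per-parameter ratio can be derived directly from the figures quoted above (70B parameters, 1.4T tokens) and then applied to other model sizes. The function names and the 7B example are assumptions for illustration; treating the ratio as a fixed constant is a rule-of-thumb simplification, not the full scaling-law fit.

```python
def tokens_per_parameter(tokens: float, params: float) -> float:
    """Ratio of training tokens to model parameters."""
    return tokens / params


# Figures quoted in the text: 70B parameters trained on 1.4T tokens.
CHINCHILLA_PARAMS = 70e9
CHINCHILLA_TOKENS = 1.4e12
CHINCHILLA_RATIO = tokens_per_parameter(CHINCHILLA_TOKENS, CHINCHILLA_PARAMS)


def data_optimal_tokens(params: float, ratio: float = CHINCHILLA_RATIO) -> float:
    """Rule-of-thumb token budget for a model with `params` parameters."""
    return params * ratio


# Llama-2 70B on 2T tokens, as quoted in the text.
LLAMA2_RATIO = tokens_per_parameter(2e12, 70e9)

if __name__ == "__main__":
    print(f"Chinchilla ratio: {CHINCHILLA_RATIO:.1f} tokens per parameter")
    print(f"Llama-2 70B ratio: {LLAMA2_RATIO:.1f} tokens per parameter")
    # Hypothetical 7B model, used only to show the scaling.
    print(f"Budget for a 7B model: {data_optimal_tokens(7e9):.2e} tokens")
```

By this measure Llama-2 70B sits above the rule-of-thumb budget for its size, which is consistent with the point that more data does not have to mean a bigger model.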

So more data doesn't necessarily mean bigger LLMs.

The next question is around data quality. Again, as in standard ML, focusing on data quality over quantity can vastly improve LLM performance. Gunasekar et al. show this in their paper "Textbooks Are All You Need". The team trained a small model of just 1.3B parameters but used high-quality data consisting of filtered code, textbooks, and GPT-3.5-generated data.

High-quality data dramatically improves the learning efficiency of language models for code, as it provides clear, self-contained and instructive examples.

Fine-tuning LLMs yields excellent results, but only with high-quality, domain-specific data. The majority of the effort is spent on curating the data and continually refining it based on the performance of the LLM.

Data matters, but high quality data matters more. When it comes to specific enterprise AI tasks, an open-source model fine-tuned on a high quality dataset has equivalent performance to GPT-4.

So the real challenge is high quality data. Enterprises have a lot of high quality training data for specific uses - example: a Q/A model on their knowledgebase.

For example, their customer support team, by answering customer queries, has basically curated a large supervised training dataset.

Another source of high-quality data is human preference data. Humans can review LLM responses and provide feedback to the LLM. This can be used to further refine a model. Consumer services like ChatGPT are collecting a lot of human feedback through their chat interfaces.

High-quality data can in turn be generated by LLMs. "Self-instruct" is a method where you generate training data for an LLM from the LLM itself. Alternatively, you can use one LLM (say GPT-4) to generate a high-quality training dataset to fine-tune another LLM.

Two instruction-tuned LLaMA models were compared, one fine-tuned on data generated by GPT-4 and the other on data generated by GPT-3. The model fine-tuned on GPT-4-generated data performed substantially better on the "Helpfulness" criterion, showcasing the utility of high-quality LLM-generated data for fine-tuning.

High-quality small datasets can be very useful for fine-tunes and you can also employ traditional human labelling techniques to generate those datasets.

In summary, you can fine-tune a small LLM (< 100B) for a particular task if you have a high-quality dataset.

Good general-purpose LLMs, however, are a different story and may still require a lot of data and training. However, it's not clear if just throwing more data at the problem will improve LLM reasoning dramatically.

It's also very likely that auto-regressive LLMs, while immensely useful for many business and consumer use-cases, aren't going to get us to AGI simply by "brute-forcing it".

One more reason for the Doomers to simmer down and let builders build :)

Pietro Mascheroni @Pi_Mas

almost 3 years ago

@SpencrGreenberg As an AI language model I don't find anything strange in this application letter.

Pi_Mas retweeted

Morgan Delarue Research @DelarueResearch

almost 3 years ago

This offer is still on! Thanks for sharing! @ERC_Research

Pi_Mas retweeted

Curious Refuge

@CuriousRefuge

about 3 years ago

What if Wes Anderson directed The Lord of the Rings? We asked the community which video they want to see next and Lord of the Rings took the cake… or should we say Elven bread. We hope you enjoy this Midjourney to Middle-Earth. #LordOfTheRings #WesAnderson #MovieTrailer #LOTR

Pi_Mas retweeted

Terrible Maps

@TerribleMaps

about 3 years ago

Pietro Mascheroni @Pi_Mas

about 3 years ago

@arnabbarua10 @Entropy_MDPI @M3sBiomath I love this! Thanks Arnab!

Pietro Mascheroni @Pi_Mas

about 3 years ago

@Rainmaker1973 It reminds me of this :D

Pi_Mas retweeted

M3s (B. Hatzikirou) @M3sBiomath

over 3 years ago

I am very proud to share the excellent work of my talented PhD student @Lito_MathBio_ on a theory-driven treatment for chronic Staphylococcus infections. Here, the proposed treatment is supported and validated by murine experiments. https://t.co/nlynqf3TUZ

Pi_Mas retweeted

Papers with Code

@paperswithcode

over 3 years ago

🪐 Introducing Galactica. A large language model for science. Can summarize academic literature, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more. Explore and get weights: https://t.co/jKEP8S7Yfl