@emollick I am with @ylecun here, if I understand him correctly. We need a paradigmatic shift in terms of architecture: current LLM architectures are doomed to commit errors, no easy way out - no matter how much we optimize pipelines.
Good Quality Data, Not Compute, Is LLM Gold
The GPU shortage has eased somewhat, and a number of companies, including Amazon, Google and Microsoft, are trying to compete with Nvidia with their own LLM-friendly chips.
We are also seeing a big trend towards smaller models being as performant as large models. Mistral's 7B outperforms Llama-2 13B, and Llama-2 70B can be tuned to match the performance of GPT 3.5 (180B).
If I have to make a bet, I would say that an MoE (mixture of experts) architecture of instruction-tuned open-source models, i.e. an ensemble of models, each an "expert" at a particular type of task, can potentially achieve GPT-4-like performance. If each of the models is < 100B in size, they would also be accessible for research and mass-market use (i.e. servable by the GPU poor).
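To make the "ensemble of experts" idea concrete, here is a toy sketch of routing a prompt to one of several task-specific models. Everything in it is an assumption for illustration: the expert functions are stand-ins for real fine-tuned models, and the keyword scoring stands in for what would more plausibly be a learned router.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-ins for fine-tuned models (each < 100B in the bet above).
def code_expert(prompt: str) -> str:
    return f"[code-expert] {prompt}"

def math_expert(prompt: str) -> str:
    return f"[math-expert] {prompt}"

def general_expert(prompt: str) -> str:
    return f"[general-expert] {prompt}"

EXPERTS: Dict[str, Callable[[str], str]] = {
    "code": code_expert,
    "math": math_expert,
    "general": general_expert,
}

# Assumption: simple keyword matching instead of a learned routing model.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "code": ("python", "bug", "function"),
    "math": ("solve", "integral", "equation"),
}

def route(prompt: str) -> str:
    """Pick the expert whose keywords match the prompt most often."""
    text = prompt.lower()
    scores = {name: sum(kw in text for kw in kws) for name, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Fall back to the general model when no keyword matches at all.
    return best if scores[best] > 0 else "general"

def answer(prompt: str) -> str:
    """Send the prompt to exactly one expert, chosen by the router."""
    return EXPERTS[route(prompt)](prompt)

if __name__ == "__main__":
    print(answer("Fix this Python bug"))
    print(answer("Tell me a story"))
```

Only one expert runs per request, which is what would keep serving costs close to those of a single small model.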
Serving super large models is not only expensive but also cumbersome. Given that < 100B models such as Llama-2 have shown promise, I suspect the GPU crunch will soon be over, both because multiple players are coming into the market and because models are becoming more efficient.
The next thing to consider is data: how much more performance can we get out of these LLMs by expanding their training datasets?
Can OpenAI go into an infinite loop and successively train more and more powerful and performant models?
LLMs plateau in performance once they've "used up" the information in their training set. This fundamentally means that we can run out of "training data".
In traditional machine learning, this is reflected in the fact that model performance doesn't improve even if you train with larger and larger datasets, as long as the sample you are training on is robust and reflective of the underlying data distribution.
Some believe that we have already saturated the available training data. This means that GPT 5, 6 and 7 will look more like GPT 4, unless we develop some new techniques.
Having said that, there is a lot of data being created constantly by humans and now LLMs :). So in some sense we may never run out of new data for LLM training.
Assuming we will continue to have new data to train LLMs, do we need bigger and bigger LLMs as we generate more and more training data?
Will the LLM's reasoning skills improve because it's being trained on bigger and bigger datasets, or is it just that its knowledge base improves?
The Chinchilla Training Law, proposed by DeepMind (part of Google), addresses LLM size vs. data and challenges traditional scaling laws by advocating a ~20:1 ratio of training tokens to parameters, instead of the conventional ~1:1 ratio, optimizing the balance between model size, training data, and compute budget.
This law posits that many models were "massively oversized and massively undertrained," suggesting a pathway towards more efficiently trained, cost-effective Large Language Models.
They recommend that an LLM with 70 billion parameters be trained on 1,400 billion (1.4 trillion) tokens to achieve data-optimal training. For example, Llama-2 70B was trained on 2T tokens!
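The arithmetic behind these numbers is easy to check. The sketch below assumes a flat rule of thumb of about 20 tokens per parameter, which is simply the ratio implied by the 70B and 1.4T figures above; the function names are mine, not from the paper.

```python
# Assumption: a flat ~20 tokens-per-parameter rule of thumb,
# i.e. the ratio implied by 70B parameters -> 1.4T tokens.
TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Rough training-token budget for a model with n_params parameters."""
    return n_params * ratio

def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """How many training tokens a model actually saw per parameter."""
    return n_tokens / n_params

if __name__ == "__main__":
    # 70B parameters -> 1.4T tokens under the rule of thumb.
    print(f"{chinchilla_optimal_tokens(70e9):.2e} tokens")
    # Llama-2 70B on 2T tokens sits above the 20:1 mark.
    print(f"{tokens_per_param(70e9, 2e12):.1f} tokens per parameter")
```

By this yardstick Llama-2 70B, at roughly 28.6 tokens per parameter, was trained on more data than the rule of thumb asks for, not less.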
So more data doesn't necessarily mean bigger LLMs.
The next question is around data quality. Again, as in standard-issue ML, focusing on data quality over quantity can vastly improve LLM performance. Gunasekar et al. show this in their paper "Textbooks Are All You Need": the team trained a small model of just 1.3B parameters but used high-quality data consisting of filtered code, textbooks, and GPT-3.5-generated data.
High-quality data dramatically improves the learning efficiency of language models for code, as it provides clear, self-contained and instructive examples.
Fine-tuning LLMs yields excellent results, but only with high-quality domain-specific data. The majority of the effort is spent on curating the data and continually refining it based on the performance of the LLM.
Data matters, but high-quality data matters more. When it comes to specific enterprise AI tasks, an open-source model fine-tuned on a high-quality dataset can perform on par with GPT-4.
So the real challenge is high-quality data. Enterprises have a lot of high-quality training data for specific uses, for example a Q/A model on their knowledge base.
For example, by answering customer queries, their customer support team has basically curated a large supervised training dataset.
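As a rough illustration of how support history turns into supervised data, the sketch below converts resolved tickets into prompt/completion pairs and serializes them as JSON Lines. The ticket fields (`question`, `answer`, `resolved`) and the output keys are hypothetical; a real helpdesk export and a real fine-tuning toolkit will each have their own schema.

```python
import json
from typing import Dict, List

def tickets_to_examples(tickets: List[Dict]) -> List[Dict[str, str]]:
    """Keep only resolved tickets that have both a question and an answer."""
    examples = []
    for ticket in tickets:
        question = ticket.get("question", "").strip()
        answer = ticket.get("answer", "").strip()
        # Quality filter: unresolved or incomplete tickets are dropped.
        if not question or not answer or not ticket.get("resolved", False):
            continue
        examples.append({"prompt": question, "completion": answer})
    return examples

def to_jsonl(examples: List[Dict[str, str]]) -> str:
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)

if __name__ == "__main__":
    sample_tickets = [
        {"question": "How do I reset my password?",
         "answer": "Use the 'Forgot password' link on the login page.",
         "resolved": True},
        {"question": "Why was I charged twice?", "answer": "", "resolved": False},
    ]
    print(to_jsonl(tickets_to_examples(sample_tickets)))
```

Most of the real work sits in that filter step: deciding which answers are good enough to teach a model is the curation effort described above.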
Another source of high-quality data is human preference data. Humans can review LLM responses and provide feedback to the LLM. This can be used to further refine a model. Consumer services like ChatGPT are collecting a lot of human feedback through their chat interfaces.
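One common shape for this kind of feedback is a set of (prompt, chosen, rejected) triples. The sketch below builds such pairs from thumbs-up / thumbs-down votes; the record fields are assumptions for illustration and do not describe how any particular service stores its feedback.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List

def build_preference_pairs(feedback: List[Dict]) -> List[Dict[str, str]]:
    """Pair every liked response with every disliked response for the same prompt."""
    by_prompt = defaultdict(lambda: {"up": [], "down": []})
    for row in feedback:
        bucket = "up" if row["thumbs_up"] else "down"
        by_prompt[row["prompt"]][bucket].append(row["response"])

    pairs = []
    for prompt, votes in by_prompt.items():
        # Prompts with only likes or only dislikes yield no pairs.
        for chosen, rejected in product(votes["up"], votes["down"]):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

if __name__ == "__main__":
    sample_feedback = [
        {"prompt": "Summarize this email", "response": "Short, accurate summary", "thumbs_up": True},
        {"prompt": "Summarize this email", "response": "Rambling summary", "thumbs_up": False},
    ]
    for pair in build_preference_pairs(sample_feedback):
        print(pair)
```

Pairs like these are the usual input for preference-based refinement, where the model is pushed towards the chosen response and away from the rejected one.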
High-quality data can in turn be generated by LLMs. "Self-instruct" is a method where you generate training data for an LLM from the LLM itself. Alternatively, you can use one LLM (say GPT-4) to generate a high-quality training dataset to fine-tune another LLM.
Two instruction-tuned LLaMA models were compared, one fine-tuned on data generated by GPT-4 and the other on data generated by GPT-3. The model fine-tuned with GPT-4-generated data performed substantially better on the "Helpfulness" criterion, showcasing the utility of high-quality data generated by LLMs for fine-tuning.
High-quality small datasets can be very useful for fine-tunes and you can also employ traditional human labelling techniques to generate those datasets.
In summary, you can fine-tune a small LLM (< 100B) for a particular task if you have a high-quality dataset.
Good general-purpose LLMs, however, are a different story and may still require a lot of data and training. However, it's not clear whether just throwing more data at the problem will improve LLM reasoning dramatically.
It's also very likely that auto-regressive LLMs, while immensely useful for many business and consumer use cases, aren't going to get us to AGI simply by "brute-forcing it".
One more reason for the Doomers to simmer down and let builders build :)
What if Wes Anderson directed The Lord of the Rings? We asked the community which video they want to see next and Lord of the Rings took the cake… or should we say Elven bread. We hope you enjoy this Midjourney to Middle-Earth.
#LordOfTheRings #WesAnderson #MovieTrailer #LOTR
I am very proud to share the excellent work of my talented PhD student @Lito_MathBio_ about a theory-driven treatment vs Staphylococcus chronic infections. Here, the proposed treatment is supported and validated by murine experiments. https://t.co/nlynqf3TUZ
🪐 Introducing Galactica. A large language model for science.
Can summarize academic literature, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more.
Explore and get weights: https://t.co/jKEP8S7Yfl