New paper where we explore using a small LM’s perplexity to prune the pretraining data for larger LMs.
We find that small LMs can prune data for up to 30x larger LMs, data pruning works in the overtrained and data-constrained regimes, and more!
https://t.co/XYbI0Ijois
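A rough sketch of what perplexity-based pruning can look like. This is not the paper's actual pipeline. It assumes per-token natural-log probabilities from the small reference LM are already computed, and the function names, `keep_fraction`, and the `keep` options are hypothetical, chosen only for illustration.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean per-token log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def prune_by_perplexity(docs, logprobs_per_doc, keep_fraction=0.5, keep="middle"):
    """Keep a contiguous slice of documents after sorting by reference-LM perplexity.

    keep="low" keeps the lowest-perplexity docs, "high" the highest,
    and "middle" a window centred in the sorted order (all hypothetical options).
    """
    scored = sorted(
        zip(docs, (perplexity(lp) for lp in logprobs_per_doc)),
        key=lambda pair: pair[1],
    )
    n_keep = max(1, round(len(scored) * keep_fraction))
    if keep == "low":
        start = 0
    elif keep == "high":
        start = len(scored) - n_keep
    elif keep == "middle":
        start = (len(scored) - n_keep) // 2
    else:
        raise ValueError(f"unknown keep option: {keep}")
    return [doc for doc, _ in scored[start:start + n_keep]]


# Toy data: four documents with made-up per-token log-probabilities.
docs = ["a", "b", "c", "d"]
logprobs = [[-0.1, -0.1], [-1.0, -1.0], [-2.0, -2.0], [-3.0, -3.0]]
kept_middle = prune_by_perplexity(docs, logprobs, keep_fraction=0.5, keep="middle")
```

Which part of the perplexity distribution is worth keeping is an empirical question, so the sketch leaves it as a parameter rather than picking one.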
I've added support for Command-R to llama.cpp!
Command-R is an exciting new 35B model with 128k context length for RAG and Tool Use
I also converted the model to GGUF format (F16, Q8, Q4, Q2)
HF: https://t.co/SKqKUGM2kM
Release: https://t.co/EivPPhm4gm
@cohere @francoisfleuret
{UCSB|AI2|UW|Stanford|MIT|UofT|Vector|Contextual AI} present a survey on🔎Data Selection for LLMs🔍
Training data is a closely guarded secret in industry 🤫. With this work, we narrow the knowledge gap, advocating for open, responsible, collaborative progress.
https://t.co/vpRIXWFdCZ
Interesting trend in AI: the best results are increasingly obtained by compound systems, not monolithic models.
AlphaCode, ChatGPT+, Gemini are examples.
In this post, we discuss why this is and emerging research on designing & optimizing such systems.
https://t.co/tfnNuoTNNY
Today, I am very proud to share what we have been working on for the last 14 months. ✨
Introducing Aya -- a new state of the art for massively multilingual models. 🔥🎉
Thrilled to announce Aya 🌿, a massively multilingual instruction-tuned LLM, featuring 101 languages and the largest collection of multilingual instruction datasets. Over half of these languages are under-resourced. A monumental effort from @CohereForAI and Aya team 🚀
@hongjian_zou heya thanks! All models received the same number of training steps and used the same amount of compute regardless of the dataset pruning. If the dataset was pruned down to 50%, the model trained on that dataset saw each datapoint twice.
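To make that arithmetic concrete, here is a small sketch. It assumes the fixed training budget equals one pass over the unpruned dataset, which is an assumption for illustration only; the function and parameter names are hypothetical.

```python
def passes_over_data(examples_seen_budget, dataset_size, keep_fraction):
    """Number of times each kept datapoint is seen under a fixed training budget."""
    kept_examples = dataset_size * keep_fraction
    return examples_seen_budget / kept_examples


# Same budget in both runs; only the amount of pruning changes.
full_run = passes_over_data(1000, 1000, 1.0)    # no pruning
pruned_run = passes_over_data(1000, 1000, 0.5)  # dataset pruned down to 50%
```

Under that assumption the unpruned run sees each datapoint once and the 50% run sees each kept datapoint twice.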
LLMs improved using available data from the noisy Internet.
@CohereForAI researchers achieved unexpected results by pruning data.
Their research suggests removing most pretraining data while maintaining performance!
In 2022, we launched the Cohere For AI Scholars Program to help close the gap between research experience and opportunity. In our inaugural year, we welcomed 6 talented researchers - @luizapzbn, @lekeonilude, @maxdoesresearch, @aahmadian_, @tedzadouri and Meriem Boubdir.
📢New Pretraining Paper 📢
Delighted to share our new paper coming out of @forai_ml: "When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale"
Paper: https://t.co/VwtiDGpRek
w/ @ahmetustun89 @luizapzbn @W4ngatang @mziizm @sarahookr
Really proud of our work led by @maxdoesresearch w/ @ahmetustun89 @luizapzbn @W4ngatang @mziizm 🎉
LM datasets are huge. Is all text needed? How can we measure data quality in this setting? Enter data pruning: removing the least valuable subsets while preserving performance.
Your intuitions on the easy/hard data are in line with what we found - very easy data was often user agreements or text that would appear all over the internet, like at the bottom of a webpage. The harder subset is more complicated - some of it was nonsense, but some text, like medical or scientific text, can have high perplexity but could still be useful for certain contexts. Selecting a good validation set would, ironically, be an excellent extension of this line of work 😂
@EIFY @forai_ml @ahmetustun89 @luizapzbn @W4ngatang @mziizm @sarahookr Our EL2N experiments are a version of this, in that we use the same parameter/arch setup and use signals from those models as our pruning metric. The setup you mention is possible but was more complicated engineering-wise for us. You would need to do some gradient updates, as...