Most #generativeAI models were trained on Common Crawl, a massive archive of web crawl data. Yet most people have never heard of it. My new research studies Common Crawl in depth and highlights its influence on LLM research and development https://t.co/mHO37yPT7l (1/10)
Excellent report from @tweetbaack @mozilla on Common Crawl, used to train many LLMs.
Throwaway line for news publishers to ponder: "We will focus on the main crawl because the news crawl is rarely used by AI builders to train their LLMs (only once in our sample of 47 [models])."
Long term, there should be less reliance on sources like Common Crawl and a bigger emphasis on training generative AI on datasets created and curated by people in equitable and transparent ways (10/10)
A key issue is that filtered Common Crawl versions are not updated after their original publication to take feedback and criticism into account. We need dedicated intermediaries that filter Common Crawl in transparent and accountable ways that are continuously updated (9/10)