Dr. Stefan Baack | @tootbaack@infosec.exchange @tweetbaack - Twitter Profile

Pinned Tweet

Dr. Stefan Baack | @tootbaack@infosec.exchange @tweetbaack

over 2 years ago

Most #generativeAI models were trained on Common Crawl, a massive archive of web crawl data. Yet most people never heard of it. My new research studies Common Crawl in-depth and highlights its influence on LLM research and development https://t.co/mHO37yPT7l (1/10)

tweetbaack retweeted

Rasmus Kleis Nielsen @rasmus_kleis

over 2 years ago

Excellent report from @tweetbaack @mozilla on Common Crawl, used to train many LLMs. Throwaway line for news publishers to ponder: "We will focus on the main crawl because the news crawl is rarely used by AI builders to train their LLMs (only once in our sample of 47 [models])."

tweetbaack retweeted

emily bell @emilybell

over 2 years ago

Really useful paper describing the use, effects and limitations of Common Crawl as a building block for LLMs

tweetbaack retweeted

MMitchell @mmitchell_ai

over 2 years ago

Common Crawl data is likely used in most large language models (AI), as far as we know. This is *crucial* work.

Dr. Stefan Baack | @tootbaack@infosec.exchange @tweetbaack

over 2 years ago

Long term, there should be less reliance on sources like Common Crawl and a bigger emphasis on training generative AI on datasets created and curated by people in equitable and transparent ways (10/10)

Dr. Stefan Baack | @tootbaack@infosec.exchange @tweetbaack

over 2 years ago

A key issue is that filtered Common Crawl versions are not updated after their original publication to take feedback and criticism into account. We need dedicated intermediaries that filter Common Crawl in transparent and accountable ways that are continuously updated (9/10)
