Thom Vaughan @thomvaughan - Twitter Profile

Thom Vaughan @thomvaughan

over 1 year ago

@Markbuschn Throwing this into the mix why not https://t.co/6Tv6RQXuLI

0

3

0

1

520

thomvaughan retweeted

Common Crawl Foundation

@CommonCrawl

over 1 year ago

https://t.co/K9Bb5un0Sw Reflections on Recent Talks at the Turing Institute and UCL: Thom Vaughan and Pedro Ortiz Suarez discussed the power of Common Crawl’s open web data in driving research and innovation during two notable presentations last week.

0

4

2

0

894

thomvaughan retweeted

The Wayback Machine

@waybackmachine

over 1 year ago

@AliwolfFurry @internetarchive @brewster_kahle We don't archive the archive... runs the risk of causing a rift in the space-time continuum.

5

248

18

9

4K

thomvaughan retweeted

Quentin Lhoest 🤗 @lhoestq

over 1 year ago

My new app is out !! ✨The Common Crawl Pipeline Creator ✨ Create your pipeline easily: ✔Run Text Extraction✂️ ✔Define Language Filters🌐 ✔Customize text quality💯 ✔See Live Results👀 ✔Get Python code 🐍 Based on famous LLM research like Gopher, C4 or FineWeb

5

105

22

73

15K

thomvaughan retweeted

Marta Villegas @MartaVillegasM

over 1 year ago

We are launching Salamandra 2B & 7B multilingual LLMs trained at @BSC_CNS from scratch with nearly 8 trillion tokens in 35 EU languages+code. Spanish languages have been carefully curated, with Romance languages comprising >30% of the training dataset. https://t.co/g6WEWvVmSs 👇

7

104

45

20

7K

thomvaughan retweeted

Quentin Lhoest 🤗 @lhoestq

almost 2 years ago

The @CommonCrawl team just released all their statistics publicly ! And across all dumps, including the more recent one 🙌 Now is the time to finally analyze what's inside the source of most pre-training datasets out there 👀 1/n

https://t.co/XGiv8p4g3i

2

120

22

58

14K

thomvaughan retweeted

Common Crawl Foundation

@CommonCrawl

about 2 years ago

https://t.co/zK1g04ps1h Our 100th crawl!!

0

17

6

3

4K

thomvaughan retweeted

Common Crawl Foundation

@CommonCrawl

about 2 years ago

https://t.co/TPvsvEDsvr

0

7

3

1

2K

thomvaughan retweeted

Common Crawl Foundation

@CommonCrawl

over 2 years ago

https://t.co/3P1iRXHSoT

0

7

2

916

thomvaughan retweeted

Pedro Ortiz Suarez @pjox13

over 2 years ago

The February/March 2024 main crawl is out! 🥳🚀📚 It was an amazing learning experience to do this bimonthly crawl with my colleague @thomvaughan. Please do let us know if you have any feedback! Happy data crunching! 🧑‍💻

0

9

1

0

416

thomvaughan retweeted

Common Crawl Foundation

@CommonCrawl

over 2 years ago

https://t.co/LGB2CqU1wc

0

13

6

2

1K

thomvaughan retweeted

Common Crawl Foundation

@CommonCrawl

over 2 years ago

https://t.co/p9AMlfxDE2

0

6

3

0

763

thomvaughan retweeted

Neil Gaiman

@neilhimself

over 4 years ago

It's not the lovely review of Sandman Act II on @audible_com that surprised me. What astonished me is that we got an actual stand alone review for an original audiobook, not as part of a round up. It feels like a real acknowledgement of @DirkMaggs work. https://t.co/KoAF2UzGCZ

12

473

41

3

0