https://t.co/K9Bb5un0Sw
Reflections on Recent Talks at the Turing Institute and UCL
Thom Vaughan and Pedro Ortiz Suarez discussed the power of Common Crawl’s open web data in driving research and innovation during two notable presentations last week.
My new app is out!!
✨The Common Crawl Pipeline Creator ✨
Create your pipeline easily:
✔ Run Text Extraction ✂️
✔ Define Language Filters 🌐
✔ Customize text quality 💯
✔ See Live Results 👀
✔ Get Python code 🐍
Based on famous LLM research like Gopher, C4 or FineWeb
We are launching Salamandra 2B & 7B multilingual LLMs trained at @BSC_CNS from scratch with nearly 8 trillion tokens in 35 EU languages+code. Spanish languages have been carefully curated, with Romance languages comprising >30% of the training dataset.
https://t.co/g6WEWvVmSs
👇
The @CommonCrawl team just released all their statistics publicly! And across all dumps, including the most recent one 🙌
Now is the time to finally analyze what's inside the source of most pre-training datasets out there 👀
1/n
The February/March 2024 main crawl is out! 🥳🚀📚
It was an amazing learning experience to do this bimonthly crawl with my colleague @thomvaughan. Please do let us know if you have any feedback!
Happy data crunching! 🧑‍💻
It's not the lovely review of Sandman Act II on @audible_com that surprised me. What astonished me is that we got an actual standalone review for an original audiobook, not as part of a roundup. It feels like a real acknowledgement of @DirkMaggs's work.
https://t.co/KoAF2UzGCZ