ParaCrawl @ParaCrawl - Twitter Profile

over 1 year ago

Hunting for parallel data for Asian Languages? ParaCrawl just added 9 new bonus corpora. More info & paper by Philipp Koehn from @jhuclsp to be presented at WMT24 (#EMNLP2024): https://t.co/m2uEZ2k0kL The 9 datasets, as Bonus Release: https://t.co/Ncx2jlhmNF

ParaCrawl @ParaCrawl

about 2 years ago

Hi there, three new Bonus ParaCrawl languages have just been released:
- English-Azerbaijani
- English-Tajik
- English-Armenian
Go to the ParaCrawl website, scroll down to Bonus Languages (Low-Resource), and download your preferred version: https://t.co/Ncx2jlhmNF

ParaCrawl retweeted

HPLT @hplt_eu

over 2 years ago

HPLT News and Tools!!! If you are interested in filtering your datasets for quality and using them to train MT and LLMs, you are interested in this thread 👇

ParaCrawl retweeted

HPLT @hplt_eu

about 3 years ago

Interested in Open and Community-Driven MT initiatives? CrowdMT is for you! 🎙️Invited speakers from Wikimedia Foundation and Apertium announced. 📜Accepted papers and abstracts announced. Time to register at https://t.co/Sxpp59rDHp Details: https://t.co/oGlIb88HjG

ParaCrawl @ParaCrawl

about 3 years ago

And Icelandic!

ParaCrawl @ParaCrawl

about 3 years ago

New parallel (en-*) and monolingual corpora from #MaCoCu have just been released. Included languages: Albanian, Bosnian, Bulgarian, Croatian, Macedonian, Maltese, Montenegrin, Serbian, Slovene, Turkish.

Taja Kuzman Pungeršek @TajaKuzman

about 3 years ago

We've published new #MaCoCu web corpora for 11 under-resourced languages! 56 million documents, 17 BILLION words (monolingual corpora) and 580 million words (English-X parallel corpora) were just uploaded to the https://t.co/uQlUZ0UlqA repository (https://t.co/fwlCrf5glK) 🥳

ParaCrawl retweeted

Prompsit @Prompsit

about 3 years ago

#MT people: the submission deadline for the CrowdMT workshop, which presents work on Open Source and Community-Driven MT, has been extended to 21st April 2023! Abstracts and papers wanted! You are wanted in Tampere too, for the whole #EAMT23 conference or at least for this workshop on the 15th of June!

ParaCrawl @ParaCrawl

over 3 years ago

@aihkas @BramVanroy @Nils_Reimers @huggingface @Reverso_ Hi @aihkas, sorry for the late reply. The ParaCrawl website has a "Notice and take down policy" section with a contact e-mail. Anonymized versions of the ParaCrawl corpora (ROAM) were released to avoid these issues. We will make sure that your personal data gets removed if it is still present.

ParaCrawl @ParaCrawl

over 3 years ago

@Nils_Reimers @BramVanroy @huggingface Sure Nils, check https://t.co/bpCHCRDhS0 for a full list of languages. The different efforts covered more than 40 languages. We just published a new one today, Polish-Czech. Unfortunately, it is only sentence-aligned.

ParaCrawl @ParaCrawl

over 3 years ago

@BramVanroy @Nils_Reimers @huggingface Sorry, guys, @ParaCrawl corpora are sentence-aligned only. URLs are provided, but the original documents are difficult to reconstruct.

ParaCrawl @ParaCrawl

over 3 years ago

A new ParaCrawl parallel corpus is available!
🌍 languages: Polish-Czech
🎒 size: 24 million sentences
🗒️ license: CC0
🎯 location: https://t.co/RomGYSHhdz bonus section
🧐 more info: https://t.co/m2uEZ22oWb

ParaCrawl retweeted

Prompsit @Prompsit

about 4 years ago

Indeed, this is the first data release of the #Macocu effort. You will find both monolingual and bilingual (with English) corpora on the ELRC-Share and CLARIN repositories and on the website. Insights coming soon! Most of the code is also ready for you to try out!

ParaCrawl @ParaCrawl

about 4 years ago

@vince62s Hi, publications are coming soon, but see the MT results here (spoiler: all BLEU scores go up in v9): https://t.co/f3Cg99Zp5X Also, yes, v9 and all the other versions are shuffled.

ParaCrawl @ParaCrawl

almost 5 years ago

Summer was for work! Now #ParaCrawl v9 corpora are done and again bigger than the previous ones!🤩 Extrinsic evaluation through MT almost finished and, according to old BLEU and new COMET, the quality of the MT output improves! 🥳 We will share corpora and more results soon!🕑

ParaCrawl @ParaCrawl

about 4 years ago

Check out MultiParaCrawl 9, including 36 parallel corpora for Ukrainian and a total of 705 bitexts. Thanks to OPUS and @TiedemannJoerg for sharing this great resource! https://t.co/ZaDFtEHHrX

ParaCrawl retweeted

Barry Haddow @bazril

over 4 years ago

@anas_ant If you have an MT system, try bleualign (https://t.co/ZoaLrfbvQq) from @ParaCrawl . Scales to ParaCrawl-sized data.

ParaCrawl @ParaCrawl

over 4 years ago

We're back with more language resources: an English-Ukrainian parallel corpus with approx. 13M sentence pairs has been released. More info and downloads: https://t.co/6DLST6wo86 Please spread the word and use it!

ParaCrawl @ParaCrawl

over 4 years ago

@jjon1910 Please try again. It was not you, but a typo in a script.😑 Thanks for reporting the issue and for your interest in being the first downloader! 🤩

ParaCrawl @ParaCrawl

over 4 years ago

Done! All #ParaCrawl v9 corpora are now available at https://t.co/RomGYSYSC9. Some are also on Corset (https://t.co/ocUOmmwrmM) for further inspection or filtering, and a new Bitextor is also out: https://t.co/5c3Q3rmSUQ! Thanks to #CEF and the EU for co-funding this great project!

ParaCrawl @ParaCrawl

almost 5 years ago

Very clear TODO from #ParaCrawl's last stakeholder board meeting: we need better language identification, especially for closely related languages and for under-resourced ones. Such a basic thing! We are trying to improve current results by combining fastText and Hunspell, take a look👇
