Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting.
5/5
We evaluate the downstream impact of quality filtering on Wikipedia by training tiny monolingual pretrained models for each Wikipedia, and find that data quality pruning is an effective means of resource-efficient training without hurting performance, especially for low-resource languages (LRLs).
4/5
Our work on quality estimation for non-English Wikipedia articles is finally out in the wild 👀. It spread before we had the chance to publicise it haha, but watch out for our upcoming thread on this next week!
"How Good is Your Wikipedia?" a critical analysis of the Wikipedia content quality beyond English, revealing widespread issues such as a high percentage of one-line articles and duplicate articles.
(Tatariya et al., 2024)
https://t.co/YFp4XGMESX
I am happy to present our latest work at EMNLP 2024 on the interpretability of pixel-based language models! 🎉
@vgaraujov @ThomasBauwens_ @mdlhx
Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models
https://t.co/DAPVveutEB
#NLProc
This was my first venture into language model interpretability, and I've learnt a lot of cool things during this project. I hope everyone finds it an interesting read!
Additionally, we examine variants of PIXEL trained with different text rendering strategies, discovering that introducing certain orthographic constraints at the input level can facilitate earlier learning of surface-level features.
Leuven goes to Leiden in 10 days! We'll be presenting two posters about data quality of non-English Wikipedias and about typologically informed language sampling 👀 see you there!
The camera ready version is now up!
https://t.co/jadE5UFPkK
We hope to present this at ACL next year.
To summarize our contributions:
1. The first-ever benchmark for Creole NLP
2. 8 NLP tasks and 28 Creoles
3. Human-generated/checked data
Hopefully this is used as a starting point for future work on Creoles.
✨Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification
Paper: https://t.co/FTfqw50rhF
Talk: https://t.co/12GwI57tKR
#SIGTYP2024 #SIGTYP #EACL2024
Spoiler: We find that PLMs are more influenced by Hindi words when predicting negative emotions, and by English words when predicting positive emotions. Moreover, the PLMs may also overgeneralise this pattern to examples where it does not apply.
My paper on 'Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification' is now on arXiv!🎉
Watch out for it at SIGTYP @ EACL 2024! #NLProc @mdlhx @heather_nlp @johannesbjerva
https://t.co/MpOmz4uUsd
We use LIME and token-level language ID to examine the effect of language on emotion prediction across 3 PLMs finetuned on a Hinglish emotion classification dataset.
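For anyone curious how the LIME + language ID combination could fit together, here is a rough sketch (not the paper's actual code). It assumes you already have one attribution weight per token, for example from a LIME text explainer, and one language tag per token from a language ID tool. The function name, the `hi`/`en` tags, and all the numbers are made up for illustration.

```python
from collections import defaultdict


def mean_attribution_by_language(lang_tags, weights):
    """Average per-token attribution weights separately for each language tag.

    lang_tags: one tag per token (hypothetical tags such as "hi" or "en").
    weights:   one attribution weight per token for a single emotion class,
               assumed to be precomputed (e.g. by a LIME text explainer).
    """
    if len(lang_tags) != len(weights):
        raise ValueError("lang_tags and weights must have the same length")

    totals = defaultdict(float)
    counts = defaultdict(int)
    for tag, weight in zip(lang_tags, weights):
        totals[tag] += weight
        counts[tag] += 1

    # Mean weight per language: how strongly tokens of that language
    # pushed the model towards the explained class, on average.
    return {tag: totals[tag] / counts[tag] for tag in totals}


# Illustrative, invented values for one code-mixed sentence.
tokens = ["yeh", "movie", "bahut", "boring", "thi"]
lang_tags = ["hi", "en", "hi", "en", "hi"]
weights = [0.10, -0.05, 0.30, 0.20, 0.05]

result = mean_attribution_by_language(lang_tags, weights)
print(result)
```

Repeating this per emotion class and averaging over a dataset would give a language-by-emotion view of the attributions, which is one plausible way to compare the influence of Hindi and English tokens.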