Here are some of the quality issues that you may find:
‣ Duplicates
‣ Outliers
‣ Mislabels
‣ Corrupted images
‣ Train/test leakage
‣ Overly bright/dark/blurry images
Notebooks:
‣ Kaggle Notebook - https://t.co/ZXhQ0KnrC8
‣ Colab Notebook - https://t.co/IfCbUpKPKo
@nietras1 we have just packaged Meta's #dinov2 model using #fastdup. It should be super easy to run:
import fastdup
fd=fastdup.create(input_dir=<your images>, work_dir=<output folder>)
https://t.co/yzXjUTj2S8(model_path='dinov2s')
fd.vis.component_gallery()
LAION results:
@Eric_Wallace_ Thanks for featuring our github repo fastdup title image! https://t.co/Qcn2yNFIae Everyone should try us out for deduplicating large scale image repos. It is free!
See our paper for a lot more technical details and results.
Speaking personally, I have many thoughts on this paper. First, everyone should de-duplicate their data as it reduces memorization. However, we can still extract non-duplicated images in rare cases! [6/9]
@Suhail Hi @Suhail we are building exactly that, and coincidentally you already used us! https://t.co/XWqBBDspJt
We would love to chat if you are open to it and explore collaboration.
@Suhail Hi @Suhail I am the co-creator of fastdup, it is great to learn you managed to clean 500,000 in one hour including learning our tool. We would love to hear what are you doing with images and see if we can help in any way!