Today we released “GPIC: A Giant Permissive Image Corpus for Visual Generation.” It’s a 100M image dataset for visual generation, with text captions and 100% known+permissive licenses, hosted on HuggingFace. I’m excited to get this out! Check it out: https://t.co/6pZ66Nihgx
GPIC should be the new standard benchmark for generative modeling. Training 1 epoch on GPIC is the same cost as 100 epochs on ImageNet, but is a much better proxy for real-world problems. If you work in generative modeling, try GPIC for your next project!
One practical example is epoch count – “state-of-the-art” models on ImageNet-1K train for 300-1700 epochs (Fig. credit: PixGen). But that’s not the way you would do things outside of an academic comparison – you’d just go get more data!
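A rough back-of-the-envelope sketch of the epoch arithmetic above, counting only the number of training images seen. The 100M figure is the rounded size from the announcement, and 1,281,167 is the size of the standard ImageNet-1K training split. The function names are made up for illustration, and the sketch assumes every image costs the same to process, ignoring resolution, captions, and model size.

```python
# Back-of-the-envelope comparison of "images seen" on ImageNet-1K vs. GPIC.
# Assumption: per-image training cost is identical on both datasets.

IMAGENET_1K_TRAIN_IMAGES = 1_281_167   # ILSVRC-2012 training split
GPIC_TRAIN_IMAGES = 100_000_000        # "100M", rounded, from the announcement


def samples_seen(dataset_size: int, epochs: float) -> int:
    """Total number of training images processed over `epochs` passes."""
    return round(dataset_size * epochs)


def imagenet_epochs_to_gpic_epochs(imagenet_epochs: float) -> float:
    """GPIC passes that process as many images as the given ImageNet-1K passes."""
    return imagenet_epochs * IMAGENET_1K_TRAIN_IMAGES / GPIC_TRAIN_IMAGES


def gpic_epochs_to_imagenet_epochs(gpic_epochs: float) -> float:
    """ImageNet-1K passes that process as many images as the given GPIC passes."""
    return gpic_epochs * GPIC_TRAIN_IMAGES / IMAGENET_1K_TRAIN_IMAGES


if __name__ == "__main__":
    one_gpic = gpic_epochs_to_imagenet_epochs(1)
    print(f"1 GPIC epoch ~ {one_gpic:.0f} ImageNet-1K epochs (by image count)")
    for epochs in (300, 1700):
        seen = samples_seen(IMAGENET_1K_TRAIN_IMAGES, epochs)
        gpic = imagenet_epochs_to_gpic_epochs(epochs)
        print(f"{epochs} ImageNet-1K epochs = {seen:,} images ~ {gpic:.1f} GPIC epochs")
```

By raw image count, one GPIC epoch comes out somewhat under 100 ImageNet-1K epochs, so the "100 epochs" figure reads as an order-of-magnitude comparison. The exact ratio depends on details this sketch does not model.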
Personally: I spent a portion of my PhD working on strong tokenizers, which is sort of at odds with the current ImageNet-1K meta of adding regularization whenever possible, so I’m also excited to see how this dataset drives tokenization research. Happy pretraining!
I’m also proud of this section of the paper, which gives best practices for compliance with our eval protocol. Without calling out anyone in particular, let me just say that using auxiliary foundation models to get a better FD-DINOv2 on GPIC without being very up front about the huge advantages of the extra data and model FLOPs is super bad – please don’t do it!
1/ Introducing GPIC: a Giant Permissive Image Corpus and benchmark for visual generation!
🚀100M VLM-captioned image-text pairs for training
📊1M image-text pairs for benchmarking
🖼️~28 trillion pixels
🤗Centrally Hosted
✅Fully permissive for research + commercial use
Dataset, benchmark and models🧵👇
Co-led with @KyleSargentAI
Announcing Talkie: a new, open-weight historical LLM! We trained and finetuned a 13B model on a newly-curated dataset of only pre-1930 data. Try it below!
with @AlecRad and @status_effects 🧵
Today @YuzuHealthInc announces our Series A!
Our mission is to bring trust and agency back to health insurance.
Thank you to our customers, partners, and team who made this possible!
Blog and hiring link in the comments.