Love seeing the timeline wake up to the fact that data is the most underinvested area in ML.
But let’s set the record straight: the world’s premier data research company isn't hypothetical. It already exists. It’s called @datologyai, and we’ve been building it for 2.5 years. 🧵
+1 - Seeing some of my great coworkers get unceremoniously dropped was part of my impetus for leaving Meta in the first place. Would love to chat about how green the grass is over at Datology, DM me if you're curious.
If you were impacted by the recent Meta layoffs (or even if you weren't) and you're interested in doing ambitious, rigorous science and/or engineering that powers a real product that actual customers pay us ca$h money for, please DM me or head over to https://t.co/4truVQtHWy. We're particularly interested in people who have experience with any of the following: data curation, post-training, training stacks/infra, and data infra.
If I've learned anything in my last 2 years at @datologyai it's that running production-scale research opens a whole host of interesting engineering problems. This is just a peek.
The folks at @AmplifyPartners went deep with my team at @datologyai on the engineering challenges involved in large-scale data curation for training models, from deduplicating a non-trivial fraction of the internet to orchestrating dozens of experiments and terabytes of data every single day:
https://t.co/mTh2mOmY87
. @datologyai is back: state-of-the-art CLIP model performance using data curation alone 🚀
✅ state-of-the-art ViT-B/32 performance: ImageNet 1k 76.9% vs 74% reported by SigLIP2
✅ 8x training efficiency gains
✅ 2x inference efficiency gains
✅ Public model release
Details in the 🧵 thread below 👇
Are you a Researcher or Engineer (or something between) interested in driving results like these? Excited by the idea of pushing the frontier of AI through improvements in data? Join us! We’re hiring for full-time and internship positions: https://t.co/dQaN3aDUN2
13/13
Incredibly excited to be sharing what we’ve been working on at @DatologyAI. Thanks to a ton of work from an incredible team, I think we have a strong showing for the impact of our curation for LLMs: we make RPJv1 better than the best available pretraining datasets.
🧵1/n
Closing out, we’re only just getting started on this ride, and week over week we’re pushing the frontier of generalizing curation methodologies, getting stable results from them, and putting them into production.
12/n
🧵We’ve spent the last few months at @datologyai building a state-of-the-art data curation pipeline and I’m SO excited to share our first results: we curated image-text pretraining data and massively improved CLIP model quality, training speed, and inference efficiency 🔥🔥🔥
Hello world! We are incredibly excited to come out of stealth today to help make better data accessible to everyone, automatically.
Hear from our founders about our mission and vision for DatologyAI:
https://t.co/trNyGhM8jt
🚨 New work: BlenderBot 3x 🚨
- Public data release & analysis of 6M chat interactions.
- Learns by conversing with people in the real world: training on this data improves BB3 from 85.3% → 94.4% good messages.
paper: https://t.co/HOiPcbAFng
project: https://t.co/kTNr2rJyIV