Research Scientist @ Jasper Research | Ph.D in Applied Maths (Generative Models) @Inria. I also maintain python packages democratizing Deep Generative Models.
After 8 months of long coding nights ☕️ we finally officially release Pythae 🥳, a python library unifying generative autoencoder implementations including vaegan🥗, vqvae or RAEs.
🖥️ github repo: https://t.co/570oxztyn3
👉paper: https://t.co/Nh5BgRWtU7
Feels quite magical to be able to clone a 68 TB dataset to my private HF training bucket while I only have a 4TB local disk, all of that in under two minutes thanks to HF infra optimizations & xet dedup!
We are starting to be quite bullish about getting into the data infrastructure business.
I just cloned 68 TB (while I only have a 4TB local disk) to my @huggingface training bucket in 1 minute 55 seconds, thanks to Xet deduplication and all our infra optimizations.
You can host your data processing pipelines on HF and leverage those insane optimizations 🔥
📢 New @heyjasper release! 📢
MONET 🌸 : An Apache 2.0, deduplicated and recaptioned dataset of 105M samples unlocking reproducible text-to-image research.
Nano T2I 🖌️ : A codebase to train your own T2I model
🤗 @huggingface: https://t.co/x6gEhQIaFV
💻: https://t.co/K6VIU2wjtW
Very excited about this new release, pushing the boundaries of open and reproducible T2I research.
Congrats to the team!
Benjamin Aubin, Gonzalo Quintana, @onurxtasar @UlaLaParis @_jeev2 @dh7net @clipdropapp @heyjasperai
Huge open release from @heyjasperai: MONET
105M curated image-text pairs, Apache 2.0, with embeddings, VAE latents, multi-VLM captions, and a companion training repo (nano-t2i) to train a T2I model end-to-end on one H200 for <$300.
Congrats @CChadebec & co 👏
With 104M image-text pairs, this is one of the largest, if not the largest, openly licensed image datasets
And it's on @huggingface!!
Kudos @heyjasperai
We put in place a rigorous filtering, deduplication, and re-captioning pipeline to create MONET:
⛽ Sourced from 2.9B images from open datasets (LAION, COYO, etc.)
✅ Filtered for high-res, aesthetics & strict safety/NSFW standards
👬 Deduplicated & stripped of stock/watermarked images
💬 Re-captioned using 4 top VLMs for rich, diverse text descriptions
🕹️ Augmented with safe, permissive synthetic data
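For readers who want to picture the filter-then-dedup steps listed above, here is a minimal sketch, not the actual MONET pipeline. The `Sample` fields, the thresholds (`min_side`, `min_aesthetic`) and the exact-hash deduplication are all assumptions made for illustration; a production pipeline would more likely use near-duplicate detection and model-based safety and aesthetic scorers.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical per-image record; field names are illustrative only.
    image_bytes: bytes
    width: int
    height: int
    aesthetic_score: float  # assumed to come from some upstream scorer
    nsfw: bool              # assumed output of a safety classifier
    watermarked: bool       # assumed output of a watermark detector


def curate(samples, min_side=512, min_aesthetic=5.0):
    """Keep high-res, safe, non-watermarked samples and drop exact duplicates.

    Thresholds are placeholders, not values from the MONET release.
    """
    seen = set()
    kept = []
    for s in samples:
        # Resolution filter: reject images whose shorter side is too small.
        if min(s.width, s.height) < min_side:
            continue
        # Quality and safety filters.
        if s.aesthetic_score < min_aesthetic or s.nsfw or s.watermarked:
            continue
        # Exact deduplication on the raw bytes (stand-in for near-dup search).
        digest = hashlib.sha256(s.image_bytes).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(s)
    return kept
```

Filtering before hashing keeps the `seen` set small, since rejected images never need a digest.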
Using the MONET dataset exclusively, we trained a 4B T2I model from scratch. Built on an MMDiT-inspired architecture and trained via latent flow matching with a deep compression VAE, the model can generate images up to 2048x2048 resolution.
📜 : https://t.co/Kf6zDtNHTD
💻: https://t.co/K6VIU2wjtW
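As a rough illustration of what "latent flow matching" means in the post above, here is a minimal sketch of one common formulation: a straight-line path between a noise sample and a data latent, with the model regressing the constant velocity along that path. The linear path, the convention that t=0 is noise and t=1 is data, and the function names are assumptions; the release may use a different parameterization.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Return (x_t, target velocity) on the straight path from x0 to x1.

    x0 is a noise sample, x1 a data latent (assumed convention), t in [0, 1].
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def flow_matching_loss(predict_velocity, latent, rng):
    """Mean squared error between predicted and target velocity for one latent.

    predict_velocity(x_t, t) is a hypothetical stand-in for the T2I network.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    t = rng.random()
    x_t, target = flow_matching_pair(noise, latent, t)
    pred = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(latent)
```

In a real setup the latent would come from the VAE encoder and the predictor would also take the text conditioning; both are left out here to keep the sketch self-contained.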