I had a fun coding session at the end of last week. I implemented NaFlexGenLIP, an impl of GenLIP that uses NaFlex-style image tokenization and packing instead of full NaViT-style sequence packing. Prelim CC12M sanity training on my local RTX Pro 6000s is showing some signs of life, even with such a small model and dataset 🥳
I was going to do this as part of a new project but w/ recent OpenCLIP refactoring it was easy enough to bolt on there initially to get something that's ready for scale experiments sooner. I threw in some utils to calculate text sequence length and batch budget params based on dataset caption dist. Also hacked together a generative 'zero-shot' image classification idea based on likelihoods.
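For readers wondering what likelihood-based 'zero-shot' classification could look like, here is a minimal sketch. It is not the NaFlexGenLIP or OpenCLIP code. It assumes you already have, for each candidate class prompt, the per-token log-probabilities a generative model assigns to that prompt given the image. The function name `classify_by_likelihood`, the array layout, and the length normalization are all hypothetical choices for illustration.

```python
import numpy as np

def classify_by_likelihood(token_logprobs, token_mask, length_normalize=True):
    """Pick the class whose prompt is most likely under the model.

    token_logprobs: (num_classes, max_len) array, assumed to hold
        log p(token_t | image, tokens_<t) for each class prompt.
    token_mask: (num_classes, max_len) array, 1 for real tokens, 0 for padding.
    """
    lp = np.asarray(token_logprobs, dtype=np.float64)
    mask = np.asarray(token_mask, dtype=bool)
    # Sum log-probs over real tokens only; padding contributes nothing.
    scores = np.where(mask, lp, 0.0).sum(axis=1)
    if length_normalize:
        # Assumption: use mean log-prob so longer prompts are not penalized.
        scores = scores / np.maximum(mask.sum(axis=1), 1)
    return int(np.argmax(scores)), scores

# Toy example with two class prompts of different lengths.
logprobs = [[-0.1, -0.2, 0.0], [-1.0, -1.0, -1.0]]
mask = [[1, 1, 0], [1, 1, 1]]
pred, scores = classify_by_likelihood(logprobs, mask)
```

Whether to normalize by length is a judgment call: raw sums favor short class names, while means can favor prompts made of common tokens.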
tpuf quantizes vectors to improve perf (RaBitQ)
the algo randomly rotates vectors, and we were doing that with a dense matmul at O(d²) space & time, brutal at high dims. 10k dims = 400MB in RAM!
we rebuilt the rotation using FWHT (fast Walsh-Hadamard transform) at O(d) space & O(d log d) time. ~no recall loss, 10k dims = only 5kB in RAM
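A rough sketch of the idea, not tpuf's actual implementation: a structured pseudo-random rotation built from random sign flips followed by an orthonormal FWHT, repeated a few times. The helper names (`fwht`, `make_signs`, `rotate`), the zero-padding to the next power of two, and the choice of 3 rounds are assumptions for illustration. A dense float32 rotation at d = 10k needs 10,000² × 4 bytes = 400MB, whereas this only stores the sign vectors.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform, O(n log n) time."""
    x = np.array(x, dtype=np.float64)  # work on a copy
    n = x.shape[0]
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly: combine the two halves of each block of size 2h.
        y = x.reshape(n // (2 * h), 2, h)
        a, b = y[:, 0, :], y[:, 1, :]
        x = np.stack([a + b, a - b], axis=1).reshape(n)
        h *= 2
    return x / np.sqrt(n)  # scaling makes the transform orthogonal

def make_signs(dim, rounds=3, seed=0):
    """Random +/-1 diagonals; this is the only state the rotation stores."""
    n = 1 << (dim - 1).bit_length()  # next power of two >= dim
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(rounds, n))

def rotate(x, signs):
    """Apply (H * D) once per round to the zero-padded vector."""
    n = signs.shape[1]
    out = np.zeros(n)
    out[: len(x)] = x
    for s in signs:
        out = fwht(out * s)
    return out

rng = np.random.default_rng(1)
x, y = rng.normal(size=10_000), rng.normal(size=10_000)
signs = make_signs(10_000)
rx, ry = rotate(x, signs), rotate(y, signs)
```

Because each round is an orthogonal map, norms and inner products are preserved exactly (up to float error), which is the property the quantizer relies on. The output lives in the padded dimension (16,384 here).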
@n0riskn0r3ward I agree.
Based on my years of experience training and shipping embedding models, the quality of training data for embedding models is very important.
@julien_c @huggingface That sounds insane.
Is it a real copy or a lazy copy (either you incur the cost on access, or you don't copy anything and just access it in place, in which case cross-region access becomes a bottleneck)?
@ndea The paper:
Recursive Program Synthesis
Authors: Aws Albarghouthi, Sumit Gulwani, Zachary Kincaid
University of Toronto, Microsoft Research
https://t.co/su2XJapLiG
@poteto Feels like people have been putting too much effort into developing these (tools, including Cursor, skills, harnesses, etc.).
They will become obsolete soon (they will be absorbed or fixed by the AI companies).