Excited to share Flash-GMM, our new work from IBM Research on scaling Gaussian Mixture Models on GPUs 🚀
Paper: https://t.co/fOWZToMcX9
Code: https://t.co/1dPYk7dKuz
A simple idea inspired by FlashAttention enables GMMs to scale to billions of data points on a single GPU.
Zero-shot instruction-following, nuanced classification and reasoning? ➡️ LLMs!
Real-world low-latency retrieval over a large-scale corpus? ➡️ Embedding models!
But what if you need BOTH at once? That's what our new paper 💡 is all about...
https://t.co/WmDM1JnZA1
🧵