Researchers made KMeans 200x faster.
And the new technique also beats approaches like cuML and FAISS.
Flash-KMeans is an IO-aware implementation of exact KMeans that redesigns the algorithm around modern GPU bottlenecks.
By attacking the memory bottlenecks directly, Flash-KMeans achieves:
- 33x speedup over cuML
- 200x speedup over FAISS
This speedup comes from how it moves through GPU memory.
Each iteration of standard KMeans runs in two steps, and both are bottlenecked by reads and writes to GPU memory:
1) The first step matches every point to its nearest centroid.
Standard KMeans computes the full point-to-centroid distance matrix, writes it out to GPU memory, then reads it back to find each nearest centroid. That write-then-read round trip is the bottleneck.
Flash-KMeans combines the distance calculation with the nearest-centroid step, so the result is computed on-chip and the full matrix is never written out.
2) The second step recomputes each centroid by averaging the points assigned to it.
Standard KMeans has thousands of threads writing into the same centroid slots at once, so they stall waiting for their turn.
Flash-KMeans sorts points by cluster first, turning scattered writes into sequential reductions that read and write memory in one efficient pass.
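To make the two ideas concrete, here is a minimal CPU sketch in NumPy. It is not the Flash-KMeans GPU kernel: the function names (assign_tiled, update_sorted), the tile size, and the choice to leave empty clusters at zero are all assumptions made for illustration. It only mirrors the structure: step 1 keeps a running nearest centroid per point while scanning centroids in tiles, so the full N x K distance matrix never exists at once, and step 2 sorts points by cluster so each centroid is a reduction over one contiguous segment.

```python
import numpy as np

def assign_tiled(points, centroids, tile=256):
    """Nearest-centroid labels without building the full N x K distance matrix."""
    n, k = points.shape[0], centroids.shape[0]
    best_dist = np.full(n, np.inf)
    best_idx = np.zeros(n, dtype=np.int64)
    p_sq = (points ** 2).sum(axis=1)
    for start in range(0, k, tile):
        c = centroids[start:start + tile]
        # Squared distances to this tile of centroids only: shape (n, <=tile)
        d = p_sq[:, None] - 2.0 * points @ c.T + (c ** 2).sum(axis=1)[None, :]
        local = d.argmin(axis=1)
        local_dist = d[np.arange(n), local]
        # Fold the tile's result into the running minimum, then discard the tile
        better = local_dist < best_dist
        best_dist[better] = local_dist[better]
        best_idx[better] = local[better] + start
    return best_idx

def update_sorted(points, labels, k):
    """Recompute centroids by sorting points by cluster, then reducing each segment."""
    order = np.argsort(labels, kind="stable")
    sorted_labels = labels[order]
    sorted_points = points[order]
    # Start offset of each non-empty cluster's contiguous run
    present, starts = np.unique(sorted_labels, return_index=True)
    sums = np.add.reduceat(sorted_points, starts, axis=0)
    counts = np.diff(np.append(starts, len(labels)))
    # Assumption: clusters with no points are left at zero
    centroids = np.zeros((k, points.shape[1]), dtype=points.dtype)
    centroids[present] = sums / counts[:, None]
    return centroids

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
C = rng.normal(size=(37, 8))
labels = assign_tiled(X, C, tile=10)
new_C = update_sorted(X, labels, k=37)
```

On a GPU the tile would live in on-chip memory and the segment reduction would replace contended writes, but those details are specific to the paper's kernels and are not reproduced here.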
With these two optimizations, Flash-KMeans completes a standard KMeans iteration on millions of points in a few milliseconds.
The video below depicts this in action.
Several reasons why this is important:
KMeans has always been an offline primitive, something you run once to preprocess data before moving on.
These speedups make it viable inside several latency-critical systems:
↳ Vector indices like FAISS use KMeans to build search indices. Faster KMeans means you can re-index dynamically as data changes.
↳ LLM quantization methods need KMeans to find optimal weight codebooks, per layer, repeatedly. What takes hours could now take minutes.
↳ MoE models need fast token routing at inference time. Flash-KMeans makes it viable to run KMeans inside the inference loop, not just in preprocessing.
I have shared the paper in the replies.
That said, memory is the real constraint Flash-KMeans addresses, and the problem is not limited to clustering. The vectors a RAG system stores after indexing create similar bottlenecks.
I wrote a detailed walkthrough recently on cutting this vector memory by 32x with binary quantization, querying 36M+ vectors in a few milliseconds.
Read it below.
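For intuition on the 32x figure: assuming the original vectors are stored as float32, keeping only one sign bit per dimension shrinks each dimension from 32 bits to 1. The sketch below is a generic illustration of binary quantization with Hamming-distance search in NumPy, not the method from the walkthrough; the function names (binarize, hamming_search) and the sign-threshold rule are assumptions.

```python
import numpy as np

def binarize(vectors):
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    bits = (vectors > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

def hamming_search(query_code, codes, top_k=5):
    """Indices of the top_k codes closest to query_code in Hamming distance."""
    # XOR marks the bits that differ; unpacking and summing counts them
    diff = np.bitwise_xor(codes, query_code)
    dist = np.unpackbits(diff, axis=1).sum(axis=1)
    return np.argsort(dist, kind="stable")[:top_k]

rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 128)).astype(np.float32)
codes = binarize(vecs)

print(vecs.nbytes // codes.nbytes)  # 32
print(hamming_search(codes[3], codes, top_k=3))
```

The first print is the storage ratio (512,000 bytes of float32 versus 16,000 bytes of packed bits). Real systems typically rescore the top binary matches against the original vectors to recover accuracy.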
Our new paper is just out!
https://t.co/2IVw6V32kg
We demonstrate that unsupervised visual perceptual learning (VPL) can occur following exposure to task-irrelevant natural scene images. This form of unsupervised VPL was more robust for natural scenes than for artificial images matched in lower-order image statistics. Our results indicate that this advantage is attributable to higher-order statistics in natural scenes, which appear to be less susceptible to attentional suppression.
The work ultimately involved 14 experiments across behavior, fMRI, and eye tracking, with 25 figures and 8 tables. A genuine team effort!