We present SuperKMeans, a k-means variant designed for clustering collections of high-dimensional vector embeddings. SuperKMeans clusters up to 7x faster than FAISS and Scikit-Learn on modern CPUs and up to 4x faster than cuVS on GPUs (Figure 1), while maintaining the quality of the resulting centroids for vector similarity search tasks. The speedup comes from reducing data-access and compute overhead: SuperKMeans reliably and efficiently prunes the dimensions that are not needed to assign a vector to a centroid. Furthermore, we present Early Termination by Recall, a novel mechanism that stops k-means early once the quality of the centroids for retrieval tasks stops improving across iterations. In practice, this further reduces runtimes without compromising retrieval quality. We open-source our implementation at https://github.com/cwida/SuperKMeans.
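To make the idea of pruning dimensions during centroid assignment concrete, the sketch below uses classic partial-distance early abandoning. It is an illustration only and is not the SuperKMeans algorithm, whose pruning strategy is not detailed here. The function name `assign_with_pruning` and the `block` parameter are hypothetical. The squared L2 distance is accumulated over blocks of dimensions, and a centroid is dropped as soon as its partial sum can no longer beat the best full distance found so far.

```python
import numpy as np

def assign_with_pruning(x, centroids, block=16):
    """Return (index, squared distance) of the nearest centroid to x.

    Illustrative sketch, not the SuperKMeans implementation.
    The squared L2 distance is accumulated block by block; because the
    partial sum can only grow, a centroid is abandoned once its partial
    sum reaches the best full distance seen so far.
    """
    best_idx, best_dist = -1, np.inf
    dims = x.shape[0]
    for j, c in enumerate(centroids):
        partial = 0.0
        for start in range(0, dims, block):
            diff = x[start:start + block] - c[start:start + block]
            partial += float(diff @ diff)
            if partial >= best_dist:
                break  # remaining dimensions cannot make this centroid win
        else:
            # All dimensions were scanned and the total beat the current best.
            best_idx, best_dist = j, partial
    return best_idx, best_dist

# Usage: assign one 64-dimensional vector to one of 8 random centroids.
rng = np.random.default_rng(0)
centroids = rng.standard_normal((8, 64))
vector = rng.standard_normal(64)
nearest, dist = assign_with_pruning(vector, centroids)
```

This variant is exact: it returns the same assignment as a full distance computation and only skips work for centroids that have already lost. How much it saves depends on how early the partial sums separate, which in turn depends on the data and on the order in which dimensions are visited.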