Many fields are experiencing a Big Data explosion, with data collection rates outpacing the improvements in computing performance predicted by Moore's Law. Researchers are often interested in similarity search on such data. We present CAKES (CLAM-Accelerated $K$-NN Entropy Scaling Search), a novel algorithm for $k$-nearest-neighbor ($k$-NN) search that leverages geometric and topological properties inherent in large datasets. CAKES assumes the manifold hypothesis and performs best when the data lie on a low-dimensional manifold, even if that manifold is embedded in a very high-dimensional space. Benchmarked on 5 standard datasets, CAKES runs hundreds to tens of thousands of times faster than state-of-the-art approaches such as FAISS and HNSW. Unlike locality-sensitive hashing approaches, CAKES works with any user-defined distance function. When the data occupy a metric space, CAKES exhibits perfect recall.
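To make the terms "user-defined distance function" and "perfect recall" concrete, the following is a minimal sketch of an exact $k$-NN linear scan. It is not the CAKES algorithm and uses none of its API; the names `knn_linear_scan` and `hamming` and the toy data are assumptions introduced here for illustration only. A linear scan returns the true $k$ nearest neighbors by construction, so it serves as the ground truth against which the recall of any faster method can be measured.

```python
import heapq
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def knn_linear_scan(
    data: Sequence[T],
    query: T,
    k: int,
    distance: Callable[[T, T], float],
) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k points closest to query.

    Hypothetical baseline: every point is compared with the query, so the
    result is exact for any distance function the caller supplies.
    """
    if k <= 0:
        return []
    # Pair each distance with its index so ties are broken by position.
    scored = ((distance(query, point), i) for i, point in enumerate(data))
    nearest = heapq.nsmallest(k, scored)
    return [(i, d) for d, i in nearest]


def hamming(a: str, b: str) -> int:
    """Hamming distance between equal-length strings (a metric)."""
    return sum(x != y for x, y in zip(a, b))


# Toy data: short strings compared under a non-Euclidean distance.
reads = ["ACGT", "ACGA", "TTTT", "ACCA"]
hits = knn_linear_scan(reads, "ACGT", 2, hamming)
print(hits)
```

The scan costs one distance evaluation per stored point for every query, which is the cost that index-based methods aim to reduce.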