Large-scale Nearest Neighbor (NN) search is widely used for similarity search, but it remains constrained by the computational cost of processing large-scale data. To reduce this cost, Approximate Nearest Neighbor (ANN) search is often used in applications that can rely on an approximate result rather than an exact one. Product Quantization (PQ) is a memory-efficient ANN method that is effective for clustering datasets of all sizes. Clustering large-scale, high-dimensional data is nevertheless expensive in both memory and execution time. This work presents a divide-and-conquer approach to large-scale data in Python that uses PQ, inverted indexing, and Dask, and that combines the partial results without compromising accuracy while reducing the computational requirements to the level needed for medium-scale data.
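To make the PQ step concrete, the following is a minimal NumPy sketch of product quantization with a table-lookup distance search. It is an illustration under stated assumptions, not the implementation used in this work: the helper names (`train_pq`, `encode_pq`, `search_pq`), the plain Lloyd-style k-means, and all parameter values are hypothetical, and the inverted index and the Dask-based partitioning are omitted.

```python
import numpy as np


def train_pq(data, n_subvectors, n_centroids, n_iters=10, seed=0):
    """Learn one small k-means codebook per subspace (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    assert dim % n_subvectors == 0, "dimension must split evenly into subvectors"
    assert n_centroids <= 256, "codes are stored as uint8 in this sketch"
    sub_dim = dim // n_subvectors
    codebooks = np.empty((n_subvectors, n_centroids, sub_dim), dtype=data.dtype)
    for m in range(n_subvectors):
        sub = data[:, m * sub_dim:(m + 1) * sub_dim]
        # Initialise centroids from randomly chosen training points.
        centroids = sub[rng.choice(n, n_centroids, replace=False)].copy()
        for _ in range(n_iters):
            dists = ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            assign = dists.argmin(axis=1)
            for k in range(n_centroids):
                members = sub[assign == k]
                if len(members):  # keep the old centroid if a cluster is empty
                    centroids[k] = members.mean(axis=0)
        codebooks[m] = centroids
    return codebooks


def encode_pq(data, codebooks):
    """Replace each subvector by the index of its nearest centroid."""
    n_subvectors, _, sub_dim = codebooks.shape
    codes = np.empty((data.shape[0], n_subvectors), dtype=np.uint8)
    for m in range(n_subvectors):
        sub = data[:, m * sub_dim:(m + 1) * sub_dim]
        dists = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = dists.argmin(axis=1)
    return codes


def search_pq(query, codes, codebooks, top_k=5):
    """Approximate nearest neighbours of one query via per-subspace lookup tables."""
    n_subvectors, _, sub_dim = codebooks.shape
    # table[m, k] = squared distance from the m-th query subvector to centroid k.
    table = np.stack([
        ((codebooks[m] - query[m * sub_dim:(m + 1) * sub_dim]) ** 2).sum(axis=1)
        for m in range(n_subvectors)
    ])
    # Sum the looked-up distances over subspaces for every encoded vector.
    approx = table[np.arange(n_subvectors), codes].sum(axis=1)
    return np.argsort(approx)[:top_k]
```

Each vector is stored as `n_subvectors` one-byte codes instead of `dim` floats, which is where the memory saving comes from. In a divide-and-conquer setting, one plausible arrangement is to encode and search each data partition separately and then merge the per-partition top-k lists, but that part is not shown here.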