Algorithms for approximate nearest-neighbor search (ANNS) have been the topic of significant recent interest in the research community. However, evaluations of such algorithms are usually restricted to a small number of datasets with millions or tens of millions of points, whereas real-world applications require algorithms that work on the scale of billions of points. Furthermore, existing evaluations of ANNS algorithms typically focus heavily on measuring and optimizing queries per second (QPS) at a given accuracy, a measure that can be hardware-dependent and that ignores important metrics such as build time. In this paper, we propose a set of principled measures for evaluating ANNS algorithms that refocus on their scalability to billion-size datasets. These measures include the ability to be efficiently parallelized, build times, and scaling relationships as dataset size increases. We also expand on the QPS measure with machine-agnostic measures such as the number of distance computations per query, and we evaluate the accuracy of ANNS data structures in the more demanding settings required by modern applications, such as range queries and out-of-distribution data. We optimize four graph-based algorithms for the billion-scale setting, and in the process provide a general framework for making many incremental ANNS graph algorithms lock-free. We use our framework to evaluate the aforementioned graph-based ANNS algorithms as well as two alternative approaches.
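To illustrate what a machine-agnostic measure such as the number of distance computations per query might look like in practice, the following is a minimal sketch and not the paper's benchmarking code. It assumes a toy setup: the names `CountingDistance`, `brute_force_knn`, `sampled_knn`, and `recall` are hypothetical, and the "approximate" search simply scans a random subset of points as a stand-in for a real ANNS index.

```python
import math
import random


class CountingDistance:
    """Euclidean distance wrapper that counts how often it is called (hypothetical helper)."""

    def __init__(self):
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return math.dist(a, b)


def brute_force_knn(points, query, k, dist):
    """Exact k-NN: compares the query against every point."""
    order = sorted(range(len(points)), key=lambda i: dist(points[i], query))
    return order[:k]


def sampled_knn(points, query, k, dist, sample_size, rng):
    """Toy approximate search: examines only a random subset of the points."""
    candidates = rng.sample(range(len(points)), sample_size)
    candidates.sort(key=lambda i: dist(points[i], query))
    return candidates[:k]


def recall(approx, exact):
    """Fraction of the true nearest neighbors that the approximate result found."""
    return len(set(approx) & set(exact)) / len(exact)


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(1000)]
query = (0.5, 0.5)

exact_dist = CountingDistance()
exact = brute_force_knn(points, query, 10, exact_dist)

approx_dist = CountingDistance()
approx = sampled_knn(points, query, 10, approx_dist, 200, rng)

print("distance computations (exact): ", exact_dist.calls)
print("distance computations (approx):", approx_dist.calls)
print("recall of approx vs exact:     ", recall(approx, exact))
```

Unlike QPS, the two call counts do not depend on the machine running the script, so they can be compared across hardware alongside recall.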