We consider the fundamental problem of decomposing a large-scale approximate nearest neighbor search (ANNS) problem into smaller sub-problems. The goal is to partition the input points into neighborhood-preserving shards, so that the nearest neighbors of any point are contained in only a few shards. When a query arrives, a routing algorithm identifies the shards that should be searched for its nearest neighbors. This approach forms the backbone of distributed ANNS, where the dataset is so large that it must be split across multiple machines. In this paper, we design simple and highly efficient routing methods, and prove strong theoretical guarantees on their performance. A crucial characteristic of our routing algorithms is that they are inherently modular and can be used with any partitioning method. This addresses a key drawback of prior approaches, where the routing algorithm is inextricably linked to its associated partitioning method. In particular, our new routing methods enable the use of balanced graph partitioning, a high-quality partitioning method that has no naturally associated routing algorithm. Thus, we provide the first methods for routing with balanced graph partitioning that are extremely fast to train, admit low latency, and achieve high recall. We provide a comprehensive evaluation of our full partitioning and routing pipeline on billion-scale datasets, where it outperforms existing scalable partitioning methods by significant margins, achieving up to 2.14x higher QPS at 90% recall@10 than the best competitor.
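To make the partition-then-route pipeline concrete, the following is a minimal sketch, not the routing method proposed in this paper. It assumes a simple centroid-based router as a stand-in: each shard is summarized by the mean of its points, and a query is sent to the shards with the closest centroids. The function names (`build_router`, `route`, `search`, `recall_at_k`) and the parameter `num_probes` are hypothetical and chosen for illustration only. The router takes the partition as a plain label array, which illustrates the modularity described above: any partitioning method that outputs shard labels could be plugged in.

```python
import numpy as np

def build_router(points, shard_of, num_shards):
    """Summarize each shard by its centroid (assumed stand-in router).

    `shard_of[i]` is the shard label of `points[i]`; the labels may come
    from any partitioning method. Assumes every shard is non-empty.
    """
    return np.stack([points[shard_of == s].mean(axis=0)
                     for s in range(num_shards)])

def route(query, centroids, num_probes):
    """Return the ids of the `num_probes` shards with the nearest centroids."""
    dists = np.linalg.norm(centroids - query, axis=1)
    return np.argsort(dists)[:num_probes]

def search(query, points, shard_of, shards_to_probe, k):
    """Exact k-NN restricted to the points stored in the probed shards."""
    candidate_ids = np.flatnonzero(np.isin(shard_of, shards_to_probe))
    dists = np.linalg.norm(points[candidate_ids] - query, axis=1)
    return candidate_ids[np.argsort(dists)[:k]]

def recall_at_k(found, exact):
    """Fraction of the true k nearest neighbors that were retrieved."""
    return len(set(found) & set(exact)) / len(exact)
```

Probing more shards raises recall at the cost of more work per query, which is the trade-off behind a metric such as QPS at a fixed recall@10.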