Approximate nearest neighbor search is fundamental in information retrieval. Previous partition-based methods improve search efficiency by probing only a subset of partitions, yet they face two common issues. In the query phase, a common strategy is to probe partitions according to the distance ranks between a query and the partition centroids, which inevitably probes irrelevant partitions because it ignores the data distribution. In the partition construction phase, all partition-based methods face the boundary problem, which scatters a query's nearest neighbors across multiple partitions, resulting in a long-tailed kNN distribution and worsening the optimal nprobe (i.e., the number of partitions to probe). To address these issues, we propose LIRA, a LearnIng-based queRy-aware pArtition framework. Specifically, we propose a probing model that directly probes the partitions containing the kNN of a query, which reduces probing waste and enables query-aware probing with a per-query nprobe. Moreover, we incorporate the probing model into a learning-based redundancy strategy to mitigate the adverse impact of the long-tailed kNN distribution on search efficiency. Extensive experiments on real-world vector datasets demonstrate the superiority of LIRA in the trade-off among accuracy, latency, and query fan-out. The code is available at https://github.com/SimoneZeng/LIRA-ANN-search.
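To make the contrast concrete, the following is a minimal illustrative sketch (not the authors' implementation) of fixed-rank probing versus query-aware probing with a per-query nprobe. The probing model here is a stand-in returning one score per partition; LIRA trains such a model so that partitions likely to contain a query's kNN are probed directly.

```python
# Illustrative sketch: distance-rank probing vs. query-aware probing.
# `probing_scores` is a placeholder for a learned probing model.
import numpy as np

rng = np.random.default_rng(0)
dim, n_partitions = 32, 16
centroids = rng.normal(size=(n_partitions, dim))   # partition centroids
query = rng.normal(size=dim)

def distance_rank_probe(query, centroids, nprobe):
    """Baseline: probe a fixed number of partitions by centroid-distance rank."""
    dists = np.linalg.norm(centroids - query, axis=1)
    return np.argsort(dists)[:nprobe]

def probing_scores(query):
    """Stand-in for a learned probing model: one score per partition,
    interpreted as the likelihood that it holds some of the query's kNN."""
    logits = -np.linalg.norm(centroids - query, axis=1)  # placeholder for model output
    return np.exp(logits) / np.exp(logits).sum()

def query_aware_probe(query, threshold=0.1):
    """Query-aware probing: nprobe varies per query, determined by which
    partitions score above a threshold (falling back to the best one)."""
    scores = probing_scores(query)
    probed = np.flatnonzero(scores >= threshold)
    return probed if probed.size else np.array([int(np.argmax(scores))])

print("fixed nprobe=4:", distance_rank_probe(query, centroids, 4))
print("query-aware   :", query_aware_probe(query))
```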