Approximate nearest neighbor search is fundamental in information retrieval. Previous partition-based methods improve search efficiency by probing only a subset of partitions, yet they face two common issues. In the query phase, a common strategy is to probe partitions according to the distance ranks between a query and the partition centroids, which inevitably probes irrelevant partitions because it ignores the data distribution. In the partition construction phase, all partition-based methods face the boundary problem, which scatters a query's nearest neighbors across multiple partitions, resulting in a long-tailed kNN distribution and worsening the optimal nprobe (i.e., the number of partitions to probe). To address these issues, we propose LIRA, a LearnIng-based queRy-aware pArtition framework. Specifically, we propose a probing model that directly probes the partitions containing the kNN of a query, which reduces probing waste and enables query-aware probing with a per-query nprobe. Moreover, we incorporate the probing model into a learning-based redundancy strategy to mitigate the adverse impact of the long-tailed kNN distribution on search efficiency. Extensive experiments on real-world vector datasets demonstrate the superiority of LIRA in the trade-off among accuracy, latency, and query fan-out. The code is available at https://github.com/SimoneZeng/LIRA-ANN-search.
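To make the contrast concrete, the following is a minimal illustrative sketch (not the authors' implementation) of fixed-rank probing versus query-aware probing with a per-query nprobe. The probing model here is a stand-in returning one score per partition; LIRA trains such a model so that partitions likely to contain a query's kNN are probed directly.

```python
# Illustrative sketch: distance-rank probing vs. query-aware probing.
# `probing_scores` is a placeholder for a learned probing model.
import numpy as np

rng = np.random.default_rng(0)
dim, n_partitions = 32, 16
centroids = rng.normal(size=(n_partitions, dim))   # partition centroids
query = rng.normal(size=dim)

def distance_rank_probe(query, centroids, nprobe):
    """Baseline: probe a fixed number of partitions by centroid-distance rank."""
    dists = np.linalg.norm(centroids - query, axis=1)
    return np.argsort(dists)[:nprobe]

def probing_scores(query):
    """Stand-in for a learned probing model: one score per partition,
    interpreted as the likelihood that it holds some of the query's kNN."""
    logits = -np.linalg.norm(centroids - query, axis=1)  # placeholder for model output
    return np.exp(logits) / np.exp(logits).sum()

def query_aware_probe(query, threshold=0.1):
    """Query-aware probing: nprobe varies per query, determined by which
    partitions score above a threshold (falling back to the best one)."""
    scores = probing_scores(query)
    probed = np.flatnonzero(scores >= threshold)
    return probed if probed.size else np.array([int(np.argmax(scores))])

print("fixed nprobe=4:", distance_rank_probe(query, centroids, 4))
print("query-aware   :", query_aware_probe(query))
```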