Reverse k-nearest neighbor (RkNN) search returns all data points that regard a query vector as one of their k-nearest neighbors (kNNs). Existing RkNN methods typically follow a filter-and-verification framework: vectors near the query vector are first collected as candidates, and each candidate is then verified against its kNN-radius (i.e., the distance to its k-th nearest neighbor). However, existing methods face two key limitations in high-dimensional spaces. First, nearby vectors often do not belong to the query's true RkNN set, resulting in excessive candidate expansion overhead. Second, existing methods compute the kNN-radius online during verification, incurring substantial query-processing cost. To address these limitations, we propose HRNN, a hybrid graph index for approximate RkNN search. (1) Rather than directly treating nearby vectors as RkNN candidates, HRNN uses them as proxy points, based on the assumption that a query's RkNN results can often be discovered through the RkNN results of its nearby vectors. (2) To reduce verification cost, HRNN materializes high-fidelity kNN-radii offline, eliminating expensive online reconstruction while preserving accuracy. HRNN combines a navigation graph, a ranked kNN graph, and reverse-neighbor lists into a hybrid index that supports efficient proxy retrieval, candidate generation, and kNN-radius access. We also develop efficient index construction and append-only maintenance algorithms. Extensive experiments show that HRNN consistently outperforms existing methods, achieving up to one order of magnitude higher throughput. Moreover, HRNN scales to datasets containing up to 10 million high-dimensional vectors while supporting efficient dynamic index maintenance.
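The proxy-based filter-and-verification idea described above can be illustrated with a minimal Python sketch. This is not the HRNN implementation: the function names (`build_index`, `rknn_exact`, `rknn_proxy`) and the `num_proxies` parameter are hypothetical, brute-force scans stand in for the navigation graph and for index construction, and the sketch assumes that a point p belongs to the RkNN set of a query q when dist(p, q) is at most p's kNN-radius.

```python
import math


def build_index(points, k):
    """Offline step (brute force, illustration only; requires len(points) > k).

    Returns ranked kNN lists, kNN-radii, and reverse-neighbor lists.
    """
    n = len(points)
    knn = []                           # knn[i]: ids of i's k nearest neighbors, ranked
    radius = []                        # radius[i]: distance from i to its k-th neighbor
    reverse = [[] for _ in range(n)]   # reverse[j]: ids that list j among their kNNs
    for i in range(n):
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        knn.append([j for _, j in ranked])
        radius.append(ranked[-1][0])
        for _, j in ranked:
            reverse[j].append(i)
    return knn, radius, reverse


def rknn_exact(points, radius, q):
    """Reference answer: check every point against its materialized kNN-radius."""
    return {i for i, p in enumerate(points) if math.dist(p, q) <= radius[i]}


def rknn_proxy(points, radius, reverse, q, num_proxies):
    """Approximate RkNN: nearby points act as proxies, not as the only candidates."""
    # Proxy retrieval: a linear scan stands in for a navigation-graph search.
    proxies = sorted(
        range(len(points)), key=lambda i: math.dist(points[i], q)
    )[:num_proxies]
    # Candidate generation: the proxies plus their reverse neighbors.
    candidates = set(proxies)
    for p in proxies:
        candidates.update(reverse[p])
    # Verification: one distance computation and one radius lookup per candidate.
    return {c for c in candidates if math.dist(points[c], q) <= radius[c]}
```

Because verification uses the same radius test as the reference scan, the proxy-based result in this sketch is always a subset of the exact result; the number of proxies only affects how many true reverse neighbors are reached, which is the recall trade-off an approximate method accepts.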