Approximate nearest neighbor (ANN) search with range filters has recently attracted significant attention. This paper studies a generalized form of the problem: ANN search with exact range-range (RR) predicates on a range-valued attribute, which we call RR filtering ANN (RRANN). Specifically, given $n$ vectors in $\mathbb{R}^d$, each vector $v_i$ is associated with a numeric range $[l_i, r_i]$ that represents, for example, a price range or a time interval. An RRANN query $(v_q, l_q, r_q)$ asks for the $k$ vectors closest to $v_q$ among those satisfying an arbitrary RR predicate defined between the query range $[l_q, r_q]$ and the object range $[l_i, r_i]$. The RR predicate is not fixed in advance, so users can define their own conditions. It may be containment ($[l_i, r_i] \subseteq [l_q, r_q]$ or $[l_q, r_q] \subseteq [l_i, r_i]$), overlap ($l_i \le l_q \le r_i \le r_q$ or $l_q \le l_i \le r_q \le r_i$), or a disjunction of these. RRANN has broad applications in queries over price ranges or time intervals, and it generalizes existing variants of ANN search with range filters. However, existing approaches dedicated to those variants cannot support queries with arbitrary RR predicates. We therefore introduce a new approach, the multi-segment tree graph. It handles arbitrary RR predicates efficiently by avoiding traversal of nodes that do not satisfy the predicate, while keeping index size and construction time on par with state-of-the-art methods for RFANN. Extensive experiments on real-world data demonstrate the effectiveness of our approach on RRANN queries, with speedups of up to 12.5x at the same accuracy as the baselines. Moreover, our approach attains comparable RFANN search performance and notably better IFANN and TSANN search performance than the respective state-of-the-art approaches. Our code is available at https://github.com/FanEDG/MSTG.
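To make the query semantics concrete, the following is a minimal sketch of the four RR predicates listed above together with an exact brute-force reference search. It is not the multi-segment tree graph: it simply filters every object by the predicate and scans the survivors, which is the behavior an index is meant to approximate more cheaply. All function names (`object_in_query`, `query_in_object`, `overlap_left`, `overlap_right`, `any_of`, `rrann_brute_force`) are hypothetical and chosen for illustration, and Euclidean distance is assumed as the metric.

```python
import heapq
import math
from typing import Callable, List, Sequence, Tuple

Range = Tuple[float, float]                    # (l, r) with l <= r
Predicate = Callable[[Range, Range], bool]     # (object range, query range) -> bool


def object_in_query(obj: Range, qry: Range) -> bool:
    """Containment: [l_i, r_i] is a subset of [l_q, r_q]."""
    (l_i, r_i), (l_q, r_q) = obj, qry
    return l_q <= l_i and r_i <= r_q


def query_in_object(obj: Range, qry: Range) -> bool:
    """Containment: [l_q, r_q] is a subset of [l_i, r_i]."""
    (l_i, r_i), (l_q, r_q) = obj, qry
    return l_i <= l_q and r_q <= r_i


def overlap_left(obj: Range, qry: Range) -> bool:
    """Overlap: l_i <= l_q <= r_i <= r_q."""
    (l_i, r_i), (l_q, r_q) = obj, qry
    return l_i <= l_q <= r_i <= r_q


def overlap_right(obj: Range, qry: Range) -> bool:
    """Overlap: l_q <= l_i <= r_q <= r_i."""
    (l_i, r_i), (l_q, r_q) = obj, qry
    return l_q <= l_i <= r_q <= r_i


def any_of(*preds: Predicate) -> Predicate:
    """Disjunction of RR predicates."""
    return lambda obj, qry: any(p(obj, qry) for p in preds)


def rrann_brute_force(
    vectors: Sequence[Sequence[float]],
    ranges: Sequence[Range],
    v_q: Sequence[float],
    q_range: Range,
    predicate: Predicate,
    k: int,
) -> List[int]:
    """Exact reference answer: indices of the k nearest vectors (Euclidean,
    an assumption) among the objects whose range satisfies the predicate."""
    candidates = (
        (math.dist(v, v_q), i)
        for i, (v, r) in enumerate(zip(vectors, ranges))
        if predicate(r, q_range)
    )
    return [i for _, i in heapq.nsmallest(k, candidates)]
```

A linear scan like this costs one distance computation per qualifying object, so it is mainly useful as ground truth when measuring the accuracy of an approximate index on a given predicate.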