UNIFY: Unified Index for Range Filtered Approximate Nearest Neighbors Search

This paper presents an efficient and scalable framework for Range Filtered Approximate Nearest Neighbors Search (RF-ANNS) over high-dimensional vectors associated with attribute values. Given a query vector $q$ and a range $[l, h]$, RF-ANNS aims to find the approximate $k$ nearest neighbors of $q$ among the data objects whose attribute values fall within $[l, h]$. Existing methods use pre-, post-, or hybrid filtering strategies, which perform attribute range filtering before, after, or during the ANNS process, respectively, and all of them suffer from significant performance degradation when query ranges shift. Although building a dedicated index for each strategy and selecting the best one based on the query range can address this problem, it leads to index consistency and maintenance issues. Our framework, called UNIFY, constructs a unified Proximity Graph-based (PG-based) index that seamlessly supports all three strategies. In UNIFY, we introduce SIG, a novel Segmented Inclusive Graph, which segments the dataset by attribute values. SIG ensures that the PG of the objects from any combination of segments is a sub-graph of SIG, thereby enabling efficient hybrid filtering by reconstructing and searching a PG from the relevant segments. Moreover, we present the Hierarchical Segmented Inclusive Graph (HSIG), a variant of SIG that incorporates a hierarchical structure inspired by HNSW to achieve logarithmic hybrid filtering complexity. We also implement pre- and post-filtering for HSIG by fusing skip list connections and compressed HNSW edges into the hierarchical graph. Experimental results show that UNIFY delivers state-of-the-art RF-ANNS performance across small, mid, and large query ranges.
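To make the RF-ANNS problem statement and the pre-/post-filtering strategies concrete, the following is a minimal brute-force sketch in Python. It is not the UNIFY index or any of its components: an exact linear scan stands in for the ANNS step, and the function names, the `expand` parameter, and the toy data are all assumptions introduced for illustration.

```python
import heapq
import math

# Toy dataset (hypothetical): 2-D vectors, each with one numeric attribute.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (0.5, 0.5), (5.0, 5.0)]
attrs = [10, 20, 30, 40, 50, 60]


def pre_filter_knn(data, attrs, q, lo, hi, k):
    """Pre-filtering: keep objects with attribute in [lo, hi], then rank by distance."""
    candidates = [i for i, a in enumerate(attrs) if lo <= a <= hi]
    return heapq.nsmallest(k, candidates, key=lambda i: math.dist(data[i], q))


def post_filter_knn(data, attrs, q, lo, hi, k, expand=2):
    """Post-filtering: fetch k * expand unfiltered neighbors, then drop out-of-range ones.

    A real system would fetch the neighbors from an ANN index; the exact scan
    here is only a stand-in. `expand` is an assumed over-fetch factor.
    """
    fetch = min(len(data), k * expand)
    nearest = heapq.nsmallest(
        fetch, range(len(data)), key=lambda i: math.dist(data[i], q)
    )
    return [i for i in nearest if lo <= attrs[i] <= hi][:k]


if __name__ == "__main__":
    q = (0.0, 0.0)
    # Wide range: both strategies return the same two in-range neighbors.
    print(pre_filter_knn(data, attrs, q, 20, 40, 2))
    print(post_filter_knn(data, attrs, q, 20, 40, 2))
    # Narrow range: the over-fetched neighbors are all out of range,
    # so post-filtering returns nothing.
    print(pre_filter_knn(data, attrs, q, 60, 60, 1))
    print(post_filter_knn(data, attrs, q, 60, 60, 1))
```

In this toy setting the two strategies agree on the wide range $[20, 40]$, while on the narrow range $[60, 60]$ post-filtering returns no result because every over-fetched neighbor falls outside the range. Pre-filtering, in turn, must rank every in-range object, which becomes expensive as the range grows. This is one simple way to see why a single strategy tends to degrade as query ranges shift.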
