Approximate Nearest Neighbor Search (ANNS) in high-dimensional Euclidean spaces is a fundamental problem with broad applications. Subspace Collision is a recently proposed ANNS framework that offers a new paradigm for similarity search and achieves superior indexing and query performance. However, the subspace collision framework remains data-agnostic and query-oblivious, which leads to imbalanced index construction and wasted query overhead. In this paper, we address these limitations from two aspects. First, we design a subspace-oriented data transformation mechanism that averages the entropies computed over each subspace of the transformed data; this ensures balanced subspace partitioning (in an information-theoretic sense) and enables data-adaptive subspace collision. Second, we present query-aware and scalable query strategies that dynamically allocate overhead for each query and accelerate collision probing within subspaces. Building on these ideas, we propose a data-adaptive and query-aware subspace collision method, abbreviated as TaCo, which achieves efficient and accurate ANN search while maintaining an excellent balance between indexing and query performance. Extensive experiments on real-world datasets demonstrate that, compared to state-of-the-art subspace collision methods, TaCo achieves up to 8x speedup in indexing and reduces the memory footprint to 0.6x, while achieving over 1.5x query throughput. Moreover, TaCo achieves state-of-the-art indexing performance and provides an effective balance between indexing and query efficiency, even when compared with advanced methods beyond the subspace-collision paradigm. This paper was published in SIGMOD 2026.
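To make the notion of entropy-balanced subspaces concrete, the following is a minimal sketch and not TaCo's actual transformation, which the abstract does not specify. It assumes the data has already been decorrelated and is roughly Gaussian per dimension, so that the entropy of a subspace is the sum of its per-dimension entropies. The function names and the greedy heuristic are hypothetical and chosen only for illustration.

```python
import math


def gaussian_entropy(variance):
    """Differential entropy (in nats) of a 1-D Gaussian with the given variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)


def subspace_entropies(variances, assignment, n_subspaces):
    """Sum per-dimension entropies inside each subspace.

    Assumption: dimensions are independent, so entropies add up.
    """
    totals = [0.0] * n_subspaces
    for var, s in zip(variances, assignment):
        totals[s] += gaussian_entropy(var)
    return totals


def contiguous_assignment(dim, n_subspaces):
    """Data-agnostic baseline: cut the dimensions into consecutive blocks."""
    size = math.ceil(dim / n_subspaces)
    return [i // size for i in range(dim)]


def greedy_balanced_assignment(variances, n_subspaces):
    """Hypothetical heuristic: visit dimensions from highest to lowest entropy
    and put each one into the lowest-entropy subspace that still has room."""
    dim = len(variances)
    capacity = math.ceil(dim / n_subspaces)
    totals = [0.0] * n_subspaces
    counts = [0] * n_subspaces
    assignment = [0] * dim
    # Entropy grows monotonically with variance, so sorting by variance suffices.
    order = sorted(range(dim), key=lambda i: variances[i], reverse=True)
    for i in order:
        open_subspaces = [s for s in range(n_subspaces) if counts[s] < capacity]
        target = min(open_subspaces, key=lambda s: totals[s])
        assignment[i] = target
        totals[target] += gaussian_entropy(variances[i])
        counts[target] += 1
    return assignment


if __name__ == "__main__":
    # Made-up per-dimension variances with a skewed spectrum.
    variances = [16.0, 9.0, 8.0, 4.0, 2.0, 1.0, 0.5, 0.25]
    for name, assign in [
        ("contiguous", contiguous_assignment(len(variances), 2)),
        ("balanced", greedy_balanced_assignment(variances, 2)),
    ]:
        h = subspace_entropies(variances, assign, 2)
        mean = sum(h) / len(h)
        print(name, [round(x, 3) for x in h], "mean:", round(mean, 3),
              "spread:", round(max(h) - min(h), 3))
```

Under these assumptions the mean subspace entropy is the same for both assignments, since the total entropy is only redistributed. What changes is how far each subspace deviates from that mean, which is the imbalance a data-adaptive partitioning would aim to reduce.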