Hybrid Approximate Nearest Neighbor Search (Hybrid ANNS) is a foundational search technology for large-scale heterogeneous data and has attracted significant attention in both academia and industry. However, current approaches overlook heterogeneity in data distribution and therefore leave two major challenges unaddressed: the Compatibility Barrier for Similarity Magnitude Heterogeneity and the Tolerance Bottleneck to Attribute Cardinality. To overcome these issues, we propose the robuSt heTerogeneity-Aware hyBrid retrievaL framEwork (STABLE), designed for accurate, efficient, and robust hybrid ANNS on datasets with varied distributions. Specifically, we introduce an enhAnced heterogeneoUs semanTic perceptiOn (AUTO) metric that jointly measures feature similarity and attribute consistency, addressing similarity magnitude heterogeneity and improving robustness across datasets with different attribute cardinalities. We then construct the Heterogeneous sEmantic reLation graPh (HELP) index on top of AUTO to organize heterogeneous semantic relations. Finally, we employ a novel Dynamic Heterogeneity Routing method to keep search efficient. Extensive experiments on five feature vector benchmarks with various attribute cardinalities demonstrate the superior performance of STABLE.