Similarity search, the task of identifying the objects most similar to a given query object under a specific metric, has attracted significant attention due to its practical applications. However, existing CPU-based methods are inefficient for two reasons: general metric spaces provide no coordinate information that could accelerate the search, and measuring object similarity is computationally expensive. These methods also struggle to meet the demand for high-throughput data management. To address these challenges, we propose GTS, a GPU-based tree index designed for the parallel processing of similarity search in general metric spaces, where only the distance metric for measuring object similarity is known. The GTS index uses a pivot-based tree structure to prune objects efficiently and employs list tables to facilitate GPU computing. To manage concurrent similarity queries efficiently with limited GPU memory, we develop a two-stage search method that combines batch processing and sequential strategies to optimize memory usage. We also introduce an effective update strategy for the proposed GPU-based index, covering both streaming data updates and batch data updates. Additionally, we present a cost model to evaluate search performance. Extensive experiments on five real-life datasets demonstrate that GTS achieves efficiency gains of up to two orders of magnitude over existing CPU baselines and up to 20x over state-of-the-art GPU-based methods.
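To make the idea of pivot-based pruning concrete, the following minimal sketch shows the triangle-inequality filter that pivot-based metric indexes generally rely on. It is not the GTS index itself: it runs on the CPU, uses a single pivot and a flat table rather than a tree, and the names `edit_distance`, `build_pivot_table`, and `range_query` are hypothetical. Edit distance is used as the example metric because it is defined on strings, which have no coordinates.

```python
from typing import Callable, List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: a metric on strings that needs no coordinates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_pivot_table(
    objects: Sequence[str], pivot: str, dist: Callable[[str, str], int]
) -> List[int]:
    """Precompute d(o, pivot) for every object (done once, at index time)."""
    return [dist(o, pivot) for o in objects]


def range_query(
    objects: Sequence[str],
    pivot: str,
    pivot_dists: Sequence[int],
    query: str,
    radius: int,
    dist: Callable[[str, str], int],
) -> Tuple[List[str], int]:
    """Return (objects within `radius` of `query`, distance computations used)."""
    d_qp = dist(query, pivot)
    computed = 1  # the query-to-pivot distance
    matches = []
    for obj, d_op in zip(objects, pivot_dists):
        # Triangle inequality: d(q, o) >= |d(q, p) - d(o, p)|.
        # If that lower bound already exceeds the radius, skip the object
        # without paying for the expensive distance computation.
        if abs(d_qp - d_op) > radius:
            continue
        computed += 1
        if dist(query, obj) <= radius:
            matches.append(obj)
    return matches, computed


if __name__ == "__main__":
    words = ["kitten", "sitting", "mitten", "database", "metric", "bitten"]
    table = build_pivot_table(words, "kitten", edit_distance)
    found, cost = range_query(words, "kitten", table, "fitten", 1, edit_distance)
    print(found, cost)
```

The filter never discards a true result, because it only skips objects whose lower bound on the distance is already larger than the query radius. The saving comes from replacing full distance computations with a subtraction against the precomputed table.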