Similarity search finds objects that are similar to a given query object under a similarity metric. As data continue to grow in volume and variety, similarity search in metric spaces has gained significant attention. Metric spaces can accommodate any type of data and support flexible distance metrics, which makes similarity search in metric spaces useful for many real-world applications, such as multimedia retrieval, personalized recommendation, trajectory analytics, data mining, decision planning, and distributed servers. However, existing studies mostly focus on indexing metric spaces on a single machine, which limits efficiency and scalability as data volume and query load increase. Recent work on similarity search has turned towards distributed methods, but these face challenges including inefficient local data management, unbalanced workload, and low concurrent search efficiency. To this end, we propose DIMS, an efficient Distributed Index for similarity search in Metric Spaces. First, we design a novel three-stage heterogeneous partition to achieve workload balance. Then, we present an effective three-stage indexing structure to manage objects efficiently. We also develop concurrent search methods with filtering and validation techniques that support efficient distributed similarity search. Additionally, we devise a cost-based optimization model to balance communication and computation costs. Extensive experiments demonstrate that DIMS significantly outperforms existing distributed similarity search approaches.
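To make the filtering-and-validation idea concrete, the following is a minimal single-machine sketch of a metric range query with pivot-based pruning. It is not the DIMS algorithm, whose partitioning, index layout, and cost model are not specified in this abstract. It only illustrates the generic technique, assuming a distance function that satisfies the triangle inequality. The names `edit_distance`, `build_pivot_table`, and `range_query`, the word list, and the choice of pivots are all hypothetical and introduced here for illustration.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Metric = Callable[[T, T], float]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, a metric on strings (illustrative choice)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def build_pivot_table(objects: Sequence[T], pivots: Sequence[T],
                      dist: Metric) -> List[List[float]]:
    """Precompute the distance from every object to every pivot."""
    return [[dist(o, p) for p in pivots] for o in objects]


def range_query(query: T, radius: float, objects: Sequence[T],
                pivots: Sequence[T], table: List[List[float]],
                dist: Metric) -> Tuple[List[T], int]:
    """Return objects within `radius` of `query` and the number verified."""
    q_to_pivots = [dist(query, p) for p in pivots]
    results: List[T] = []
    verified = 0
    for obj, row in zip(objects, table):
        # Filtering: by the triangle inequality, |d(q,p) - d(o,p)| <= d(q,o),
        # so the largest such gap over all pivots is a lower bound on d(q,o).
        lower_bound = max(abs(qp - op) for qp, op in zip(q_to_pivots, row))
        if lower_bound > radius:
            continue  # safely pruned without computing d(q, o)
        # Validation: compute the real distance only for surviving candidates.
        verified += 1
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, verified


if __name__ == "__main__":
    words = ["apple", "apply", "ample", "banana", "bandana", "orange"]
    pivots = ["apple", "banana"]  # hypothetical pivot choice
    table = build_pivot_table(words, pivots, edit_distance)
    hits, checked = range_query("appel", 2, words, pivots, table, edit_distance)
    print(hits, f"(verified {checked} of {len(words)} objects)")
```

Because the filter uses only a lower bound, it never discards a true result; it can only reduce the number of exact distance computations. A distributed index would apply the same kind of bound at coarser granularity, for example to skip whole partitions, but how DIMS does this is not described here.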