This paper presents GMASK, a general algorithm for distributed approximate similarity search that accepts an arbitrary distance function. GMASK requires a clustering algorithm that induces Voronoi regions in a dataset and returns a representative element for each region. From these, it builds a multilevel indexing structure suited to large, high-dimensional, sparse datasets, which are usually stored in distributed systems. Many similarity search algorithms rely on $k$-means, which is typically tied to the Euclidean distance and is therefore inappropriate for some problems. Instead, in this work we implement GMASK using $k$-medoids, making it compatible with any distance and with a wider range of problems. Experimental results on real datasets verify the applicability of the method, which outperforms alternative algorithms for approximate similarity search. In addition, the results confirm existing intuitions regarding the advantages of using certain instances of the Minkowski distance in high-dimensional datasets.
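To make the ingredients named in the abstract concrete, the following is a minimal single-machine sketch, not the GMASK algorithm itself. It assumes a simple alternating $k$-medoids heuristic, a top-down recursive construction of the levels, and a search that follows the closest medoids at each level. The function names (`minkowski`, `k_medoids`, `build_index`, `search`) and the parameters `leaf_size` and `n_probe` are hypothetical and chosen only for illustration; the way GMASK actually builds and distributes its levels may differ.

```python
import random


def minkowski(p):
    """Return a Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    def dist(a, b):
        return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)
    return dist


def assign(points, medoids, dist):
    """Put each point in the Voronoi region of its nearest medoid."""
    clusters = [[] for _ in medoids]
    for pt in points:
        j = min(range(len(medoids)), key=lambda i: dist(pt, medoids[i]))
        clusters[j].append(pt)
    return clusters


def k_medoids(points, k, dist, iters=20, seed=0):
    """Alternating k-medoids heuristic (an assumption, not the paper's variant)."""
    rng = random.Random(seed)
    medoids = rng.sample(points, min(k, len(points)))
    for _ in range(iters):
        clusters = assign(points, medoids, dist)
        # New medoid: the member minimising total distance to its region.
        updated = [
            min(c, key=lambda x: sum(dist(x, y) for y in c)) if c else m
            for m, c in zip(medoids, clusters)
        ]
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(points, medoids, dist)


def build_index(points, k, dist, leaf_size=8):
    """Recursively split regions until they are small (hypothetical layout)."""
    if len(points) <= leaf_size:
        return {"points": points}
    medoids, clusters = k_medoids(points, k, dist)
    children = [(m, c) for m, c in zip(medoids, clusters) if c]
    if len(children) < 2:  # clustering could not split the region further
        return {"points": points}
    return {"children": [(m, build_index(c, k, dist, leaf_size))
                         for m, c in children]}


def search(index, query, dist, n_probe=2):
    """Approximate nearest neighbour: descend into the n_probe closest regions."""
    if "points" in index:
        return min(index["points"], key=lambda p: dist(query, p))
    nearest = sorted(index["children"], key=lambda mc: dist(query, mc[0]))
    candidates = [search(child, query, dist, n_probe)
                  for _, child in nearest[:n_probe]]
    return min(candidates, key=lambda p: dist(query, p))


# Usage on synthetic data with the Manhattan distance (Minkowski, p = 1).
rng = random.Random(42)
data = [tuple(rng.random() for _ in range(8)) for _ in range(200)]
d1 = minkowski(1)
index = build_index(data, k=4, dist=d1)
neighbour = search(index, data[0], d1)
```

Because the sketch only ever calls `dist(a, b)`, any distance function can be passed in place of `minkowski(p)`, which is the property that motivates $k$-medoids over $k$-means in the abstract. Raising `n_probe` visits more regions per level and trades query time for accuracy.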