This paper studies density-based clustering of point sets. These methods use dense regions of points to detect clusters of arbitrary shapes. In particular, we study variants of density peaks clustering (DPC), a popular type of algorithm that has been shown to work well in practice. Our goal is to cluster large high-dimensional datasets, which are prevalent in practice. Prior solutions are either sequential and cannot scale to large data, or are specialized for low-dimensional data. This paper unifies the different variants of density peaks clustering into a single framework, PECANN, by abstracting out several key steps common to this class of algorithms. One such key step is to find nearest neighbors that satisfy a predicate function, and one of the main contributions of this paper is an efficient way to do this predicate search using graph-based approximate nearest neighbor search (ANNS). To provide ample parallelism, we propose a doubling search technique that enables points to find an approximate nearest neighbor satisfying the predicate in a small number of rounds. Our technique can be applied to many existing graph-based ANNS algorithms, which can all be plugged into PECANN. We implement five clustering algorithms with PECANN and evaluate them on synthetic and real-world datasets with up to 1.28 million points and up to 1024 dimensions on a 30-core machine with two-way hyper-threading. Compared to the state-of-the-art FASTDP algorithm for high-dimensional density peaks clustering, which is sequential, our best algorithm is 45x to 734x faster while achieving competitive ARI scores. Compared to the state-of-the-art parallel DPC-based algorithm, which is optimized for low dimensions, we show that PECANN is two orders of magnitude faster. As far as we know, our work is the first to evaluate DPC variants on large high-dimensional real-world image and text embedding datasets.
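The doubling search idea can be illustrated with a minimal sketch. This is not the PECANN implementation: it is sequential, it uses an exact brute-force k-nearest-neighbor query as a stand-in for a graph-based ANNS index, and it assumes a k-th-nearest-neighbor density (the inverse of the distance to the k-th neighbor), which the abstract does not specify. All function names and parameters (`knn`, `doubling_search`, `dependent_points`, `k0`, `k_density`) are hypothetical. The predicate shown is the one typical of density peaks clustering: "the candidate has higher density than the query point".

```python
import math


def knn(points, q, k):
    """Stand-in for an ANNS index query (assumption): the exact k nearest
    neighbors of points[q], excluding q, sorted by increasing distance."""
    others = (i for i in range(len(points)) if i != q)
    return sorted(others, key=lambda i: math.dist(points[q], points[i]))[:k]


def doubling_search(points, q, predicate, k0=2):
    """Return the nearest neighbor of q that satisfies `predicate`, or None.

    Each round queries k neighbors; if none passes the predicate, k is
    doubled, so at most O(log n) rounds are needed per point.
    """
    n = len(points)
    k = k0
    while True:
        for i in knn(points, q, k):  # candidates arrive nearest-first
            if predicate(i):
                return i
        if k >= n - 1:  # every other point was already examined
            return None
        k = min(2 * k, n - 1)


def dependent_points(points, k_density=2):
    """Compute an assumed kNN density and each point's dependent point
    (its nearest neighbor of strictly higher density, ties broken by index)."""
    rho = []
    for i in range(len(points)):
        kth = knn(points, i, k_density)[-1]
        rho.append(1.0 / math.dist(points[i], points[kth]))

    deps = []
    for q in range(len(points)):
        def higher_density(i, q=q):
            return (rho[i], i) > (rho[q], q)
        deps.append(doubling_search(points, q, higher_density))
    return rho, deps


# Two small square clusters, each with a center point; the second is sparser.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 12), (12, 10), (12, 12), (11, 11)]
rho, deps = dependent_points(points)
print(deps)
```

In this toy input, most points find a denser neighbor in the first round, while the center of the sparser cluster needs several doublings before a denser point in the other cluster enters its candidate list. The point with the globally highest density has no dependent point and gets `None`. In a parallel setting, each round could process all still-unresolved points at once, which is where the small number of rounds matters.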