Clustering is an essential data mining tool for analyzing and grouping similar objects. In big data applications, however, many clustering algorithms are infeasible due to their high memory requirements and/or unfavorable runtime complexity. In contrast, Contraction Clustering (RASTER) is a single-pass algorithm for identifying density-based clusters with linear time complexity. Because of its favorable runtime and constant memory requirements, the algorithm is well suited to big data applications, where the amount of data to be processed is very large. It consists of two steps: (1) a contraction step, which projects objects onto tiles, and (2) an agglomeration step, which groups tiles into clusters. The algorithm is extremely fast in both sequential and parallel execution. Our quantitative evaluation shows that a sequential implementation of RASTER performs significantly better than various standard clustering algorithms. Furthermore, the parallel speedup is significant: on a contemporary workstation, an implementation in Rust processes a batch of 500 million points with 1 million clusters in less than 50 seconds on one core. With 8 cores, the algorithm is about four times faster.
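The two steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (which the abstract says is written in Rust), and several details are assumptions made for illustration only: points are 2D, tiles are obtained by truncating coordinates at a decimal `precision`, a tile counts as dense when it holds at least `threshold` points, tiles are adjacent when they touch in the 8-neighborhood, and clusters with fewer than `min_size` tiles are discarded. The function and parameter names are hypothetical.

```python
from collections import Counter
import math


def contract(points, precision=1, threshold=2):
    """Contraction (sketch): project each point onto a grid tile in a
    single pass and keep only tiles holding at least `threshold` points."""
    scale = 10 ** precision  # assumed tiling rule: truncate at `precision` decimals
    counts = Counter(
        (math.floor(x * scale), math.floor(y * scale)) for x, y in points
    )
    return {tile for tile, n in counts.items() if n >= threshold}


def agglomerate(tiles, min_size=2):
    """Agglomeration (sketch): merge adjacent dense tiles into clusters
    and drop clusters with fewer than `min_size` tiles."""
    remaining = set(tiles)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            tx, ty = frontier.pop()
            # Assumed adjacency: the 8 tiles surrounding (tx, ty).
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbor = (tx + dx, ty + dy)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        cluster.add(neighbor)
                        frontier.append(neighbor)
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters


if __name__ == "__main__":
    sample = [(0.11, 0.12), (0.15, 0.18), (0.21, 0.13), (0.25, 0.19), (5.01, 5.02)]
    dense_tiles = contract(sample, precision=1, threshold=2)
    print(agglomerate(dense_tiles, min_size=2))
```

In this sketch each point is visited once during contraction, and the agglomeration step works only on the set of dense tiles, so its cost depends on the number of tiles rather than the number of points.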