This paper introduces a new formulation of the clustering problem, Minimum Sum-of-Squares Clustering of Infinitely Tall Data (MSSC-ITD), and presents HPClust, a set of hybrid parallel approaches for solving it. Using modern high-performance computing techniques, HPClust improves three key clustering metrics: effectiveness, computational efficiency, and scalability. In contrast to vanilla data parallelism, which only shortens processing time through the MapReduce framework, our approach achieves better performance by exploiting multi-strategy competitive-cooperative parallelism and the properties of the objective function landscape. Unlike other available algorithms, which struggle to scale, our algorithm is inherently parallel: its solution quality improves as scalability and parallelism increase, and it outperforms even advanced algorithms designed for small and medium-sized datasets. Our evaluation of HPClust with four parallel strategies shows that it outperforms both traditional and state-of-the-art methods on these key metrics. The results also show that parallel processing improves not only clustering efficiency but also accuracy. Additionally, we examine the trade-off between computational efficiency and clustering quality, and give guidance on choosing a parallel strategy based on dataset characteristics and available resources. This research advances the understanding of parallelism in clustering algorithms, demonstrating that a judicious hybridization of advanced parallel approaches yields the best results for MSSC-ITD. Experiments on synthetic data further confirm HPClust's scalability and robustness to noise.