This study focuses on optimizing the Big-means algorithm for clustering large-scale datasets by exploring four distinct parallelization strategies. We conducted extensive experiments to assess the computational efficiency, scalability, and clustering performance of each approach, revealing their benefits and limitations. The paper also analyzes the trade-offs between computational efficiency and clustering quality, examining the impact of various factors. Our findings provide practical guidance on selecting a parallelization strategy based on available resources and dataset characteristics, contributing to a deeper understanding of parallelization techniques for the Big-means algorithm.