This study focuses on optimizing the Big-means algorithm for clustering large-scale datasets by exploring four distinct parallelization strategies. We conducted extensive experiments to assess the computational efficiency, scalability, and clustering performance of each approach, revealing their benefits and limitations. We also examine the trade-offs between computational efficiency and clustering quality and the factors that influence them. Our findings provide practical guidance on selecting the best parallelization strategy based on available resources and dataset characteristics, contributing to a deeper understanding of parallelization techniques for the Big-means algorithm.