The extensive use of HPC infrastructures and frameworks for running data-intensive applications has led to a growing interest in data partitioning techniques and strategies. Application performance can be heavily affected by how data are partitioned, which in turn depends on the size selected for data blocks, i.e. the block size. Finding an effective partitioning, i.e. a suitable block size, is therefore a key strategy to speed up parallel data-intensive applications and increase scalability. This paper describes BLEST-ML (BLock size ESTimation through Machine Learning), a methodology for block size estimation that relies on supervised machine learning techniques. The proposed methodology was evaluated by designing an implementation tailored to dislib, a distributed computing library focused on machine learning algorithms and built on top of the PyCOMPSs framework. We assessed the effectiveness of this implementation through an extensive experimental evaluation considering different dislib algorithms, datasets, and infrastructures, including the MareNostrum 4 supercomputer. The results show that BLEST-ML can efficiently determine a suitable way to split a given dataset, demonstrating its applicability to enable the efficient execution of data-parallel applications in high-performance environments.