Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science, can be challenging. Data transfer for model training also presents difficulties, affecting scientific fields such as genomics, climate modeling, and astronomy. Large-scale solutions such as Google Pathways, which provides a distributed execution environment for deep learning models, exist but are proprietary. Integrating existing open-source, scalable runtime tools and data frameworks on high-performance computing (HPC) platforms is crucial to addressing these challenges. Our objective is to establish a smooth and unified method of combining data engineering and deep learning frameworks with diverse execution capabilities that can be deployed on various HPC platforms, including clouds and supercomputers. We aim to support heterogeneous systems with accelerators, where Cylon and other data engineering and deep learning frameworks can utilize heterogeneous execution. To achieve this, we propose Radical-Cylon, a heterogeneous runtime system with a parallel and distributed data framework that executes Cylon as a task of Radical Pilot. We explain Radical-Cylon's design and development and the execution process of Cylon tasks using Radical Pilot. This approach enables the use of heterogeneous MPI communicators across multiple nodes. Radical-Cylon achieves better performance than Bare-Metal Cylon with minimal and constant overhead. With the same resources, Radical-Cylon achieves 4% to 15% faster execution time than batch execution when performing similar join and sort operations on 35 million and 3.5 billion rows. The approach aims to excel on both scientific and engineering research HPC systems while demonstrating robust performance on cloud infrastructures. This dual capability fosters collaboration and innovation within the open-source scientific research community.