Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science, can be challenging. Data transfer for model training also presents difficulties, affecting scientific fields such as genomics, climate modeling, and astronomy. A large-scale solution such as Google Pathways, which provides a distributed execution environment for deep learning models, exists but is proprietary. Integrating existing open-source, scalable runtime tools and data frameworks on high-performance computing (HPC) platforms is crucial to addressing these challenges. Our objective is to establish a smooth and unified method of combining data engineering and deep learning frameworks with diverse execution capabilities that can be deployed on various HPC platforms, including clouds and supercomputers. We aim to support heterogeneous systems with accelerators, where Cylon and other data engineering and deep learning frameworks can utilize heterogeneous execution. To achieve this, we propose Radical-Cylon, a heterogeneous runtime system with a parallel and distributed data framework that executes Cylon as a task of Radical Pilot. We explain in detail the design and development of Radical-Cylon and the execution process of Cylon tasks using Radical Pilot. This approach enables the use of heterogeneous MPI communicators across multiple nodes. Radical-Cylon achieves better performance than Bare-Metal Cylon with minimal and constant overhead. With the same resources, Radical-Cylon achieves 4% to 15% faster execution time than batch execution while performing similar join and sort operations on 35 million and 3.5 billion rows. The approach aims to excel on HPC systems used for both scientific and engineering research while demonstrating robust performance on cloud infrastructures. This dual capability fosters collaboration and innovation within the open-source scientific research community.