Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science, can be challenging. Data transfer for model training also presents difficulties, affecting scientific fields such as genomics, climate modeling, and astronomy. Large-scale solutions such as Google Pathways provide a distributed execution environment for deep learning models, but they are proprietary. Integrating existing open-source, scalable runtime tools and data frameworks on high-performance computing (HPC) platforms is therefore crucial to addressing these challenges. Our objective is to establish a smooth, unified method of combining data engineering and deep learning frameworks with diverse execution capabilities that can be deployed on various HPC platforms, including clouds and supercomputers. We aim to support heterogeneous systems with accelerators, where Cylon and other data engineering and deep learning frameworks can take advantage of heterogeneous execution. To achieve this, we propose Radical-Cylon, a heterogeneous runtime system with a parallel and distributed data framework that executes Cylon as a task of Radical Pilot. We explain Radical-Cylon's design and development in detail, along with the execution process of Cylon tasks using Radical Pilot. This approach enables the use of heterogeneous MPI communicators across multiple nodes. Radical-Cylon achieves better performance than Bare-Metal Cylon with minimal and constant overhead. With the same resources, Radical-Cylon achieves 4% to 15% faster execution time than batch execution when performing similar join and sort operations on 35 million and 3.5 billion rows. The approach aims to excel on both scientific and engineering research HPC systems while demonstrating robust performance on cloud infrastructures. This dual capability fosters collaboration and innovation within the open-source scientific research community.