Exascale computers offer transformative capabilities to combine data-driven and learning-based approaches with traditional simulation applications, accelerating scientific discovery and insight. These combinations and integrations are difficult to achieve, however, because heterogeneous software components must be coordinated and deployed on diverse and massive platforms. We present the ExaWorks project, which addresses many of these challenges. We developed a workflow Software Development Toolkit (SDK): a curated collection of workflow technologies that can be composed and can interoperate through a common interface, are engineered following current best practices, and are specifically designed to work on HPC platforms. ExaWorks also developed PSI/J, a job management abstraction API that simplifies the construction of portable software components and applications that run across various HPC schedulers. The PSI/J API is a minimal interface for submitting jobs and monitoring their execution state across multiple commonly used HPC schedulers. We also describe several leading and innovative workflow examples that use ExaWorks tools on DOE leadership platforms. Finally, we discuss how our project is working with the workflow community, large computing facilities, and HPC platform vendors to sustainably address the requirements of workflows at the exascale.
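To make the idea of a job management abstraction concrete, the following is a minimal sketch of what a scheduler-neutral submit-and-monitor interface might look like. It is not the PSI/J API: every name here (`JobSpec`, `Job`, `JobState`, `Executor`, `LocalExecutor`, `get_executor`, `run_hello`) is hypothetical and chosen only for illustration, and the single backend simply runs a local subprocess, standing in for a real HPC scheduler.

```python
import enum
import subprocess
import sys
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class JobState(enum.Enum):
    """Hypothetical, simplified job life cycle."""
    NEW = "new"
    ACTIVE = "active"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class JobSpec:
    """Scheduler-independent description of what to run."""
    executable: str
    arguments: List[str] = field(default_factory=list)


@dataclass
class Job:
    """A job plus the execution state observed so far."""
    spec: JobSpec
    state: JobState = JobState.NEW
    exit_code: Optional[int] = None
    stdout: str = ""


class Executor:
    """Backend-neutral interface; one subclass per scheduler (assumption)."""

    def submit(self, job: Job) -> None:
        raise NotImplementedError

    def wait(self, job: Job) -> JobState:
        raise NotImplementedError


class LocalExecutor(Executor):
    """Stand-in backend that launches the job as a local subprocess."""

    def __init__(self) -> None:
        self._procs: Dict[int, subprocess.Popen] = {}

    def submit(self, job: Job) -> None:
        cmd = [job.spec.executable, *job.spec.arguments]
        self._procs[id(job)] = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, text=True
        )
        job.state = JobState.ACTIVE

    def wait(self, job: Job) -> JobState:
        proc = self._procs.pop(id(job))
        job.stdout, _ = proc.communicate()  # blocks until the process exits
        job.exit_code = proc.returncode
        job.state = (
            JobState.COMPLETED if proc.returncode == 0 else JobState.FAILED
        )
        return job.state


def get_executor(name: str) -> Executor:
    """Select a backend by name; only 'local' exists in this sketch."""
    registry = {"local": LocalExecutor}
    return registry[name]()


def run_hello(backend: str = "local") -> Job:
    """Submit a trivial job and wait for it, without naming the backend class."""
    executor = get_executor(backend)
    job = Job(JobSpec(sys.executable, ["-c", "print('hello')"]))
    executor.submit(job)
    executor.wait(job)
    return job
```

In this sketch the calling code depends only on the backend name passed to `get_executor`, so adding support for another scheduler would mean registering one more `Executor` subclass while `run_hello` stays unchanged. That separation is the kind of portability the abstract describes, although the actual PSI/J design may differ in its details.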