It is generally desirable for high-performance computing (HPC) applications to be portable between HPC systems, for example to make use of more performant hardware, to use allocations effectively, and to co-locate compute jobs with large datasets. Unfortunately, moving scientific applications between HPC systems is challenging for various reasons, most notably because different HPC systems run different schedulers. We introduce PSI/J, a job management abstraction API intended to simplify the construction of software components and applications that are portable across various HPC scheduler implementations. We argue both that such a system is necessary and that no viable alternative currently exists. We analyze similar notable APIs and attempt to determine the factors that influenced their evolution and adoption by the HPC community. We base the design of PSI/J on that analysis. We describe how PSI/J has been integrated into three workflow systems and one application, and show via experiments that PSI/J imposes minimal overhead.