It is generally desirable for high-performance computing (HPC) applications to be portable between HPC systems, for example to make use of more performant hardware, to make effective use of allocations, and to co-locate compute jobs with large datasets. Unfortunately, moving scientific applications between HPC systems is challenging for various reasons, most notably that HPC systems use different schedulers. We introduce PSI/J, a job management abstraction API intended to simplify the construction of software components and applications that are portable across HPC scheduler implementations. We argue both that such a system is necessary and that no viable alternative currently exists. We analyze notable similar APIs and attempt to determine the factors that influenced their evolution and adoption by the HPC community, and we base the design of PSI/J on that analysis. We describe how PSI/J has been integrated into three workflow systems and one application, and show experimentally that PSI/J imposes minimal overhead.