COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and the broad response to it from the scientific community have forged new relationships among domain experts, mathematical modelers, and scientific computing specialists. Computationally, however, the pandemic also revealed critical gaps in researchers' ability to exploit advanced computing systems. The challenges include gaining access to scalable computing systems, porting models and workflows to new systems, sharing data of varying sizes, and producing results that can be reproduced and validated by others. Informed by our team's work in supporting public health decision makers during the COVID-19 pandemic and by the identified capability gaps in applying high-performance computing (HPC) to the modeling of complex social systems, we present the goals, requirements, and initial implementation of OSPREY, an open science platform for robust epidemic analysis. The prototype implementation demonstrates an integrated, algorithm-driven HPC workflow architecture that coordinates tasks across federated HPC resources, with robust, secure, and automated access to each resource. We demonstrate scalable and fault-tolerant task execution, an asynchronous API to support fast time-to-solution algorithms, an inclusive, multi-language approach, and efficient wide-area data management. The example OSPREY code is available in a public repository.