Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often need diverse kinds of provenance (e.g., the origins of data products or the usage patterns of datasets). Unfortunately, existing provenance solutions cannot meet these needs because of incompatible provenance models and/or system implementations. In this paper, we analyze four representative scientific workflows in collaboration with domain scientists to identify concrete provenance needs. Based on this first-hand analysis, we propose a provenance framework called PROV-IO+, which includes an I/O-centric provenance model for precisely describing scientific data and the associated I/O operations and environments. Moreover, we build a prototype of PROV-IO+ that enables end-to-end provenance support on real HPC systems with little manual effort. The PROV-IO+ framework supports both containerized and non-containerized workflows on different HPC platforms, with flexibility in selecting the classes of provenance to capture. Our experiments with realistic workflows show that PROV-IO+ addresses the provenance needs of the domain scientists effectively with reasonable performance (e.g., less than 3.5% tracking overhead for most experiments). Moreover, PROV-IO+ outperforms a state-of-the-art system (i.e., ProvLake) in our experiments.