Critical goals of scientific computing are to increase scientific rigor, reproducibility, and transparency while keeping up with ever-increasing computational demands. This work presents an integrated framework well suited for data processing and analysis spanning individual, on-premises, and cloud environments. The framework leverages three well-established DevOps tools: 1) Git repositories linked to 2) CI/CD engines operating on 3) containers. It supports the full life cycle of scientific data workflows with minimal friction between stages, including solutions for researchers who generate data. This is achieved by using a single container that supports both local, interactive user sessions and deployment in HPC or Kubernetes clusters. Combined with Git repositories integrated with CI/CD, this approach enables decentralized data pipelines across multiple, arbitrary computational environments. The framework has been successfully deployed and validated within our research group, spanning experimental acquisition systems and computational clusters, with open-source, purpose-built GitLab CI/CD executors for Slurm and Google Kubernetes Engine Autopilot. Taken together, this framework can increase the rigor, reproducibility, and transparency of compute-dependent scientific research.
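The single-container idea described above can be illustrated with a minimal sketch that maps one image reference to a launch command for each kind of environment. This is not the authors' implementation: the image name, the `container_command` helper, the pod name `analysis`, the script `analyze.py`, and the choice of Docker for local sessions and Apptainer under Slurm for HPC are all assumptions made for illustration only.

```python
import shlex
from typing import List

# Hypothetical image reference; the same image is reused in every environment.
IMAGE = "registry.example.org/lab/analysis:1.0"


def container_command(target: str, image: str, cmd: List[str]) -> List[str]:
    """Return the argv that runs `cmd` inside `image` on the given target."""
    if target == "local":
        # Assumption: interactive session on a workstation via Docker.
        return ["docker", "run", "--rm", "-it", image, *cmd]
    if target == "hpc":
        # Assumption: Slurm step that runs the same OCI image through Apptainer.
        return ["srun", "apptainer", "exec", f"docker://{image}", *cmd]
    if target == "kubernetes":
        # Assumption: one-off pod named "analysis" on a Kubernetes cluster.
        return ["kubectl", "run", "analysis", f"--image={image}",
                "--restart=Never", "--", *cmd]
    raise ValueError(f"unknown target: {target}")


if __name__ == "__main__":
    # Print the command each environment would use for the same analysis step.
    for target in ("local", "hpc", "kubernetes"):
        print(shlex.join(container_command(target, IMAGE, ["python", "analyze.py"])))
```

Because only the launcher changes while the image and the analysis command stay fixed, a CI/CD job could in principle select the target per runner without altering the workflow itself.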