Containing the Reproducibility Gap: Automated Repository-Level Containerization for Scholarly Jupyter Notebooks

Computational reproducibility is fundamental to trustworthy science, yet it remains difficult to achieve in practice across research workflows, including Jupyter notebooks published alongside scholarly articles. Environment drift, undocumented dependencies, and implicit execution assumptions frequently prevent independent re-execution of published research. Despite existing reproducibility guidelines, scalable and systematic infrastructure for automated assessment remains limited. We present an automated, web-oriented reproducibility engineering pipeline that reconstructs and evaluates repository-level execution environments for scholarly notebooks. The system performs dependency inference, automated container generation, and isolated execution to approximate each notebook's original computational context. We evaluate the approach on 443 notebooks from 116 GitHub repositories referenced by publications in PubMed Central. Execution outcomes are classified into four categories: resolved environment failures, persistent logic or data errors, reproducibility drift, and container-induced regressions. Our results show that containerization resolves 66.7% of prior dependency-related failures and substantially improves execution robustness. However, a significant reproducibility gap remains: 53.7% of notebooks exhibit low output fidelity, largely due to persistent runtime failures and stochastic non-determinism. These findings indicate that standardized containerization is essential for computational stability but insufficient for full bit-wise reproducibility. The framework offers a scalable solution for researchers, editors, and archivists seeking systematic, automated assessment of computational artifacts.
