Computational reproducibility of Jupyter notebooks from biomedical publications

Jupyter notebooks bundle executable code with its documentation and output in one interactive environment, and they are a popular mechanism for documenting and sharing computational workflows. The reproducibility of the computational aspects of research is a key component of scientific reproducibility, but it has not yet been assessed at scale for Jupyter notebooks associated with biomedical publications. We address computational reproducibility at two levels. First, using fully automated workflows, we analyzed the computational reproducibility of Jupyter notebooks related to publications indexed in PubMed Central. We identified such notebooks by mining the articles' full text, located them on GitHub and re-ran them in an environment as close to the original as possible. We documented reproduction success and exceptions and explored relationships between notebook reproducibility and variables related to the notebooks or publications. Second, this study is itself a reproducibility attempt, since it applied essentially the same methodology to PubMed Central twice over two years. Out of 27271 notebooks from 2660 GitHub repositories associated with 3467 articles, 22578 notebooks were written in Python, including 15817 that had their dependencies declared in standard requirement files and that we attempted to re-run automatically. For 10388 of these, all declared dependencies could be installed successfully, and we re-ran them to assess reproducibility. Of these, 1203 notebooks ran through without any errors, including 879 that produced results identical to those reported in the original notebook and 324 for which our results differed from the originally reported ones. Running the remaining notebooks resulted in exceptions. We zoom in on common problems, highlight trends and discuss potential improvements to Jupyter-related workflows associated with biomedical publications.
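
The counts reported above form a funnel, and the share retained at each stage is easier to see when computed explicitly. The following is a minimal sketch of that arithmetic only. The counts are copied from the abstract, while the stage labels, variable names and the `retention` helper are our own illustrative choices and not part of the study's pipeline.

```python
# Counts copied from the abstract; the stage labels are illustrative shorthand.
FUNNEL = [
    ("notebooks found", 27271),
    ("written in Python", 22578),
    ("dependencies declared", 15817),
    ("dependencies installed", 10388),
    ("ran without errors", 1203),
    ("identical results", 879),
]


def retention(funnel):
    """Return (label, count, share of the previous stage) for every stage after the first."""
    rows = []
    for (_, previous), (label, count) in zip(funnel, funnel[1:]):
        rows.append((label, count, count / previous))
    return rows


# Breakdown of the notebooks that were actually executed.
EXECUTED = 10388
RAN_CLEAN = 1203
IDENTICAL = 879
DIFFERENT = 324
RAISED_EXCEPTION = EXECUTED - RAN_CLEAN  # notebooks whose run ended in an exception

for label, count, share in retention(FUNNEL):
    print(f"{label:<24}{count:>6}  {share:6.1%} of previous stage")

print(f"identical results as a share of executed notebooks: {IDENTICAL / EXECUTED:.1%}")
print(f"notebooks that raised an exception: {RAISED_EXCEPTION}")
```

Each share is expressed relative to the preceding stage rather than to the full set of 27271 notebooks, so the steepest drop (from installed dependencies to error-free execution) stands out directly.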
