A Decision Tree to Shepherd Scientists through Data Retrievability

Reproducibility is a crucial aspect of scientific research: it is the ability to independently replicate experimental results by analysing the same data or repeating the same experiment. Over the years, many works have proposed ways to make experimental results reproducible. However, very few address data reproducibility, defined as the ability of independent researchers to obtain the same dataset that was used as input for the experiments. Addressing data reproducibility properly is crucial because providing a link to the data is often not enough to make the results reproducible; proper metadata (e.g., preprocessing instructions) must also be provided to make a dataset fully reproducible. In this work, we aim to fill this gap by proposing a decision tree to shepherd researchers through the reproducibility of their datasets. In particular, the decision tree guides researchers in identifying whether a dataset is actually reproducible and whether additional metadata (i.e., additional resources needed to reproduce the data) must also be provided. The decision tree will be the foundation of a future application that automates the data reproduction process by providing the necessary metadata based on the particular context (e.g., data availability and data preprocessing). In this paper, we detail the steps to make a dataset retrievable; other crucial aspects of reproducibility (e.g., dataset documentation) are left to future work.
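
To make the idea of a metadata-producing decision tree concrete, here is a minimal Python sketch. It is not the tree proposed in the paper, whose actual questions and branches are not given in this abstract. The class `DatasetContext`, the function `required_metadata`, the three yes/no questions, and the metadata labels are all hypothetical. They are loosely based on the two context factors the abstract names: data availability and data preprocessing.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetContext:
    """Hypothetical answers a researcher gives about a dataset (assumed fields)."""
    publicly_available: bool   # can independent researchers download the raw data?
    has_persistent_link: bool  # is the link a stable identifier rather than an ad hoc URL?
    was_preprocessed: bool     # was the raw data transformed before the experiments?


def required_metadata(ctx: DatasetContext) -> list[str]:
    """Walk an illustrative decision tree and collect the metadata to publish."""
    needed: list[str] = []
    # Branch 1: availability. An unavailable dataset needs instructions
    # for obtaining or regenerating it; an available one needs a stable link.
    if not ctx.publicly_available:
        needed.append("access or data-generation instructions")
    elif not ctx.has_persistent_link:
        needed.append("persistent identifier for the dataset")
    # Branch 2: preprocessing. A link alone is not enough if the data
    # was transformed before being used as experimental input.
    if ctx.was_preprocessed:
        needed.append("preprocessing instructions")
    return needed
```

Under these assumptions, a dataset that is public, has a persistent link, and was used unmodified yields an empty list, meaning the link alone suffices. Any other combination returns the additional resources that would have to accompany the link.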


Related content

Reproducibility


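In computer science, reproducibility means that, as long as the execution environment and initial conditions are the same, a program produces the same result every time it is run, regardless of whether it executes from start to finish without pausing or in a stop-and-go fashion. Reproducibility is one of the important criteria for deciding whether a program can be executed in parallel. More broadly, the reproducibility of a measurement result is the agreement between measurement results of the same measurand obtained under changed measurement conditions. In Chinese, the term is also rendered as 复现性 or 重现性.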