Reproducibility is a crucial aspect of scientific research: it is the ability to independently replicate experimental results by analysing the same data or repeating the same experiment. Over the years, many approaches have been proposed to make experimental results reproducible. However, very few address data reproducibility, defined as the ability of independent researchers to obtain the same dataset used as input for experimentation. Addressing data reproducibility properly is crucial because providing a link to the data is often not enough to make the results reproducible. Proper metadata (e.g., preprocessing instructions) must also be provided to make a dataset fully reproducible. In this work, we aim to fill this gap by proposing a decision tree to shepherd researchers through the reproducibility of their datasets. In particular, the decision tree guides researchers in determining whether a dataset is actually reproducible and whether additional metadata (i.e., additional resources needed to reproduce the data) must also be provided. This decision tree will be the foundation of a future application that automates the data reproduction process by providing the necessary metadata based on the particular context (e.g., data availability and data preprocessing). Note that this paper details the steps to make a dataset retrievable; we will address other crucial aspects of reproducibility (e.g., dataset documentation) in future work.