Obtaining a relevant dataset is central to conducting empirical studies in software engineering. However, in the context of mining software repositories, the lack of appropriate tooling for large-scale mining tasks hinders the creation of new datasets. Moreover, data sources that change over time (e.g., code bases) and undocumented extraction processes make datasets difficult to reproduce. This threatens the quality and reproducibility of empirical studies. In this paper, we propose a tool-supported approach that facilitates the creation of large, tailored datasets while ensuring their reproducibility. We leverage all the sources that feed the append-only Software Heritage archive and are accessible through a unified programming interface to outline a reproducible and generic extraction process. We propose a way to define a unique fingerprint that characterizes a dataset and that, when provided to the extraction process, ensures that the same dataset is extracted. We demonstrate the feasibility of our approach by implementing a prototype, and we show how it can help reduce the limitations researchers face when creating or reproducing datasets.
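The abstract does not say how the fingerprint is computed. As a rough illustration only, the idea can be sketched under two assumptions that are ours, not the paper's: a dataset is fully determined by its selection criteria plus a cut-off timestamp on the append-only archive, and the fingerprint is a hash of a canonical serialization of both. The function name `dataset_fingerprint` and the criteria keys below are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(criteria: dict, archive_timestamp: str) -> str:
    """Return a hex digest identifying a dataset definition.

    Hypothetical scheme: the paper's actual fingerprint format is not
    described in the abstract.
    """
    payload = {"criteria": criteria, "archive_timestamp": archive_timestamp}
    # Sorting keys and fixing separators makes the serialization canonical,
    # so logically equal definitions always hash to the same value.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two equal definitions written with different key order (illustrative keys).
fp_a = dataset_fingerprint({"language": "Java", "min_commits": 100},
                           "2023-01-01T00:00:00Z")
fp_b = dataset_fingerprint({"min_commits": 100, "language": "Java"},
                           "2023-01-01T00:00:00Z")
# Same criteria, later archive cut-off.
fp_c = dataset_fingerprint({"language": "Java", "min_commits": 100},
                           "2024-01-01T00:00:00Z")
```

In this sketch `fp_a` and `fp_b` are identical, while `fp_c` differs because the archive cut-off changed. Because the archive is append-only, pinning the cut-off is what would let a later extraction ignore content added afterwards.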