Fingerprinting and Building Large Reproducible Datasets

Obtaining a relevant dataset is central to conducting empirical studies in software engineering. However, in the context of mining software repositories, the lack of appropriate tooling for large-scale mining tasks hinders the creation of new datasets. Moreover, data sources that change over time (e.g., code bases) and undocumented extraction processes make it difficult to reproduce a dataset later. This threatens the quality and reproducibility of empirical studies. In this paper, we propose a tool-supported approach that facilitates the creation of large tailored datasets while ensuring their reproducibility. We leverage all the sources feeding the Software Heritage append-only archive, which are accessible through a unified programming interface, to outline a reproducible and generic extraction process. We propose a way to define a unique fingerprint that characterizes a dataset and that, when provided to the extraction process, ensures that the same dataset will be extracted. We demonstrate the feasibility of our approach by implementing a prototype, and we show how it can help reduce the limitations researchers face when creating or reproducing datasets.
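The abstract does not say how the fingerprint is computed, so the following is only a minimal sketch of the general idea under assumed design choices, not the authors' method. It assumes a fingerprint can be derived by hashing a canonical encoding of the extraction query together with a cutoff timestamp on the archive, and it models the append-only archive as an in-memory list. The names `dataset_fingerprint`, `extract`, `archive_cutoff`, and the record fields are all hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(query: dict, archive_cutoff: str) -> str:
    """Hypothetical fingerprint: SHA-256 of a canonical JSON encoding.

    `query` holds the selection criteria and `archive_cutoff` is an
    ISO 8601 UTC timestamp bounding which archived records are visible.
    Both are illustrative assumptions, not taken from the paper.
    """
    payload = {"archive_cutoff": archive_cutoff, "query": query}
    # Sorted keys and fixed separators make the encoding independent of
    # dictionary insertion order, so equal queries hash identically.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def extract(records: list, query: dict, archive_cutoff: str) -> list:
    """Toy extraction over an append-only list of records.

    Records archived after the cutoff are ignored, so later additions
    to the archive cannot change the result for a given cutoff.
    Timestamps are compared as strings, which is valid only because
    they are assumed to share one fixed-width ISO 8601 UTC format.
    """
    selected = [
        r for r in records
        if r["archived_at"] <= archive_cutoff
        and r["language"] == query["language"]
    ]
    # A deterministic order keeps repeated extractions identical.
    return sorted(selected, key=lambda r: r["id"])


if __name__ == "__main__":
    archive = [
        {"id": "a", "language": "Java", "archived_at": "2020-05-01T00:00:00Z"},
        {"id": "b", "language": "Python", "archived_at": "2020-06-01T00:00:00Z"},
    ]
    query = {"language": "Java"}
    cutoff = "2021-01-01T00:00:00Z"
    print(dataset_fingerprint(query, cutoff))
    print(extract(archive, query, cutoff))
```

The sketch relies on the archive being append-only: because existing records never change and anything archived after the cutoff is filtered out, the pair of query and cutoff is enough to select the same records on every run.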

