Research data are often released upon journal publication to enable result verification and reproducibility. For that reason, research dissemination infrastructures typically support diverse datasets from numerous disciplines, ranging from tabular data and program code to audio-visual files. Metadata, or data about data, is critical to making research outputs adequately documented and FAIR (findable, accessible, interoperable, and reusable). To contribute to the discussion on metadata development for research outputs, I conduct an exploratory analysis of how research datasets cluster based on what researchers organically deposit together. The sample for the cluster analysis is the content of over 40,000 datasets from the Harvard Dataverse research data repository. I find that the majority of the clusters are formed by single-type datasets, while no meaningful clusters can be identified in the rest of the sample. To interpret the results, I use the metadata standard employed by DataCite, a leading organization for documenting the scholarly record, and map its existing resource types to my results. About 65% of the sample can be described with a single resource type (such as Dataset, Software, or Report), while the rest would require an aggregate type. Though DataCite supports an aggregate type, Collection, I argue that a significant number of datasets, in particular those containing both data and code files (about 20% of the sample), would be more accurately described by a Replication resource type. Such a resource type would be particularly useful in facilitating research reproducibility.
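The mapping from deposited files to a resource type can be illustrated with a minimal sketch. It is not the method used in the analysis: the extension-to-category table, the function names, and the rule that any dataset containing both data and code files counts as Replication are all assumptions made for illustration. Replication is the resource type proposed here, not an existing DataCite value.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-category table (an assumption for illustration;
# the categories used in the actual analysis are not specified here).
EXTENSION_CATEGORY = {
    ".csv": "data", ".tab": "data", ".dta": "data", ".xlsx": "data",
    ".r": "code", ".py": "code", ".do": "code", ".m": "code",
    ".pdf": "document", ".docx": "document", ".txt": "document",
    ".mp3": "audiovisual", ".mp4": "audiovisual", ".wav": "audiovisual",
}

# Resource type suggested when a dataset holds files of one category only.
SINGLE_TYPE_RESOURCE = {
    "data": "Dataset",
    "code": "Software",
    "document": "Report",
    "audiovisual": "Audiovisual",
}


def file_categories(filenames):
    """Return the set of file categories present in a dataset."""
    categories = set()
    for name in filenames:
        extension = PurePosixPath(name).suffix.lower()
        categories.add(EXTENSION_CATEGORY.get(extension, "other"))
    return categories


def suggest_resource_type(filenames):
    """Suggest a resource type for a dataset from its file names.

    Returns None for an empty dataset.
    """
    categories = file_categories(filenames)
    if not categories:
        return None
    if len(categories) == 1:
        (only,) = categories
        return SINGLE_TYPE_RESOURCE.get(only, "Other")
    # Assumed rule: data and code deposited together suggest a replication package.
    if {"data", "code"} <= categories:
        return "Replication"
    # Any other mix falls back to the existing aggregate type.
    return "Collection"
```

For example, `suggest_resource_type(["analysis.R", "survey.csv", "readme.txt"])` returns `"Replication"` under these assumptions, while a dataset holding only `.csv` and `.tab` files maps to `"Dataset"`.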