This paper proposes an improved approach to analyzing Mining Software Repositories (MSR) datasets through metadata enrichment, FAIRness assessment, and topic-driven analysis. The work extends an earlier dataset directory, created specifically for the analysis of MSR datasets, by adding new dataset annotations, enriching the metadata categories, and offering more advanced filtering options. Metadata for the MSR papers presented from 2013 to 2024 was gathered using the Semantic Scholar API and analyzed with Latent Dirichlet Allocation (LDA) topic modeling and statistical analysis. The expanded directory incorporates dataset-level attributes, namely repository hosting site, format, accessibility, reusability, and dataset quality. The study reveals that the choice of repository hosting site and data format influences citation patterns and dataset usability. Furthermore, the enhanced annotation approach improves the analysis and discoverability of MSR datasets, supporting more effective reuse and evaluation of research artifacts.
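The abstract names LDA topic modeling but does not say which implementation or settings were used. Purely as an illustration of the technique, the following minimal sketch fits LDA by collapsed Gibbs sampling using only the Python standard library. The function names (`lda_gibbs`, `top_words`), the hyperparameters (`alpha`, `beta`, `iterations`), and the toy corpus are all assumptions made for this example; they are not taken from the study.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative sketch only).

    docs is a list of token lists. Returns (theta, topic_word, vocab), where
    theta[d][k] is the smoothed share of topic k in document d and
    topic_word[k][v] counts how often word v is assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word, vocab


def top_words(topic_word, vocab, n=3):
    """Return the n most frequently assigned words for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]


# Hypothetical toy corpus standing in for tokenized paper abstracts.
docs = [
    "commit history bug fix commit".split(),
    "bug report issue tracker bug".split(),
    "dataset zenodo doi license dataset".split(),
    "dataset github csv license format".split(),
]
theta, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words(topic_word, vocab)):
    print(f"topic {k}: {words}")
```

A real analysis would more likely rely on an established library and on preprocessing steps such as stop-word removal, and the number of topics would need to be chosen and validated for the corpus at hand.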