MIRAGE: Metadata-Integrated Repository Analysis and Guided Enhancement for MSR Datasets

This paper proposes an improved approach to analyzing Mining Software Repositories (MSR) datasets through metadata enrichment, FAIRness assessment, and topic-driven analysis. The work extends an earlier dataset directory, built specifically for analyzing MSR datasets, by adding new dataset annotations, enriching the metadata categories, and offering more advanced filtering options. Metadata for the MSR papers presented from 2013 to 2024 was gathered using the Semantic Scholar API. The analysis is based on Latent Dirichlet Allocation (LDA) topic modeling and statistical analysis. Dataset-level attributes were added to the expanded dataset directory, namely repository hosting site, format, accessibility, reusability, and dataset quality. The study reveals that the choice of repository hosting site and data format influences citation patterns and dataset usability. Furthermore, the enhanced annotation approach improves the analysis and discoverability of MSR datasets, supporting more effective reuse and evaluation of research artifacts.
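To make the idea of attribute-based filtering concrete, the following is a minimal sketch of how a directory annotated with dataset-level attributes might be queried. It is not the authors' implementation. The `DatasetEntry` schema, the `filter_entries` helper, and every record in `DIRECTORY` are hypothetical placeholders chosen for illustration. Only the attribute names (hosting site, format, accessibility) come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    # Hypothetical schema: the fields echo attributes named in the abstract,
    # but the real directory's structure is not specified there.
    name: str
    hosting_site: str
    data_format: str
    accessible: bool
    year: int


def filter_entries(entries, **criteria):
    """Return the entries whose attributes equal every given criterion."""
    return [
        entry
        for entry in entries
        if all(getattr(entry, key) == value for key, value in criteria.items())
    ]


# Placeholder records for illustration only; they describe no real dataset.
DIRECTORY = [
    DatasetEntry("placeholder-a", "Zenodo", "CSV", True, 2019),
    DatasetEntry("placeholder-b", "GitHub", "JSON", True, 2021),
    DatasetEntry("placeholder-c", "Zenodo", "SQL", False, 2016),
]

if __name__ == "__main__":
    # Combine two attribute filters: hosting site and accessibility.
    for entry in filter_entries(DIRECTORY, hosting_site="Zenodo", accessible=True):
        print(entry.name)
```

Passing criteria as keyword arguments keeps the sketch open to further attributes such as reusability or dataset quality without changing the filter function, assuming those are stored as plain fields.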

