Datasets that collect software mentions from scholarly publications can be used to study both the software used in published research and the practice of software citation. Several new software mention datasets with differing characteristics have recently been published. We present an approach for assessing the usability of such datasets for research on research software. The approach comprises sampling and data preparation, manual annotation of mention quality and mention characteristics, and analysis of the annotations. We applied it to two software mention datasets for an evaluation based on qualitative observation. In doing so, we identified several challenges to using the selected datasets for research. The main issues concern the structure of the datasets, the quality of the extracted mentions (54% and 23% of mentions, respectively, do not refer to software), and software accessibility. One dataset provides no links to the mentioned software at all, while the other provides them in a way that can impede quantitative research: (1) Links may come from different sources, each pointing to different software for the same mention. (2) The quality of the automatically retrieved links is generally poor (in our sample, 65.4% link to the wrong software). (3) Links exist for only a small subset of mentions (20.5% in our sample), which may lead to skewed or disproportionate samples. However, the greatest challenge and underlying issue in working with software mention datasets is the still suboptimal practice of software citation: Software should not be mentioned; it should be cited following the software citation principles.