Software is an important tool for scholarly work, but software produced for research is often difficult to identify or discover. A potential first step in linking research and software is software identification. In this paper, we present two datasets for studying the identification and production of research software. The first dataset contains almost 1,000 human-labeled annotations of software production from research projects funded by the National Science Foundation (NSF). We use this dataset to train models that predict software production. The second dataset is created by applying the trained predictive models to the abstracts and project outcomes reports of all NSF-funded projects between 2010 and 2023. The result is an inferred dataset of software production for over 150,000 NSF awards. We release the Soft-Search dataset to aid in identifying and understanding research software production: https://github.com/si2-urssi/eager
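The two-stage workflow described above (train on a small labeled set, then infer labels for a much larger unlabeled corpus) can be illustrated with a minimal sketch. This is not the model used to build Soft-Search: the classifier here is a toy multinomial naive Bayes written with the Python standard library only, and the example texts, the award identifiers, and the function names `tokenize`, `train`, and `predict` are all hypothetical and chosen purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(examples):
    """Fit a toy naive Bayes model from (text, produces_software) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return True if the text is predicted to describe software production."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # tokens unseen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        scores[label] = score
    return scores[True] > scores[False]


# Stage 1: a tiny, invented stand-in for the human-labeled dataset.
LABELED = [
    ("We will release an open-source Python package and code repository "
     "for the analysis.", True),
    ("The project develops a software toolkit and publishes source code "
     "on GitHub.", True),
    ("This award supports a workshop and travel for graduate students.",
     False),
    ("The study conducts field surveys and interviews with community "
     "members.", False),
]
model = train(LABELED)

# Stage 2: apply the trained model to unlabeled texts (hypothetical IDs).
UNLABELED = {
    "award-0001": "The team will build and release open-source software "
                  "for data analysis.",
    "award-0002": "Funds support travel to a workshop for students.",
}
predictions = {award_id: predict(model, text)
               for award_id, text in UNLABELED.items()}
print(predictions)
```

In practice, a dataset of this kind would call for a stronger text classifier and a held-out evaluation split; the sketch only shows the shape of the train-then-infer pipeline.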