Empirical studies of research software are hard to compare because the literature operationalizes ``research software'' inconsistently. Motivated by the research software supply chain (RSSC) and its security risks, we introduce an RSSC-oriented taxonomy that makes scope and operational boundaries explicit for empirical studies of research software security. We conduct a targeted scoping review of recent repository mining and dataset construction studies, extracting each work's definition, inclusion criteria, unit of analysis, and identification heuristics. We synthesize these into a harmonized taxonomy and a mapping that translates prior approaches into shared taxonomy dimensions. We operationalize the taxonomy on a large community-curated corpus from the Research Software Encyclopedia (RSE), producing an annotated dataset, a labeling codebook, and a reproducible labeling pipeline. Finally, we apply OpenSSF Scorecard as a preliminary security analysis to show how repository-centric security signals differ across taxonomy-defined clusters and why taxonomy-aware stratification is necessary for interpreting RSSC security measurements.