Background: Machine learning algorithms are widely used to predict defect-prone software components. In this literature, computational experiments are the main means of evaluation, and the credibility of results depends on the quality of experimental design and reporting.

Objective: This paper audits recent software defect prediction (SDP) studies by assessing their experimental design, analysis, and reporting practices against accepted norms from statistics, machine learning, and empirical software engineering. The aim is to characterise current practice and assess the reproducibility of published results.

Method: We audited SDP studies indexed in SCOPUS between 2019 and 2023, focusing on design and analysis choices such as outcome measures, out-of-sample validation strategies, and the use of statistical inference. Each study was evaluated against nine categories of potential issue. Reproducibility was assessed using the instrument proposed by González Barahona and Robles.

Results: The search identified approximately 1,585 SDP experiments published during the period. From these, we randomly sampled 101 papers (61 journal and 40 conference publications), almost 50 percent of which were behind paywalls. We observed substantial variation in research practice: the number of datasets ranged from 1 to 365, learners or learner variants from 1 to 34, and performance measures from 1 to 9. About 45 percent of studies applied formal statistical inference. Across the sample we identified 427 issues, with a median of four per paper; only one paper was issue-free. Reproducibility ranged from near complete to severely limited. We also identified two cases of tortured phrases and possible paper-mill activity.

Conclusions: Experimental design and reporting practices vary widely, and almost half of the studies provide insufficient detail to support reproduction. The audit indicates substantial scope for improvement.