Investigating Reproducibility in Deep Learning-Based Software Fault Prediction

Over the past few years, deep learning methods have been applied to a wide range of Software Engineering (SE) tasks, in particular to the important task of automatically predicting and localizing faults in software. With the rapid adoption of increasingly complex machine learning models, however, it is becoming more and more difficult for scholars to reproduce the results reported in the literature. This is particularly the case when the applied deep learning models and the evaluation methodology are not properly documented and when code and data are not shared. Given some recent and very worrying findings regarding reproducibility and progress in other areas of applied machine learning, the goal of this work is to analyze to what extent the field of software engineering, and software fault prediction in particular, is plagued by similar problems. We therefore conducted a systematic review of the current literature and examined the level of reproducibility of 56 research articles published between 2019 and 2022 in top-tier software engineering conferences. Our analysis revealed that scholars are apparently largely aware of the reproducibility problem: about two thirds of the papers provide code for their proposed deep learning models. In the vast majority of cases, however, crucial elements for reproducibility are missing, such as the code of the compared baselines, the code for data pre-processing, or the code for hyperparameter tuning. In these cases, it remains challenging to exactly reproduce the results reported in the current research literature. Overall, our meta-analysis calls for improved research practices to ensure the reproducibility of machine-learning-based research.
