Code revert prediction, a specialized form of software defect detection, aims to predict the likelihood that a code change will be reverted or rolled back during software development. The task matters in practice: by identifying the code changes most prone to being reverted, developers and project managers can act proactively to prevent issues, improve code quality, and optimize development processes. Compared with code defect detection, however, code revert prediction has rarely been studied. In addition, many previous defect detection methods rely on independent features and ignore the relationships between code scripts. An industry setting introduces further challenges, such as company regulations, limited features, and a large-scale codebase. To address these limitations, this paper presents a systematic empirical study of code revert prediction that integrates the code import graph with code features. We implement several strategies for handling anomalies and data imbalance, including graph neural networks combined with imbalanced classification and anomaly detection. To compare these approaches comprehensively, we conduct experiments on real-world, extremely imbalanced code commit data from J.P. Morgan Chase.