Defect Category Prediction Based on Multi-Source Domain Adaptation

From arXiv: 17 pages, in Chinese, 8 figures. (Due to length constraints of the abstract field, please refer to the original PDF for the full abstract.)

In recent years, defect prediction based on deep learning has become a prominent research topic in software engineering. These techniques can identify potential defects without executing the code. However, existing approaches mostly focus on determining whether method-level code contains a defect and cannot classify the specific defect category, which reduces developers' efficiency in locating and fixing defects. Furthermore, in practical software development, new projects often lack sufficient defect data to train high-accuracy deep learning models, and models trained on historical data from existing projects frequently generalize poorly to new projects. To address these problems, this paper first reformulates the traditional binary defect prediction task as a multi-label classification problem, using defect categories described in the Common Weakness Enumeration (CWE) as fine-grained prediction labels. To improve model performance in cross-project scenarios, the paper proposes a multi-source domain adaptation framework that integrates adversarial training and attention mechanisms. Specifically, the framework employs adversarial training to mitigate discrepancies between domains (i.e., software projects) and further uses domain-invariant features to capture feature correlations between each source domain and the target domain. In addition, the framework employs a weighted maximum mean discrepancy as an attention mechanism to minimize the representation distance between source-domain and target-domain features, helping the model learn more domain-independent features. Experiments on 8 real-world open-source projects show that the proposed approach achieves significant performance improvements over state-of-the-art baselines.
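To make the weighted maximum mean discrepancy idea more concrete, the following is a minimal NumPy sketch and not the paper's implementation. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the RBF kernel, the biased MMD estimator, the softmax weighting over negative distances, and the hypothetical names `rbf_kernel`, `mmd2`, and `weighted_mmd`.

```python
import numpy as np


def rbf_kernel(x, y, gamma=1.0):
    """Pairwise RBF kernel matrix between rows of x and rows of y."""
    sq_dist = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clamp tiny negative values caused by floating-point error.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def mmd2(x, y, gamma=1.0):
    """Biased estimate of the squared MMD between two feature samples."""
    return (
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )


def weighted_mmd(sources, target, gamma=1.0):
    """Combine per-source MMD values with attention-like weights.

    Assumption: weights are a softmax over negative distances, so source
    domains whose features lie closer to the target receive larger weights.
    """
    dists = np.array([mmd2(s, target, gamma) for s in sources])
    logits = -dists
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return float(weights @ dists), weights


# Hypothetical feature batches: two source projects and one target project.
rng = np.random.default_rng(0)
target_feats = rng.normal(0.0, 1.0, size=(64, 4))
near_source = rng.normal(0.1, 1.0, size=(64, 4))
far_source = rng.normal(3.0, 1.0, size=(64, 4))

loss, weights = weighted_mmd([near_source, far_source], target_feats, gamma=0.5)
print("weighted MMD loss:", loss)
print("per-source weights:", weights)
```

In a training loop, a term like `loss` would typically be added to the classification and adversarial objectives, so that the feature extractor is pushed toward representations that differ less between projects.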
