Buggin: Automatic intrinsic bugs classification model using NLP and ML

Recent studies have shown that bugs can be categorized as intrinsic or extrinsic. Intrinsic bugs can be traced back to specific changes in the version control system (VCS), whereas extrinsic bugs originate from changes external to the VCS and have no direct bug-inducing change. Training bug prediction models on intrinsic bugs only has been reported to improve the performance of such models. However, there is currently no automated approach for identifying intrinsic bugs. To bridge this gap, our study employs Natural Language Processing (NLP) techniques to identify intrinsic bugs automatically. Specifically, we apply two embedding techniques, seBERT and TF-IDF, to the title and description text of bug reports. The resulting embeddings are fed into well-established machine learning algorithms such as Support Vector Machine, Logistic Regression, Decision Tree, Random Forest, and K-Nearest Neighbors. The primary objective of this paper is to assess how well various NLP and machine learning techniques identify intrinsic bugs from the textual information in bug reports. The results show that both seBERT and TF-IDF can be used effectively for intrinsic bug identification. The highest performance was achieved by combining TF-IDF with the Decision Tree algorithm on bug titles (F1 score of 78%), closely followed by seBERT with Support Vector Machine on bug titles (F1 score of 77%). In summary, this paper introduces an approach that automates the identification of intrinsic bugs using textual information from bug reports.
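
The pipeline described above (vectorize bug report text, then pass the vectors to a classifier) can be illustrated with a minimal, dependency-free sketch. It pairs TF-IDF with K-Nearest Neighbors, one of the listed algorithms, using cosine similarity. Everything specific here is an assumption made for illustration and is not taken from the paper: the whitespace tokenizer, the smoothed IDF formula, the value of `k`, the helper names, and above all the toy titles and labels, which are invented. In practice one would more likely use a library such as scikit-learn rather than hand-written helpers.

```python
import math
from collections import Counter

# Invented toy data: bug titles with hypothetical intrinsic/extrinsic labels.
TITLES = [
    "NullPointerException in parser after refactoring commit",
    "Regression: off-by-one error introduced in loop bounds",
    "Wrong return value in cache lookup since last merge",
    "Build fails after upstream library API change",
    "Crash on new operating system version",
    "Requirement changed: export format no longer accepted by external service",
]
LABELS = ["intrinsic", "intrinsic", "intrinsic",
          "extrinsic", "extrinsic", "extrinsic"]


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, very simple tokenizer)."""
    return text.lower().split()


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every training term."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf(text, idf):
    """Return an L2-normalized sparse TF-IDF vector; unseen terms are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(title, train_vecs, train_labels, idf, k=3):
    """Majority vote among the k training titles most similar to `title`."""
    query = tfidf(title, idf)
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


idf = fit_idf(TITLES)
train_vecs = [tfidf(t, idf) for t in TITLES]
print(knn_predict("off-by-one error in parser loop", train_vecs, LABELS, idf))
```

With only six invented titles the prediction says nothing about real-world accuracy; the sketch is meant only to show the shape of a title-based classifier. Swapping in a different classifier or an embedding model such as seBERT would change how the vectors are produced or consumed, not the overall flow.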
