Bug localization is the task of identifying, from a bug report written in natural language, the source code files, written in a programming language, that are responsible for the unexpected behavior of the software. Because bug localization is labor-intensive, bug localization models are employed to assist software developers. Owing to the domain difference between source code files and bug reports, modern bug localization systems based on deep learning rely heavily on embedding techniques that project bug reports and source code files into a shared vector space. Creating an embedding involves several design choices, but the impact of these choices on the quality of the embedding and on the performance of bug localization models has not been explained in existing research. To address this gap, we evaluated 14 distinct embedding models to gain insight into the effects of the various design choices. We then built bug localization models on top of these embedding models to assess how the choices influence localization performance. Our findings indicate that the pre-training strategy significantly affects the quality of the embedding. Moreover, we found that the familiarity of the embedding model with the data has a notable impact on the performance of the bug localization model. In particular, when the training and testing data are collected from different projects, the performance of the bug localization models fluctuates substantially.
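To make the idea of a shared vector space concrete, the following is a minimal sketch of embedding-based retrieval: a bug report and each source file are mapped to vectors, and files are ranked by cosine similarity to the report. It is not the study's method. The `embed` function here is a toy bag-of-words stand-in for a learned embedding model, and the names `embed`, `cosine`, `rank_files`, and the example file contents are all hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a learned embedding: a sparse bag-of-words count vector.

    A real system would use a pre-trained neural encoder here (assumption).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_files(bug_report, files):
    """Return file paths ordered from most to least similar to the bug report."""
    report_vec = embed(bug_report)
    scored = [(cosine(report_vec, embed(src)), path) for path, src in files.items()]
    # Sort by descending similarity; break ties by path for determinism.
    return [path for _, path in sorted(scored, key=lambda item: (-item[0], item[1]))]


if __name__ == "__main__":
    # Hypothetical project contents and bug report.
    files = {
        "parser.py": "def parse_date(text): return datetime.strptime(text, fmt)",
        "cache.py": "class Cache: def evict(self): self.store.clear()",
    }
    report = "Crash when parse date text with wrong format"
    print(rank_files(report, files))
```

Replacing `embed` with a different encoder while leaving `cosine` and `rank_files` unchanged is, loosely, the kind of substitution an embedding comparison relies on: the ranking step stays fixed and only the vector representation varies.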