Modern Deep Learning (DL) architectures based on transformers (e.g., BERT, RoBERTa) have improved performance across a number of natural language tasks. While such DL models have shown tremendous potential for software engineering applications, they are often hampered by insufficient training data. Applications that require project-specific data are particularly constrained. One example is bug localization, which aims to recommend code to fix a newly submitted bug report. Deep learning models for bug localization require a substantial training set of fixed bug reports, which are in limited supply even in popular and actively developed software projects. In this paper, we examine the effect of using synthetic training data on transformer-based DL models that perform a more complex variant of bug localization, whose goal is to retrieve the bug-inducing changesets for each bug report. To generate high-quality synthetic data, we propose novel data augmentation operators that act on different constituent components of bug reports. We also describe a data balancing strategy that aims to create a corpus of augmented bug reports that better reflects the entire source code base, because existing bug reports used as training data usually reference only a small part of the code base.