Bug localization, which helps programmers identify the location of bugs in source code, is an essential task in software development. Researchers have already made efforts to harness powerful deep learning (DL) techniques to automate it. However, training a bug localization model is usually challenging because it requires a large quantity of data labeled with each bug's exact location, which is difficult and time-consuming to collect. By contrast, bug detection data with binary labels, indicating only whether the source code contains a bug, is much simpler to obtain. This paper proposes WELL, a WEakly supervised bug LocaLization method that uses only bug detection data with binary labels to train a bug localization model. With CodeBERT fine-tuned on the buggy-or-not binary labeled data, WELL can address bug localization in a weakly supervised manner. Evaluations on three method-level synthetic datasets and one file-level real-world dataset show that WELL is significantly better than the existing state-of-the-art (SOTA) model in typical bug localization tasks such as variable misuse and other programming bugs.
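To make the idea of weak supervision concrete, the following is a minimal toy sketch, not WELL itself. It assumes a simple bag-of-tokens logistic model in place of CodeBERT, and the dataset and the function names (`train`, `predict`, `localize`) are invented for illustration. The model sees only buggy-or-not labels during training, yet the per-token scores it learns can afterwards be read as a guess at where the bug is.

```python
import math

# Toy dataset (invented for illustration): token lists with only a binary label,
# 1 = contains a bug, 0 = clean. No bug positions are given anywhere.
DATA = [
    (["i", "+=", "1"], 0),
    (["i", "=+", "1"], 1),
    (["if", "x", "is", "None", ":"], 0),
    (["if", "x", "=", "None", ":"], 1),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, tokens):
    """Probability that the whole snippet is buggy (sum-pooled token scores)."""
    return sigmoid(bias + sum(weights.get(t, 0.0) for t in tokens))


def train(data, epochs=200, lr=0.5):
    """Fit per-token scores with logistic loss, using binary labels only."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for tokens, label in data:
            # Gradient of the logistic loss with respect to the snippet logit.
            error = predict(weights, bias, tokens) - label
            bias -= lr * error
            for t in tokens:
                weights[t] = weights.get(t, 0.0) - lr * error
    return weights, bias


def localize(weights, tokens):
    """Return the index of the token with the highest learned score."""
    scores = [weights.get(t, 0.0) for t in tokens]
    return max(range(len(tokens)), key=scores.__getitem__)


weights, bias = train(DATA)
for tokens, label in DATA:
    if label == 1:
        print(tokens, "-> suspicious token:", tokens[localize(weights, tokens)])
```

On this toy data the highest-scoring tokens in the two buggy snippets are `=+` and `=`, even though no position labels were used. A bag-of-tokens model cannot capture context-dependent bugs such as variable misuse, which is one reason a contextual encoder like CodeBERT is the more plausible backbone in practice.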