Bug localization is a key software development task in which a developer uses a bug report to locate the portion of the source code that must be modified. It is labor-intensive and time-consuming because of the growing size and complexity of modern software. Effectively automating this task can greatly reduce costs by cutting down on developer effort. Researchers have already tried to harness the power of deep learning (DL) to automate bug localization. However, training DL models demands a large quantity of annotated data, and datasets annotated with bug locations are difficult to collect in reasonable quality and quantity. This is an obstacle to the effective use of DL for bug localization. We observe that data pairs for bug detection, which provide only weak, binary buggy-or-not supervision, are much easier to obtain. Inspired by weakly supervised learning, this paper proposes WEakly supervised bug LocaLization (WELL), an approach that transforms bug detectors into bug locators. Using a CodeBERT model fine-tuned on bug detection, WELL can locate bugs in a weakly supervised manner based on the model's attention. Evaluations of WELL on three datasets show performance competitive with existing strongly supervised DL solutions. WELL even outperforms current state-of-the-art (SOTA) models on the variable misuse and binary operator misuse tasks.
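To make the idea of attention-based localization concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a detector has already produced, for one code snippet, the attention weights from its classification token to every input token, one row per attention head. The function name `localize_from_attention`, the head-averaging step, the special-token set, and the toy numbers are all illustrative assumptions; the abstract does not specify how WELL aggregates attention.

```python
from typing import List, Sequence

# Assumed special tokens (RoBERTa-style); they are never valid bug locations.
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}


def localize_from_attention(
    tokens: Sequence[str],
    cls_attention: Sequence[Sequence[float]],
    top_k: int = 1,
) -> List[int]:
    """Rank token positions by head-averaged attention and return the top_k.

    cls_attention[h][i] is the (hypothetical) attention weight that head h
    assigns from the classification token to token i.
    """
    num_heads = len(cls_attention)
    scores = []
    for i, tok in enumerate(tokens):
        if tok in SPECIAL_TOKENS:
            # Exclude special tokens from the ranking.
            scores.append(float("-inf"))
        else:
            # Average the attention this token receives over all heads.
            scores.append(sum(head[i] for head in cls_attention) / num_heads)
    # Highest score first; ties keep their original left-to-right order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Toy example: a variable-misuse style snippet where the second `a` is suspect.
tokens = ["<s>", "if", "a", ">", "a", ":", "</s>"]
cls_attention = [
    [0.30, 0.05, 0.10, 0.05, 0.40, 0.05, 0.05],  # made-up weights, head 0
    [0.20, 0.10, 0.10, 0.10, 0.30, 0.10, 0.10],  # made-up weights, head 1
]
predicted = localize_from_attention(tokens, cls_attention, top_k=1)
```

In this toy input the position with the highest averaged attention is index 4, so the sketch flags the second `a`. The only supervision the detector would have needed is the binary buggy-or-not label; the location is read off the attention afterwards.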