Inspired by the success of pre-training in natural language processing, pre-trained models for programming languages have been widely adopted in recent years to advance code intelligence. In particular, BERT has been applied to bug localization with impressive results. However, BERT-based bug localization techniques suffer from two issues. First, BERT models pre-trained on source code do not adequately capture the deep semantics of program code. Second, existing bug localization models overlook the need for large-scale negative samples when contrastively learning changeset representations, and they ignore the lexical similarity between bug reports and changesets during similarity estimation. We address these issues by 1) proposing a novel directed, multiple-label code graph representation named Semantic Flow Graph (SFG), which compactly and adequately captures code semantics, 2) designing and training SemanticCodeBERT based on SFG, and 3) designing a novel Hierarchical Momentum Contrastive Bug Localization technique (HMCBL). Evaluation results show that our method achieves state-of-the-art performance in bug localization.
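The role of large-scale negative samples in momentum contrastive learning can be illustrated with a minimal, generic sketch in the style of momentum contrast. It is not the HMCBL implementation: the function names (`momentum_update`, `info_nce_loss`, `make_negative_queue`), the momentum value, the temperature, and the toy vectors are all assumptions chosen for illustration. The idea shown is that a slowly updated key encoder keeps previously encoded changesets usable as negatives, so a fixed-size queue can supply far more negatives than a single mini-batch.

```python
import math
from collections import deque


def momentum_update(query_params, key_params, m=0.999):
    """Move key-encoder weights toward the query encoder as an exponential moving average."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(query, positive_key, negative_keys, temperature=0.07):
    """Contrastive loss: cross-entropy of picking the positive key among all keys."""
    q = _normalize(query)
    logits = [_dot(q, _normalize(positive_key)) / temperature]
    logits += [_dot(q, _normalize(n)) / temperature for n in negative_keys]
    # Numerically stable log-sum-exp over the positive and all negatives.
    mx = max(logits)
    log_sum = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_sum - logits[0]


def make_negative_queue(size):
    """Fixed-size FIFO of past key embeddings; the oldest entries drop out first."""
    return deque(maxlen=size)


# Toy usage with hypothetical 2-d embeddings (bug report = query, changeset = key).
queue = make_negative_queue(size=3)
for past_changeset in ([0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0], [0.5, -1.0]):
    queue.append(past_changeset)  # the first entry is evicted once the queue is full

loss = info_nce_loss(query=[1.0, 0.0], positive_key=[0.9, 0.1], negative_keys=list(queue))
key_weights = momentum_update(query_params=[1.0, 2.0], key_params=[0.0, 0.0], m=0.9)
```

In this sketch the queue length, rather than the batch size, bounds the number of negatives, which is the property the abstract refers to; how HMCBL organizes its negatives hierarchically is not covered here.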