Language models have improved by orders of magnitude with the recent emergence of Transformer-based Large Language Models (LLMs). LLMs have demonstrated the ability to generate natural code that closely resembles code written by professional developers. One intermediate value an LLM can emit is entropy, which measures the naturalness of a token of code. We hypothesize that entropy can be used to improve the performance of Automated Program Repair (APR) tasks. Although much progress has been made in APR, three problems remain: fault localization techniques suffer from a lack of diversity in ranking scores; patch generation tools tend to be inefficient, because all tests must run before a patch can be judged likely to be correct; and patch ranking often suffers from the test-suite overfitting problem. Using an LLM directly for APR, meanwhile, raises concerns about training data leakage. In this work, we introduce a novel way of using the entropy of LLMs in combination with prior APR tools to improve all stages of APR. We show that entropy is highly complementary to prior fault localization tools. Our proposed re-ranking method achieves a 50% Top-5 score improvement over spectrum-based fault localization (SBFL). We propose a patch-naturalness measurement, entropy-delta, to improve the efficiency of template-based repair techniques by ranking plausible patches before they undergo testing. When entropy-delta is used for patch ranking and classification, our proposed method ranks correct patches more effectively than state-of-the-art machine learning tools, with a 49% improvement in Top-1. Our work suggests that LLMs can be an effective complement to prior APR tools while minimizing both the test-suite overfitting problem and the LLM data leakage problem.
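To make the entropy-delta idea concrete, the following is a minimal sketch and not the implementation used in this work. It assumes that the entropy of a token is its surprisal, -log2 p, under the LLM, that the entropy of a code fragment is the mean over its tokens, and that entropy-delta is the entropy of the original fragment minus the entropy of the patched fragment, so that a positive value means the patch made the code more natural. The function names and the probability values are hypothetical; in practice the probabilities would come from an LLM's output distribution.

```python
import math
from typing import List, Mapping, Sequence


def token_entropy(prob: float) -> float:
    """Surprisal of one token in bits, given the model's probability for it."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def mean_entropy(token_probs: Sequence[float]) -> float:
    """Mean per-token entropy of a code fragment (assumed aggregation)."""
    if not token_probs:
        raise ValueError("fragment must contain at least one token")
    return sum(token_entropy(p) for p in token_probs) / len(token_probs)


def entropy_delta(original_probs: Sequence[float],
                  patched_probs: Sequence[float]) -> float:
    """Assumed definition: entropy before the patch minus entropy after it.

    Positive values mean the patched code is more natural to the model.
    """
    return mean_entropy(original_probs) - mean_entropy(patched_probs)


def rank_patches(original_probs: Sequence[float],
                 candidates: Mapping[str, Sequence[float]]) -> List[str]:
    """Order candidate patches by entropy-delta, largest reduction first."""
    return sorted(
        candidates,
        key=lambda name: entropy_delta(original_probs, candidates[name]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical token probabilities for a buggy line and two patches.
    buggy = [0.25, 0.25]
    patches = {"patch_a": [0.125, 0.125], "patch_b": [0.5, 0.5]}
    for name in rank_patches(buggy, patches):
        print(name, entropy_delta(buggy, patches[name]))
```

Because this ranking needs only model probabilities, it can be computed before any test is executed, which is the efficiency argument made for ranking plausible patches ahead of testing.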