Revisiting Unnaturalness for Automated Program Repair in the Era of Large Language Models

Language models have improved by orders of magnitude with the recent emergence of Transformer-based Large Language Models (LLMs). LLMs have demonstrated their ability to generate natural code that is highly similar to code written by professional developers. One intermediate value an LLM can emit is entropy, which measures the naturalness of a token of code. We hypothesize that entropy can be used to improve the performance of Automated Program Repair (APR) tasks. While much progress has been made in APR, three problems remain: fault localization techniques suffer from a lack of diversity in ranking scores, patch generation tools tend to be inefficient because all tests must run before it can be determined whether a patch is likely to be correct, and patch ranking often suffers from the test-suite overfitting problem. Using an LLM directly for APR, however, raises concerns about training data leakage. In this work, we introduce a novel way of using the entropy of LLMs in combination with prior APR tools to improve all stages of APR. We show that entropy is highly complementary to prior fault localization tools. Our proposed re-ranking method achieves a 50% Top-5 score improvement over SBFL. We propose a patch-naturalness measurement, entropy-delta, to improve the efficiency of template-based repair techniques by ranking plausible patches before they undergo testing. When using entropy-delta for patch ranking and classification, our proposed method can rank correct patches more effectively than state-of-the-art machine learning tools, with a 49% improvement in Top-1. Our work suggests that LLMs can be an effective addition to complement prior APR tasks while minimizing both the test-suite overfitting problem and the LLM data leakage problem.
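As a rough illustration of the idea of ranking patches by naturalness, the sketch below computes a per-token Shannon entropy and an entropy-delta between an original line and a candidate patch. The abstract does not give the exact formulas, so everything here is an assumption: the mean-over-tokens aggregation, the sign convention (original minus patched, so a larger value means the patch looks more natural), and the names `token_entropy`, `mean_entropy`, `entropy_delta`, and `rank_patches` are all hypothetical. The probability distributions are hand-written stand-ins for what an LLM would emit.

```python
import math
from typing import Dict, List, Sequence

Distribution = Sequence[float]  # next-token probabilities at one position


def token_entropy(probs: Distribution) -> float:
    """Shannon entropy in bits of one next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def mean_entropy(distributions: Sequence[Distribution]) -> float:
    """Average token entropy over a line of code (assumed aggregation)."""
    if not distributions:
        return 0.0
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def entropy_delta(original: Sequence[Distribution],
                  patched: Sequence[Distribution]) -> float:
    """Assumed convention: positive means the patch lowered entropy."""
    return mean_entropy(original) - mean_entropy(patched)


def rank_patches(original: Sequence[Distribution],
                 candidates: Dict[str, Sequence[Distribution]]) -> List[str]:
    """Order candidate patches from most to least entropy reduction."""
    return sorted(candidates,
                  key=lambda name: entropy_delta(original, candidates[name]),
                  reverse=True)


# Toy distributions, not real model output.
original_line = [[0.25, 0.25, 0.25, 0.25], [0.5, 0.5]]
candidates = {
    "patch_a": [[1.0], [0.5, 0.5]],
    "patch_b": [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],
}
ranking = rank_patches(original_line, candidates)
```

In this toy setup `patch_a` reduces the mean entropy of the line and `patch_b` increases it, so `patch_a` is ranked first. Such an ordering could be computed before any test is executed, which is the efficiency argument the abstract makes.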

