Automated Program Repair (APR) proposes bug fixes to help developers maintain software. The state of the art in this domain centers on large language models (LLMs), leveraging their strong ability to comprehend natural-language specifications and to generate program code. However, despite the APR community's research achievements and industry deployments, APR techniques still do not generalize broadly. In this work, we present an intensive empirical evaluation of LLMs' capabilities in APR. We evaluate a diverse set of 13 recent open and closed models. In particular, we explore language-agnostic repair using benchmarks for Java, JavaScript, Python, and PHP. Besides generalization across languages and levels of patch complexity, we also investigate the effects of fault localization (FL). Our key results include: (1) Different LLMs tend to perform best for different languages, which makes it hard to develop cross-platform, single-LLM repair techniques. (2) Combining models by pooling their repairs adds value because individual models contribute uniquely fixed bugs, so a committee of expert models should be considered. (3) Under realistic assumptions of imperfect FL, we observe significant drops in accuracy compared with the usual practice of assuming perfect FL. Our insights will help develop reliable and generalizable APR techniques and evaluate them in realistic and fair environments.
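Result (2) concerns pooling repairs across models. The following is a minimal sketch of how pooled and uniquely fixed bugs could be computed, not the evaluation code used in this work. The function names `pool_fixes` and `unique_fixes`, the model names, and the bug identifiers are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Set


def pool_fixes(fixed_by_model: Dict[str, Set[str]]) -> Set[str]:
    """Return the bugs fixed by at least one model (the pooled result)."""
    pooled: Set[str] = set()
    for fixed in fixed_by_model.values():
        pooled |= fixed
    return pooled


def unique_fixes(fixed_by_model: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each model, return the bugs that no other model fixed."""
    unique: Dict[str, Set[str]] = {}
    for model, fixed in fixed_by_model.items():
        fixed_by_others: Set[str] = set()
        for other, other_fixed in fixed_by_model.items():
            if other != model:
                fixed_by_others |= other_fixed
        unique[model] = fixed - fixed_by_others
    return unique


# Made-up example data: which (hypothetical) bugs each (hypothetical) model fixed.
results = {
    "model_a": {"bug-1", "bug-2"},
    "model_b": {"bug-2", "bug-3"},
    "model_c": {"bug-2"},
}

pooled = pool_fixes(results)
unique = unique_fixes(results)
```

In this toy data, the pool covers three bugs while no single model fixes more than two, which is the kind of gain a committee of models would be expected to provide.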