In recent years, large language model-powered Automated Program Repair (LAPR) techniques have achieved state-of-the-art bug-fixing performance and have been widely applied and studied in both industry and academia. Nonetheless, LLMs have been shown to be highly sensitive to input prompts: slight differences in the expression of semantically equivalent programs can cause repair failures. It is therefore crucial to test the robustness of LAPR techniques before their practical deployment, yet related research is scarce. To this end, we propose MT-LAPR, a Metamorphic Testing framework designed specifically for LAPR techniques, which summarizes nine Metamorphic Relations (MRs) widely recognized by developers across three perturbation levels: token, statement, and block. The proposed MRs are applied to buggy code to generate test cases that are semantically equivalent to the original yet may affect the inference of LAPR. Experiments on two extensively studied bug-fixing datasets, Defects4J and QuixBugs, and four recently released LLMs capable of bug fixing show that, on average, 34.4% to 48.5% of the test cases expose the instability of LAPR techniques, demonstrating the effectiveness of MT-LAPR and uncovering a positive correlation between code readability and the robustness of LAPR techniques. Inspired by these findings, this paper uses the test cases generated by MT-LAPR as samples to train a CodeT5-based code-editing model aimed at improving code readability, and embeds it into the LAPR workflow as a data-preprocessing step. Extensive experiments demonstrate that this approach significantly enhances the robustness of LAPR, by up to 49.32%.
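To make the idea of a metamorphic relation concrete, the following is a minimal illustrative sketch (not the paper's implementation) of a token-level MR: renaming a local identifier yields a program that is semantically equivalent to the buggy original, so a robust repair technique should behave the same on both. The function and variable names below are hypothetical examples.

```python
import re

def rename_identifier(code: str, old: str, new: str) -> str:
    """Token-level perturbation (illustrative MR): rename an identifier.

    Whole-word substitution preserves program semantics, so the variant
    is a valid metamorphic test case for a repair technique.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

# A buggy snippet (the comparison should be `>` for a maximum).
buggy = (
    "def find_max(xs):\n"
    "    m = xs[0]\n"
    "    for x in xs:\n"
    "        if x < m:\n"
    "            m = x\n"
    "    return m\n"
)

# Semantically equivalent variant: same bug, different surface form.
variant = rename_identifier(buggy, "m", "best")
```

An LAPR technique is considered unstable on this pair if it repairs `buggy` but fails to repair `variant` (or vice versa), since the two programs differ only in a variable name.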