Self-Bootstrapping Automated Program Repair: Using LLMs to Generate and Evaluate Synthetic Training Data for Bug Repair

This paper presents a novel methodology for enhancing Automated Program Repair (APR) through synthetic data generation using Large Language Models (LLMs). Current APR systems are constrained by the limited availability of high-quality training data covering diverse bug types across multiple programming languages. The proposed approach addresses this limitation through a two-phase process: synthetic sample generation followed by rigorous quality assessment. Multiple state-of-the-art LLMs were employed to generate approximately 30,000 paired examples of buggy and fixed code across 12 programming languages and 13 bug categories. These samples then underwent cross-model evaluation against five criteria: correctness, code quality, security, performance, and completeness. Experimental evaluation on the VulRepair test set showed statistically significant improvements in Perfect Prediction rates: the quality-filtered synthetic dataset achieved 17.18% (Top@1) and 23.00% (Top@5), compared to the baseline's 11.68% and 18.88%, respectively, a relative improvement of 47% in Top@1 and 22% in Top@5. The results were validated through statistical testing, including ANOVA and post-hoc Tukey's Honest Significant Difference analysis. Furthermore, the best-performing configurations surpassed existing systems despite using a less computationally intensive decoding strategy. This research establishes a self-bootstrapping paradigm in which LLMs generate and evaluate their own training data, suggesting promising directions for addressing data scarcity in similar software engineering tasks and for developing robust, adaptable tools for automated code maintenance.
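As a rough illustration of the quality-assessment phase and of how the relative improvements follow from the reported rates, a minimal Python sketch is given below. The five criteria and the four Perfect Prediction rates come from the abstract; the 0-10 score scale, the mean-score rule, the 8.0 threshold, and the names `Sample`, `passes_filter`, and `relative_improvement` are assumptions made for illustration only and are not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict

# The five evaluation criteria named in the abstract.
CRITERIA = ("correctness", "code_quality", "security", "performance", "completeness")


@dataclass
class Sample:
    """One synthetic training pair with its evaluator scores (hypothetical layout)."""
    buggy: str
    fixed: str
    scores: Dict[str, float]  # assumed scale: 0 (worst) to 10 (best) per criterion


def passes_filter(sample: Sample, threshold: float = 8.0) -> bool:
    """Keep a sample if its mean score over all criteria meets the threshold.

    Both the mean-score rule and the threshold value are assumptions.
    """
    return mean(sample.scores[c] for c in CRITERIA) >= threshold


def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Two made-up samples: one scored highly on every criterion, one weak on correctness.
samples = [
    Sample("if (i <= n)", "if (i < n)", {c: 9.0 for c in CRITERIA}),
    Sample("free(p); use(p);", "use(p); free(p);",
           {**{c: 9.0 for c in CRITERIA}, "correctness": 3.0}),
]
kept = [s for s in samples if passes_filter(s)]
print(f"kept {len(kept)} of {len(samples)} samples")

# Perfect Prediction rates reported in the abstract.
print(f"Top@1: {relative_improvement(17.18, 11.68):.1f}% relative improvement")
print(f"Top@5: {relative_improvement(23.00, 18.88):.1f}% relative improvement")
```

The second sample is dropped because its mean score (7.8) falls below the assumed threshold, and the two computed improvements round to the 47% and 22% stated above.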
