Sequence-to-sequence models can transform erroneous programs into correct ones when trained on a sufficiently large dataset. Recent studies have also provided strong empirical evidence that code review can further improve program repair. Large language models trained on both natural language (NL) and programming language (PL) data can contain inherent knowledge of both. In this study, we investigate whether this inherent knowledge of PL and NL can be used to improve automated program repair. We applied PLBART and CodeT5, two state-of-the-art language models pre-trained on both PL and NL, to two natural language-based program repair datasets. We found that the pre-trained models, fine-tuned on datasets containing both code reviews and the subsequent code changes, notably outperformed each of the previous models. With the advent of code generative models such as Codex and GPT-3.5-Turbo, we also performed zero-shot and few-shot prompt engineering to assess their performance on these datasets. However, our manual analysis of the repaired code generated by these models indicates that the practical application of LLMs to automated program repair is still a long way off.