Large Language Models (LLMs) have gained increasing attention and demonstrated promising performance across a variety of Software Engineering (SE) tasks, such as Automated Program Repair (APR), code summarization, and code completion. For example, ChatGPT, the latest black-box LLM, has been investigated in numerous recent studies and has shown impressive performance on various tasks. However, there is a potential risk of data leakage, since these LLMs are usually closed-source and their training details, e.g., pre-training datasets, are unknown. In this paper, we review the bug-fixing capabilities of ChatGPT on a clean APR benchmark with different research objectives. We first introduce {\benchmark}, a new benchmark of buggy programs and their corresponding fixed versions, drawn from competitive programming problems from 2023 onward, after the training cutoff point of ChatGPT. The results on {\benchmark} show that ChatGPT fixes 109 out of 151 buggy programs using the basic prompt within 35 independent rounds, outperforming the state-of-the-art LLMs CodeT5 and PLBART by 27.5\% and 62.4\% in prediction accuracy, respectively. We also investigate the impact of three types of prompts, i.e., problem description, error feedback, and bug localization, which leads to 34 additional fixed bugs. Besides, we discuss the interactive nature of ChatGPT to illustrate the capacity of a dialog-based repair workflow, which fixes 9 additional bugs. Inspired by the findings, we further pinpoint various challenges and opportunities for advanced SE studies equipped with such LLMs (e.g.,~ChatGPT) in the near future. More importantly, our work calls for more research on reevaluating the achievements obtained by existing black-box LLMs across various SE tasks, not limited to ChatGPT on APR.