Program repair techniques offer cost-saving benefits for debugging in both software development and programming education. Given the proven effectiveness of Large Language Models (LLMs) in code-related tasks, researchers have explored their potential for program repair. However, existing repair benchmarks may have been included in LLM training data, potentially causing data leakage. To evaluate LLMs' realistic repair capabilities, (1) we introduce an extensive, non-crawled benchmark, referred to as TutorCode, comprising 1,239 defective C++ programs together with associated information such as tutor guidance, solution descriptions, failing test cases, and the corrected code. Our work assesses the repair performance of 12 LLMs on TutorCode, measuring repair correctness (TOP-5 and AVG-5) and patch precision (RPSR). (2) We then provide a comprehensive investigation into which types of extra information help LLMs improve their defect-repair performance. Among these types, tutor guidance proved the most effective at enhancing LLM repair capabilities. To fully harness LLMs' conversational capabilities and the benefits of augmented information, (3) we introduce CREF, a novel conversational semi-automatic repair framework that assists human tutors. CREF achieves a remarkable AVG-5 improvement of 17.2%-24.6% over the baseline, reaching an impressive AVG-5 of 76.6% when utilizing GPT-4. These results highlight the potential of enhancing LLMs' repair capabilities through interactions with tutors and historical conversations involving incorrect responses. The successful application of CREF in a real-world educational setting demonstrates its effectiveness in reducing tutors' workload and improving students' learning experience, while also showcasing its promise for facilitating other software engineering tasks, such as code review.