Code repair is a fundamental task in software development, facilitating efficient bug resolution and software maintenance. Although large language models (LLMs) have demonstrated considerable potential in automated code repair, their ability to comprehend and leverage diverse types of feedback — crucial for iterative self-correction in authentic debugging scenarios — remains insufficiently understood. To bridge this gap, we introduce FeedbackEval, a systematic benchmark constructed from three heterogeneous sources (HumanEval, CoderEval, and SWE-Bench-verified) to evaluate LLMs' feedback comprehension and code repair performance. We conduct a comprehensive empirical study on five state-of-the-art LLMs, including GPT-4o, Claude-3.5, Deepseek-R1, GLM-4, and Qwen2.5, to evaluate their behavior under both single-iteration and iterative code repair settings. Our results show that mixed feedback yields the highest repair success rate (63.6%), with LLM-Expert and test feedback providing strong targeted gains (62.9% and 57.9%, respectively), while minimal (53.1%) and compiler feedback (49.2%) offer moderate benefits and LLM-Skilled feedback proves least effective (48.8%). Iterative feedback further enhances repair performance, though the marginal benefit diminishes after two or three iterations. Moreover, prompt structure proves critical: structured reasoning (RR, CoT) and dynamic example selection deliver notable improvements, whereas removing semantic cues such as docstrings or role-play causes severe degradation. This work contributes a robust benchmark and delivers practical insights to advance the understanding and development of feedback-driven code repair using LLMs.