The complexity of modern software has drastically increased the time and cost of detecting and fixing bugs. In response, researchers have explored various methods for automatically generating fixes for buggy code. However, because the combinatorial space of possible fixes for any given bug is large, few tools and datasets are available to evaluate model-generated fixes effectively. To address this issue, we introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their corresponding fixes. FixEval offers an extensive collection of unit tests for evaluating the correctness of model-generated program fixes and for assessing further information regarding time and memory constraints and verdict-based acceptance. We consider two Transformer language models pretrained on programming languages as our baselines and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not accurately reflect the correctness of model-generated program fixes, whereas execution-based methods evaluate programs through all cases and scenarios designed explicitly for that solution. Therefore, we believe FixEval provides a step towards real-world automatic bug fixing and model-generated code evaluation. The dataset and models are open-sourced at https://github.com/mahimanzum/FixEval.
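The gap between match-based and execution-based evaluation can be illustrated with a minimal Python sketch. This is not FixEval's evaluation harness: the function names (`exact_match`, `passes_tests`), the toy programs, and the test cases are all hypothetical, and the sketch assumes each program reads from standard input and writes to standard output, as competitive programming submissions typically do.

```python
import subprocess
import sys
from typing import List, Tuple


def exact_match(candidate: str, reference: str) -> bool:
    """Match-based check: compare whitespace-separated tokens of the two programs."""
    return candidate.split() == reference.split()


def passes_tests(source: str, tests: List[Tuple[str, str]], timeout: float = 5.0) -> bool:
    """Execution-based check: run the program on every test input and compare stdout.

    A timeout, a non-zero exit code, or a wrong output counts as a failure.
    """
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", source],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True


# Hypothetical toy problem: print the sum of two integers.
REFERENCE = "a, b = map(int, input().split())\nprint(a + b)\n"
# A correct fix that differs from the reference only in variable names.
CANDIDATE = "x, y = map(int, input().split())\nprint(x + y)\n"
# A textually closer program that is still wrong.
BUGGY = "a, b = map(int, input().split())\nprint(a - b)\n"
# Hypothetical (input, expected output) pairs.
TESTS = [("2 3\n", "5"), ("10 -4\n", "6")]
```

Here `CANDIDATE` fails the exact-match check even though it passes every test, while `BUGGY` shares nearly all of its tokens with `REFERENCE` yet fails on execution. A real harness would also need sandboxing and memory limits, which this sketch omits.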