Reproducible Automated Program Repair Is Hard -- Experiences With the Defects4J Dataset

In automated program repair (APR) research, benchmark datasets that pair known defects with test suites exposing those defects are highly important, because they allow an evidence-based comparison of different APR approaches. In our own work on APR, we encountered significant challenges when working with widely used defect datasets, and these challenges go beyond the mere repeatability of defects via test cases. We summarize the challenges and the lessons learned to bring them to the attention of the APR community, and we quantify their potential impact. In particular, we investigate the widely used benchmark Defects4J, which has over 1,800 citations according to Google Scholar. It consists of 835 defects from 17 open-source Java projects and provides a hand-curated collection of defects, test suites that clearly indicate each defect, and human-written patches from which unrelated changes have been removed. We find that, when the test suites are executed under strict reproducibility requirements for APR settings (beyond merely reproducing the defect via test cases), 180 (21.6%) of the defects are not suitable for evaluation experiments. Further, we find that an additional 59 (7.1%) defects have test suites that are obviously under-specified: deleting a single statement from the code base makes all test cases pass, although the human-written patch does more than delete code. Our contributions are: a systematic collection of requirements for APR defect datasets that go beyond traditional reproducibility of defects; a description of practical experiences with, and a quantitative analysis of problems in, the Defects4J dataset; and an implementation of an evaluation framework for APR tools for Java programs. This evaluation framework checks more strictly for indications of inadequate test suites, so that problems such as flaky tests do not go unnoticed.
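The two test-suite checks mentioned in the abstract (flaky tests and fixes by single-statement deletion) can be illustrated with a minimal sketch. This is not the authors' evaluation framework. It assumes a hypothetical `run_tests` callable that takes a program, modeled here as a list of statement strings, and returns a mapping from test name to pass/fail. The names `find_flaky_tests` and `single_deletion_fixes` are likewise invented for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical interface (an assumption, not part of Defects4J):
# a test runner maps a program variant to {test_name: passed}.
TestRunner = Callable[[List[str]], Dict[str, bool]]


def find_flaky_tests(run_tests: TestRunner, program: List[str],
                     repetitions: int = 5) -> List[str]:
    """Return tests whose outcome differs across repeated runs of the same program."""
    outcomes: Dict[str, set] = {}
    for _ in range(repetitions):
        for name, passed in run_tests(program).items():
            outcomes.setdefault(name, set()).add(passed)
    # A test that both passed and failed on identical input is flaky.
    return sorted(name for name, seen in outcomes.items() if len(seen) > 1)


def single_deletion_fixes(run_tests: TestRunner, program: List[str]) -> List[int]:
    """Return indices of statements whose deletion alone makes every test pass."""
    hits: List[int] = []
    for i in range(len(program)):
        variant = program[:i] + program[i + 1:]  # drop exactly one statement
        results = run_tests(variant)
        if results and all(results.values()):
            hits.append(i)
    return hits
```

A defect for which `single_deletion_fixes` returns a non-empty list, while the human-written patch does more than delete code, would be a candidate for the "under-specified test suite" category described above. In a real setting, `run_tests` would wrap compilation and execution of the project's test suite, which this sketch deliberately leaves abstract.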

