In research on automated program repair (APR), benchmark datasets that pair known defects with test suites exposing those defects are of high importance, because they allow an evidence-based comparison of different APR approaches. In our own work on APR, we encountered significant challenges when working with widely used defect datasets, challenges that go beyond the mere reproducibility of defects via test cases. We summarize these challenges and the lessons learned to bring them to the attention of the APR community, and we quantify their potential impact. In particular, we investigate the widely used benchmark Defects4J, which has over 1,800 citations according to Google Scholar. It consists of 835 defects from 17 open-source Java projects and provides a hand-curated collection of defects, test suites that clearly indicate each defect, and human-written patches from which unrelated changes have been removed. We find that, when the test suites are executed under strict reproducibility requirements for APR settings (beyond merely reproducing the defect via test cases), 180 (21.6%) of the defects are not suitable for evaluation experiments. Further, we find that an additional 59 (7.1%) defects have test suites that are obviously under-specified: deleting a single statement from the code base makes all test cases pass, although the human-written patch does more than delete code. Our contributions are: a systematic collection of requirements for APR defect datasets that go beyond traditional reproducibility of defects; a description of practical experiences with, and a quantitative analysis of problems in, the Defects4J dataset; and an implementation of an evaluation framework for APR tools for Java programs. This evaluation framework checks more strictly for indications of inadequate test suites, so that problems such as flaky tests do not go unnoticed.