Automated Program Repair (APR) techniques have drawn wide attention from both academia and industry. Meanwhile, a major limitation of current state-of-the-art APR tools is that patches passing all the original tests are not necessarily the correct ones intended by developers, i.e., the plausible patch problem. To date, various Patch-Correctness Checking (PCC) techniques have been proposed to address this important issue. However, they have been evaluated only on very limited datasets, because the APR tools used to generate such patches can explore only a small subset of the search space of possible patches, posing serious threats to the external validity of existing PCC studies. In this paper, we construct an extensive PCC dataset (the largest manually labeled PCC dataset to our knowledge) to revisit all state-of-the-art PCC techniques. More specifically, our PCC dataset includes 1,988 patches generated by the recent APR tool PraPR, which leverages highly optimized bytecode-level patch execution and can exhaustively explore all possible plausible patches within its large predefined search space (including well-known fixing patterns from various prior APR tools). Our extensive study of representative PCC techniques on the new dataset has revealed various surprising findings and provided guidelines for future PCC research.