For software testing research, Defects4J stands out as the primary benchmark dataset, offering a controlled environment to study real bugs from prominent open-source systems. However, prior research indicates that Defects4J might include tests added after the bug report was filed; such tests embed developer knowledge of the bug and can affect the measured effectiveness of fault localization. In this paper, we examine Defects4J's fault-triggering tests, focusing on the implications of developer knowledge for spectrum-based fault localization (SBFL) techniques. We first study when these tests were changed relative to the creation of the corresponding bug reports. We then study the effectiveness of SBFL techniques when the tests contain no developer knowledge. We found that 1) 55% of the fault-triggering tests were newly added to replicate the bug or to test for regression; 2) 22% of the fault-triggering tests were modified after the bug reports were created and thus contain developer knowledge of the bug; 3) developers often modify the tests to include new assertions or change the test code to reflect changes in the source code; and 4) the performance of SBFL techniques degrades significantly (up to -415% for Mean First Rank) when evaluated on the bugs without developer knowledge. We provide a dataset of bugs without developer knowledge to support future SBFL evaluations on Defects4J and to inform the design of future bug benchmarks.
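To make the Mean First Rank metric concrete, the following is a minimal sketch of how an SBFL score and the rank of the first faulty element could be computed from test coverage. It is an illustration under stated assumptions, not the evaluation pipeline used in the paper: the Ochiai formula is chosen here only as one well-known SBFL technique, the function names (`ochiai`, `first_rank`, `mean_first_rank`), the toy coverage data, and the alphabetical tie-breaking rule are all hypothetical.

```python
import math


def ochiai(ef, ep, total_failed):
    """Ochiai suspiciousness: ef / sqrt(total_failed * (ef + ep)).

    ef: failing tests covering the element; ep: passing tests covering it.
    """
    denom = math.sqrt(total_failed * (ef + ep))
    return ef / denom if denom else 0.0


def first_rank(coverage, failing, faulty):
    """Rank (1-based) of the highest-ranked faulty element.

    coverage: dict mapping test name -> set of covered elements.
    failing:  set of failing test names.
    faulty:   set of elements known to be faulty.
    Ties are broken alphabetically (an assumption made for this sketch).
    """
    elements = set().union(*coverage.values())
    scores = {}
    for element in elements:
        ef = sum(1 for t in failing if element in coverage[t])
        ep = sum(
            1 for t in coverage
            if t not in failing and element in coverage[t]
        )
        scores[element] = ochiai(ef, ep, len(failing))
    ranked = sorted(elements, key=lambda e: (-scores[e], e))
    return min(ranked.index(e) + 1 for e in faulty)


def mean_first_rank(bugs):
    """Average first rank over (coverage, failing, faulty) triples."""
    return sum(first_rank(*bug) for bug in bugs) / len(bugs)


# Toy bug whose failing test targets the faulty element "c" directly,
# as a test written with knowledge of the bug might.
targeted = (
    {"t_pass1": {"a", "b"}, "t_pass2": {"b", "c"}, "t_fail": {"c"}},
    {"t_fail"},
    {"c"},
)

# The same toy bug, but the failing test is broad and covers everything.
broad = (
    {"t_pass1": {"a", "b"}, "t_pass2": {"b", "c"}, "t_fail": {"a", "b", "c"}},
    {"t_fail"},
    {"c"},
)

rank_targeted = first_rank(*targeted)
rank_broad = first_rank(*broad)
mfr = mean_first_rank([targeted, broad])
```

In this toy data, the narrowly targeted failing test ranks the faulty element first, while the broad failing test lets a non-faulty element tie with it and pushes it to second place. This is the kind of effect the study measures at scale on real bugs.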