Neural-Based Test Oracle Generation: A Large-Scale Evaluation and Lessons Learned

Defining test oracles is central to test development, but constructing them manually is expensive. While recent neural-based automated test oracle generation techniques have shown promise, their real-world effectiveness remains an open question. This paper investigates the effectiveness of TOGA, a recently developed neural-based method for automatic test oracle generation by Dinella et al. TOGA takes EvoSuite-generated test inputs and generates both exception and assertion oracles. In a Defects4J study, TOGA outperformed specification-, search-, and neural-based techniques, detecting 57 bugs, including 30 unique bugs not detected by other methods. To gain a deeper understanding of its applicability in real-world settings, we conducted a series of external, extended, and conceptual replication studies of TOGA. In a large-scale study involving 25 real-world Java systems, 223.5K test cases, and 51K injected faults, we evaluate TOGA's ability to improve fault-detection effectiveness relative to the state of the practice and the state of the art. We find that TOGA misclassifies the type of oracle needed 24% of the time and that, when it classifies correctly, around 62% of the time it is not confident enough to generate any assertion oracle. When it does generate assertion oracles, more than 47% of them are false positives, and the true positive assertions increase fault detection by only 0.3% relative to prior work. These findings expose limitations of the state-of-the-art neural-based oracle generation technique, provide valuable insights for improvement, and offer lessons for evaluating future automated oracle generation methods.
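To make the terminology concrete, the following minimal sketch contrasts an exception oracle, an assertion oracle, and a false-positive assertion oracle. It is an illustration only and is not TOGA's output or implementation: the `BoundedStack` class and all function names are hypothetical, the example is written in Python rather than Java for brevity, and it assumes that "false positive" refers to an oracle that fails on correct (non-faulty) code.

```python
class BoundedStack:
    """Hypothetical unit under test, invented for this illustration."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def push(self, item):
        if len(self.items) >= self.capacity:
            raise OverflowError("stack is full")
        self.items.append(item)

    def pop(self):
        # list.pop() raises IndexError when the list is empty
        return self.items.pop()


def exception_oracle_holds():
    """Test prefix: pop from an empty stack. Oracle: an exception is expected."""
    stack = BoundedStack(capacity=1)
    try:
        stack.pop()
    except IndexError:
        return True   # the expected exception was raised
    return False      # no exception was raised, so the oracle is violated


def assertion_oracle_holds():
    """Test prefix: push then pop. Oracle: the popped value equals the pushed one."""
    stack = BoundedStack(capacity=1)
    stack.push(42)
    return stack.pop() == 42


def false_positive_oracle_holds():
    """Same prefix with an incorrect oracle: it fails even though the code is correct."""
    stack = BoundedStack(capacity=1)
    stack.push(42)
    return stack.pop() == 0


if __name__ == "__main__":
    print("exception oracle holds:", exception_oracle_holds())
    print("assertion oracle holds:", assertion_oracle_holds())
    print("false-positive oracle holds:", false_positive_oracle_holds())
```

In this sketch the test prefix (the calls that set up and exercise the stack) plays the role of a generated test input, and the oracle is the check applied to the outcome. The third function shows why false positives are costly: the test fails without any fault being present, so a developer has to inspect and discard it.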
