Test oracles play a crucial role in software testing, enabling effective bug detection. Despite initial promise, neural-based methods for automated test oracle generation often produce large numbers of false positives and weak test oracles. While LLMs have demonstrated impressive effectiveness in various software engineering tasks, including code generation, test case creation, and bug fixing, there is a notable absence of large-scale studies of their effectiveness in test oracle generation. Whether LLMs can address the challenges of effective oracle generation is a compelling question that requires thorough investigation. In this research, we present the first comprehensive study of the capabilities of LLMs in generating correct, diverse, and strong test oracles that can effectively identify a large number of unique bugs. To this end, we fine-tuned seven code LLMs using six distinct prompts on the SF110 dataset. Using the most effective fine-tuned LLM and prompt pair, we introduce TOGLL, a novel LLM-based method for test oracle generation. To investigate the generalizability of TOGLL, we conduct studies on 25 large-scale Java projects. Beyond correctness, we also assess the diversity and strength of the generated oracles, comparing the results against EvoSuite and the state-of-the-art neural method, TOGA. Our findings reveal that TOGLL can produce 3.8 times more correct assertion oracles and 4.9 times more exception oracles. Moreover, TOGLL generates significantly more diverse test oracles: it detects 1,023 unique bugs that EvoSuite cannot, ten times more than the previous state-of-the-art neural method, TOGA.
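To make the two oracle kinds measured in the study concrete, the following minimal sketch shows an assertion oracle (checking an observable value after a call) and an exception oracle (checking that an invalid call raises the expected exception). The class under test, `java.util.ArrayDeque`, is chosen only for illustration and is not part of the study's subjects.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.NoSuchElementException;

public class OracleExample {
    public static void main(String[] args) {
        Deque<Integer> stack = new ArrayDeque<>();

        // Assertion oracle: after pushing 42, peek() must observe that value.
        stack.push(42);
        if (stack.peek() != 42) {
            throw new AssertionError("assertion oracle failed: peek() != 42");
        }

        // Exception oracle: pop() on an empty deque must throw
        // NoSuchElementException (per the ArrayDeque contract).
        stack.pop(); // empty the deque
        boolean threw = false;
        try {
            stack.pop();
        } catch (NoSuchElementException e) {
            threw = true;
        }
        if (!threw) {
            throw new AssertionError("exception oracle failed: no exception thrown");
        }
        System.out.println("both oracles passed");
    }
}
```

An oracle generator such as TOGLL is asked to produce exactly these kinds of checks for a given unit under test; a false positive is an oracle that fails on correct code.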