Test oracles play a crucial role in software testing, enabling effective bug detection. Despite initial promise, neural-based methods for automated test oracle generation often produce large numbers of false positives and weak test oracles. While LLMs have demonstrated impressive effectiveness in various software engineering tasks, including code generation, test case creation, and bug fixing, large-scale studies of their effectiveness in test oracle generation are notably absent. Whether LLMs can address the challenges of effective oracle generation is a compelling question that requires thorough investigation. In this research, we present the first comprehensive study of the capabilities of LLMs in generating correct, diverse, and strong test oracles that can effectively identify a large number of unique bugs. To this end, we fine-tuned seven code LLMs using six distinct prompts on the SF110 dataset. Using the most effective fine-tuned LLM and prompt pair, we introduce TOGLL, a novel LLM-based method for test oracle generation. To investigate the generalizability of TOGLL, we conduct studies on 25 large-scale Java projects. Beyond correctness, we also assess the diversity and strength of the generated oracles, comparing the results against EvoSuite and the state-of-the-art neural method, TOGA. Our findings reveal that TOGLL produces 3.8 times more correct assertion oracles and 4.9 times more exception oracles than TOGA. Moreover, TOGLL generates significantly more diverse test oracles: it detects 1,023 unique bugs that EvoSuite cannot, ten times more than TOGA.