Automated unit test generation aims to improve software quality while reducing the time and effort required to create tests manually. However, existing techniques primarily generate regression oracles, which encode the implemented behavior of the class under test. They do not address the oracle problem: the challenge of distinguishing correct from incorrect program behavior. With the rise of Foundation Models (FMs), particularly Large Language Models (LLMs), there is a new opportunity to generate test oracles that reflect intended behavior. This positions LLMs as enablers of Promptware, where software creation and testing are driven by natural-language prompts. This paper presents an empirical study of the effectiveness of LLMs in generating test oracles that expose software failures. We investigate how different prompting strategies and levels of contextual information affect the quality of LLM-generated oracles. Our findings offer insights into the strengths and limitations of LLM-based oracle generation in the FM era, improving our understanding of their capabilities and fostering future research in this area.
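To make the regression-oracle versus intended-behavior distinction concrete, the following minimal sketch contrasts the two kinds of assertions on a hypothetical, defective class. The `PriceCalculator` example, its values, and the JUnit 5 setup are illustrative assumptions, not subjects or results from the study.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Hypothetical class under test with a subtle defect:
// the discount is applied before tax instead of after.
class PriceCalculator {
    // Intended: total = price * (1 + taxRate) - discount
    double total(double price, double taxRate, double discount) {
        return (price - discount) * (1 + taxRate); // bug: discount applied pre-tax
    }
}

class PriceCalculatorTest {

    // Regression-style oracle: asserts whatever the current implementation
    // returns (104.5), so the test passes and the defect goes unnoticed.
    @Test
    void regressionOracle() {
        assertEquals(104.5, new PriceCalculator().total(100.0, 0.10, 5.0), 1e-9);
    }

    // Intended-behavior oracle (the kind an LLM could derive from a
    // natural-language specification): asserts the specified result (105.0),
    // so the test fails and exposes the defect.
    @Test
    void intendedBehaviorOracle() {
        assertEquals(105.0, new PriceCalculator().total(100.0, 0.10, 5.0), 1e-9);
    }
}
```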