Software testing is one of the critical phases of software development. Testing helps identify potential bugs and reduces maintenance costs. Automated test generation tools aim to ease test development by suggesting efficient, bug-revealing tests. Recently, researchers have leveraged Large Language Models (LLMs) of code to generate unit tests. Although the code coverage of generated tests is usually assessed, the literature acknowledges that coverage is only weakly correlated with the effectiveness of tests in detecting bugs. To address this limitation, we introduce MuTAP, which uses mutation testing to make LLM-generated test cases more effective at revealing bugs. MuTAP does so by augmenting prompts with surviving mutants, since those mutants highlight where the test cases fail to detect bugs. MuTAP can generate effective test cases in the absence of natural language descriptions of the Program Under Test (PUT). We employ different LLMs within MuTAP and evaluate their performance on different benchmarks. Our results show that the proposed method detects up to 28% more faulty human-written code snippets. Among these, 17% remained undetected by both the current state-of-the-art fully automated test generation tool (i.e., Pynguin) and zero-shot/few-shot learning approaches on LLMs. Furthermore, MuTAP achieves a Mutation Score (MS) of 93.57% on synthetic buggy code, outperforming all other approaches in our evaluation. Our findings suggest that although LLMs can serve as a useful tool for generating test cases, they require specific post-processing steps to enhance the effectiveness of those test cases. Without such steps, the generated test cases may suffer from syntactic or functional errors and may be ineffective in detecting certain types of bugs and in testing corner cases of PUTs.
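To make the terms above concrete, the following is a minimal sketch of how a mutation score and a set of surviving mutants could be computed, and how survivors might be fed back into a prompt. It is not MuTAP's implementation: the PUT `is_positive`, the two hand-written mutants, the helper names, and the prompt wording are all hypothetical, and the score is assumed to be the percentage of mutants killed by the test suite.

```python
from typing import Callable, List, Tuple

# Hypothetical Program Under Test (PUT); not taken from the paper.
def is_positive(x: int) -> bool:
    return x > 0

# Hand-written mutants of the PUT. A real mutation tool would generate these.
def mutant_ge(x: int) -> bool:
    return x >= 0  # relational operator changed: > becomes >=

def mutant_lt(x: int) -> bool:
    return x < 0  # relational operator changed: > becomes <

Mutant = Callable[[int], bool]
TestCase = Tuple[int, bool]  # (input, expected output of the original PUT)

def is_killed(mutant: Mutant, tests: List[TestCase]) -> bool:
    """A mutant is killed if at least one test observes a wrong output."""
    return any(mutant(arg) != expected for arg, expected in tests)

def surviving_mutants(mutants: List[Mutant], tests: List[TestCase]) -> List[Mutant]:
    """Mutants that no test in the suite can tell apart from the PUT."""
    return [m for m in mutants if not is_killed(m, tests)]

def mutation_score(mutants: List[Mutant], tests: List[TestCase]) -> float:
    """Percentage of mutants killed by the test suite (assumed definition)."""
    killed = len(mutants) - len(surviving_mutants(mutants, tests))
    return 100.0 * killed / len(mutants)

def augment_prompt(base_prompt: str, survivors: List[Mutant]) -> str:
    """Append the surviving mutants to the prompt (illustrative wording only)."""
    names = ", ".join(m.__name__ for m in survivors)
    return (
        f"{base_prompt}\n"
        f"# The tests above do not detect these mutants: {names}\n"
        f"# Add a test case that fails on them."
    )

mutants = [mutant_ge, mutant_lt]
initial_tests = [(5, True), (-3, False)]    # misses the boundary value 0
improved_tests = initial_tests + [(0, False)]  # boundary test kills mutant_ge
```

With the initial suite, `mutant_ge` survives because no test exercises the boundary value 0; adding the test `(0, False)` kills it. This mirrors the idea in the abstract that a surviving mutant points at a specific weakness of the current tests.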