Generating unit tests is a crucial task in software development, demanding substantial time and effort from programmers. The advent of Large Language Models (LLMs) introduces a novel avenue for unit test script generation. This research experimentally investigates the effectiveness of LLMs, exemplified by ChatGPT, at generating unit test scripts for Python programs, and how the generated test cases compare with those produced by an existing unit test generator (Pynguin). For the experiments, we consider three types of code units: 1) procedural scripts, 2) function-based modular code, and 3) class-based code. The generated test cases are evaluated on coverage, correctness, and readability. Our results show that ChatGPT's coverage is comparable to Pynguin's, and in some cases superior. We also find that, for some categories, about a third of the assertions generated by ChatGPT were incorrect. In addition, there is minimal overlap between the statements missed by ChatGPT and those missed by Pynguin, suggesting that a combination of both tools may enhance unit test generation performance. Finally, in our experiments, prompt engineering improved ChatGPT's performance, yielding much higher coverage.
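The comparison of missed statements described above can be illustrated with a small sketch. It is not the procedure used in the study: the function names, the use of Jaccard similarity as the overlap measure, and the line numbers are all assumptions chosen for illustration. It only shows why a small overlap between the two tools' missed statements implies that their combined test suites would cover more than either alone.

```python
# Illustrative sketch only: hypothetical data, hypothetical helper names.

def coverage(statements, missed):
    """Fraction of statements executed by a test suite."""
    return (len(statements) - len(missed & statements)) / len(statements)


def missed_overlap(missed_a, missed_b):
    """Jaccard similarity of two sets of missed statements (assumed metric)."""
    union = missed_a | missed_b
    if not union:
        return 0.0
    return len(missed_a & missed_b) / len(union)


def combined_missed(missed_a, missed_b):
    """Statements still missed when both test suites are run together."""
    return missed_a & missed_b


# A hypothetical 20-statement module and the lines each tool's tests miss.
statements = set(range(1, 21))
missed_chatgpt = {3, 4, 17}
missed_pynguin = {9, 10, 17, 18}

cov_chatgpt = coverage(statements, missed_chatgpt)
cov_pynguin = coverage(statements, missed_pynguin)
overlap = missed_overlap(missed_chatgpt, missed_pynguin)
cov_combined = coverage(statements, combined_missed(missed_chatgpt, missed_pynguin))

print(f"ChatGPT coverage:  {cov_chatgpt:.2f}")
print(f"Pynguin coverage:  {cov_pynguin:.2f}")
print(f"Missed overlap:    {overlap:.2f}")
print(f"Combined coverage: {cov_combined:.2f}")
```

In this made-up example only one missed statement is shared, so running both suites together leaves a single statement uncovered. In practice the sets of missed lines would come from a coverage tool's report rather than being written by hand.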