Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. This paper presents TestPilot, an adaptive test generation technique that leverages Large Language Models (LLMs). TestPilot uses Codex, an off-the-shelf LLM, to automatically generate unit tests for a given program without requiring additional training or few-shot learning on examples of existing tests. In our approach, Codex is provided with prompts that include the signature and implementation of a function under test, along with usage examples extracted from documentation. If a generated test fails, TestPilot's adaptive component attempts to generate a new test that fixes the problem by re-prompting the model with the failing test and error message. We implemented TestPilot for JavaScript and evaluated it on 25 npm packages comprising a total of 1,684 API functions. Our results show that the generated tests achieve up to 93.1% statement coverage (median 68.2%). Moreover, on average, 58.5% of the generated tests contain at least one assertion that exercises functionality from the package under test. Our experiments with excluding parts of the information included in the prompts show that all components contribute towards the generation of effective test suites. Finally, we find that TestPilot does not generate memorized tests: 92.7% of our generated tests have $\leq$ 50% similarity with existing tests (as measured by normalized edit distance), and none of them are exact copies.
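The adaptive re-prompting step described above can be illustrated with a minimal sketch. This is not TestPilot's actual implementation: the function names (`build_prompt`, `refine_prompt`, `generate_test`), the prompt wording, the `max_attempts` limit, and the `complete` and `run_test` callbacks (standing in for the LLM and the test runner) are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical callback types (assumptions, not part of TestPilot):
# `Complete` maps a prompt to generated test source code;
# `RunTest` maps test source code to (passed, error_message).
Complete = Callable[[str], str]
RunTest = Callable[[str], Tuple[bool, str]]


def build_prompt(signature: str, body: str, examples: List[str]) -> str:
    """Assemble an initial prompt from signature, implementation, and usage examples."""
    parts = [f"// Function under test: {signature}", body]
    parts += [f"// Usage example:\n{example}" for example in examples]
    parts.append("// Write a unit test for the function above.")
    return "\n".join(parts)


def refine_prompt(prompt: str, failing_test: str, error: str) -> str:
    """Extend the prompt with the failing test and its error message."""
    return "\n".join([
        prompt,
        "// The following test failed:",
        failing_test,
        f"// Error: {error}",
        "// Write a fixed version of the test.",
    ])


def generate_test(
    signature: str,
    body: str,
    examples: List[str],
    complete: Complete,
    run_test: RunTest,
    max_attempts: int = 3,  # assumed retry budget
) -> Optional[str]:
    """Return the first passing test, re-prompting after each failure."""
    prompt = build_prompt(signature, body, examples)
    for _ in range(max_attempts):
        test = complete(prompt)
        passed, error = run_test(test)
        if passed:
            return test
        # Feed the failure back to the model and try again.
        prompt = refine_prompt(prompt, test, error)
    return None  # no passing test within the retry budget
```

Passing the model and the test runner in as callbacks keeps the loop independent of any particular LLM API or test framework, which also makes it easy to exercise with stubs.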
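The similarity measure mentioned in the abstract can be sketched as follows. The abstract states only that similarity is measured by normalized edit distance; this sketch assumes character-level Levenshtein distance and defines similarity as one minus the distance divided by the length of the longer string. The function names `levenshtein` and `similarity` are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein distance using a rolling row."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed definition: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(a, b) / longest
```

Under this definition, a generated test would count as an exact copy of an existing test when `similarity` returns 1.0, and would fall within the $\leq$ 50% band when it returns at most 0.5.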