No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation

Unit testing plays an essential role in detecting bugs in functionally discrete program units (e.g., methods). Manually writing high-quality unit tests is time-consuming and laborious. Although traditional techniques can generate tests with reasonable coverage, those tests have been shown to exhibit low readability and still cannot be directly adopted by developers in practice. Recent work has shown the great potential of large language models (LLMs) in unit test generation. Because they are pre-trained on a massive corpus of developer-written code, these models can generate more human-like and meaningful test code. ChatGPT, the latest LLM that further incorporates instruction tuning and reinforcement learning, has exhibited outstanding performance in various domains. To date, it remains unclear how effective ChatGPT is in unit test generation. In this work, we perform the first empirical study to evaluate ChatGPT's capability of unit test generation. In particular, we conduct both a quantitative analysis and a user study to systematically investigate the quality of its generated tests in terms of correctness, sufficiency, readability, and usability. We find that the tests generated by ChatGPT still suffer from correctness issues, including diverse compilation errors and execution failures (mostly caused by incorrect assertions); however, the passing tests generated by ChatGPT closely resemble manually written tests, achieving comparable coverage and readability, and are sometimes even preferred by developers. Our findings indicate that generating unit tests with ChatGPT could be very promising if the correctness of its generated tests could be further improved.
