Unit testing is essential for detecting bugs in functionally discrete program units. Manually writing high-quality unit tests is time-consuming and laborious. Although traditional techniques can generate tests with reasonable coverage, the generated tests exhibit low readability and cannot be directly adopted by developers. Recent work has shown the great potential of large language models (LLMs) in unit test generation, as they can generate more human-like and meaningful test code. ChatGPT, the latest LLM incorporating instruction tuning and reinforcement learning, has performed well in various domains. However, it remains unclear how effective ChatGPT is in unit test generation. In this work, we perform the first empirical study to evaluate ChatGPT's capability of unit test generation. Specifically, we conduct a quantitative analysis and a user study to systematically investigate the quality of its generated tests regarding correctness, sufficiency, readability, and usability. The tests generated by ChatGPT still suffer from correctness issues, including diverse compilation errors and execution failures. Still, the passing tests generated by ChatGPT resemble manually written tests, achieving comparable coverage and readability, and are sometimes even preferred by developers. Our findings indicate that generating unit tests with ChatGPT could be very promising if the correctness of its generated tests could be further improved. Inspired by these findings, we propose ChatTESTER, a novel ChatGPT-based unit test generation approach, which leverages ChatGPT itself to improve the quality of its generated tests. ChatTESTER incorporates an initial test generator and an iterative test refiner. Our evaluation demonstrates the effectiveness of ChatTESTER, which generates 34.3% more compilable tests and 18.7% more tests with correct assertions than the default ChatGPT.
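To make the two-stage structure concrete (an initial test generator followed by an iterative test refiner), the following is a minimal sketch of a generate-validate-refine loop. It is not the ChatTESTER implementation: the abstract does not specify how refinement works, so the function names `generate`, `validate`, and `refine`, the `ValidationResult` type, and the `max_rounds` budget are all assumptions introduced for illustration. In a real setting `generate` and `refine` would presumably wrap LLM calls and `validate` would compile and run the test; here they are plain stubs so that the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ValidationResult:
    """Hypothetical outcome of compiling and running one candidate test."""
    ok: bool
    error: Optional[str] = None  # compiler or runtime message, if any


def generate_and_refine(
    focal_method: str,
    generate: Callable[[str], str],
    validate: Callable[[str], ValidationResult],
    refine: Callable[[str, str], str],
    max_rounds: int = 3,  # assumed refinement budget, not taken from the paper
) -> Tuple[str, bool, int]:
    """Return (test, passed, refinement_rounds_used)."""
    test = generate(focal_method)           # stage 1: initial test generation
    for rounds_used in range(max_rounds + 1):
        result = validate(test)
        if result.ok:
            return test, True, rounds_used
        if rounds_used == max_rounds:       # budget exhausted, stop refining
            break
        # stage 2: feed the failure message back and ask for a repaired test
        test = refine(test, result.error or "")
    return test, False, max_rounds


# Stubs standing in for the LLM and the build tool (illustrative only).
def fake_generate(focal_method: str) -> str:
    return "assertEquals(3, add(1, 2))"     # deliberately missing ';'


def fake_validate(test: str) -> ValidationResult:
    if test.endswith(";"):
        return ValidationResult(ok=True)
    return ValidationResult(ok=False, error="';' expected")


def fake_refine(test: str, error: str) -> str:
    return test + ";" if "';' expected" in error else test


final_test, passed, rounds = generate_and_refine(
    "int add(int a, int b)", fake_generate, fake_validate, fake_refine
)
```

With these stubs, the first candidate fails validation, one refinement round appends the missing semicolon, and the second validation succeeds. Bounding the loop with `max_rounds` is a design choice of this sketch: it keeps the cost predictable when a test cannot be repaired.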