No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation

Unit testing is essential for detecting bugs in functionally discrete program units, but manually writing high-quality unit tests is time-consuming and laborious. Although traditional techniques can generate tests with reasonable coverage, the resulting tests have low readability and cannot be directly adopted by developers. Recent work has shown the considerable potential of large language models (LLMs) in unit test generation, as they can generate more human-like and meaningful test code. ChatGPT, the latest LLM incorporating instruction tuning and reinforcement learning, has performed well in various domains. However, it remains unclear how effective ChatGPT is in unit test generation. In this work, we perform the first empirical study to evaluate ChatGPT's capability of unit test generation. Specifically, we conduct a quantitative analysis and a user study to systematically investigate the quality of its generated tests in terms of correctness, sufficiency, readability, and usability. The tests generated by ChatGPT still suffer from correctness issues, including diverse compilation errors and execution failures. Still, the passing tests generated by ChatGPT resemble manually written tests, achieving comparable coverage and readability, and sometimes even being preferred by developers. Our findings indicate that generating unit tests with ChatGPT could be very promising if the correctness of its generated tests could be further improved. Inspired by these findings, we propose ChatTESTER, a novel ChatGPT-based unit test generation approach, which leverages ChatGPT itself to improve the quality of its generated tests. ChatTESTER incorporates an initial test generator and an iterative test refiner. Our evaluation demonstrates the effectiveness of ChatTESTER, which generates 34.3% more compilable tests and 18.7% more tests with correct assertions than the default ChatGPT.

