Implementing automated unit tests is an important but time-consuming activity in software development. To assist developers in this task, many techniques for automating unit test generation have been developed. However, despite this effort, usable tools exist for very few programming languages. Moreover, studies have found that automatically generated tests suffer from poor readability and do not resemble developer-written tests. In this work, we present a rigorous investigation of how large language models (LLMs) can help bridge this gap. We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases. We illustrate how the pipeline can be applied to different programming languages, specifically Java and Python, and to complex software requiring environment mocking. We conducted an empirical study to assess the quality of the generated tests in terms of code coverage and test naturalness, evaluating them on standard as well as enterprise Java applications and a large Python benchmark. Our results demonstrate that LLM-based test generation, when guided by static analysis, can be competitive with, and even outperform, state-of-the-art test-generation techniques in coverage achieved, while also producing considerably more natural test cases that developers find easy to understand. We also present the results of a user study, conducted with 161 professional developers, that highlights the naturalness characteristics of the tests generated by our approach.