Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program under test. Recently, researchers have recognized the potential of large language models (LLMs) in software testing. However, fair comparisons of the test case generation capabilities of different LLMs are still lacking. In this paper, we propose TESTEVAL, a novel benchmark for test case generation with LLMs. We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage. We further evaluate sixteen popular LLMs, including both commercial and open-source ones, on TESTEVAL. We find that generating test cases to cover specific program lines, branches, or paths is still challenging for current LLMs, indicating a lack of ability to comprehend program logic and execution paths. We have open-sourced our dataset and benchmark pipelines at https://llm4softwaretesting.github.io to contribute to and accelerate future research on LLMs for software testing.
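To make the targeted line coverage task concrete, the following minimal sketch checks whether a single test input executes a chosen line of a program under test. It is an illustration only and is not taken from the TESTEVAL pipeline: the function `classify` is a hypothetical stand-in for a benchmark program, and the helpers `executed_offsets` and `covers_line` are names introduced here. The sketch relies on Python's standard `sys.settrace` hook and identifies lines by their offset from the `def` line.

```python
import sys


def classify(x):            # offset 0 (hypothetical program under test)
    if x < 0:               # offset 1
        return "negative"   # offset 2
    if x == 0:              # offset 3
        return "zero"       # offset 4
    return "positive"       # offset 5


def executed_offsets(func, *args):
    """Run func(*args) and return the set of executed line offsets,
    measured relative to the first line of func."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Ignore frames that do not belong to the function under test.
        if frame.f_code is not code:
            return None
        if event == "line":
            hit.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        # Restore whatever tracer was active before (often None).
        sys.settrace(previous)
    return hit


def covers_line(func, args, target_offset):
    """Return True if calling func(*args) executes the target line."""
    return target_offset in executed_offsets(func, *args)
```

Under these assumptions, `covers_line(classify, (0,), 4)` asks whether the input `0` reaches the `return "zero"` line. Overall coverage could be approximated by taking the union of `executed_offsets` over several inputs, and a path-oriented variant would need to record the order of executed lines rather than a set.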