Code generation models can assist with many common software tasks, ranging from code completion to defect prediction. Most existing benchmarks for code generation LLMs focus on code authoring or code completion. Surprisingly, far less effort has been dedicated to benchmarking software testing, despite the strong correlation between well-tested software and effective bug detection. To address this gap, we create and release TestGenEval, a large-scale benchmark for measuring test generation performance. Built on SWEBench, TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories. It covers initial test authoring, test suite completion, and code coverage improvement. Test authoring simulates a developer writing a test suite from scratch, while test completion mimics the scenario in which a developer aims to improve the coverage of an existing test suite. We evaluate several popular models, ranging in size from 7B to 405B parameters. Our detailed analysis highlights TestGenEval's contribution to a comprehensive evaluation of test generation performance. In particular, models struggle to generate high-coverage test suites: the best model, GPT-4o, achieves an average coverage of only 35.2%. This is primarily because models struggle to reason about execution and frequently make assertion errors when handling complex code paths.