Automated test generation has been studied extensively, yet most work targets complete software units, such as classes, and assesses the results with metrics such as code coverage. Modern software, in contrast, evolves mainly through small, targeted changes introduced in pull requests (PRs). Despite this, the crucial task of generating tests specifically for these PRs has been overlooked, and the performance of state-of-the-art tools on it remains unknown. This study evaluates two distinct approaches to PR-aware test generation: EvoSuite, a leading search-based tool, and GPT-4o, a widely used large language model (LLM). To measure how well they validate PR-specific changes, we assess their ability to generate fail-to-pass (F2P) test cases, that is, tests that fail on the code before the change and pass on the code after it. In our evaluation, EvoSuite outperformed GPT-4o, producing at least one F2P test for a significantly higher percentage of PRs (36 percent vs. 13 percent). GPT-4o was significantly hampered by a high rate of compilation errors (63 percent), whereas only 2 percent of EvoSuite's generated tests failed to run. Despite EvoSuite's relative success, our findings indicate that both tools are largely ineffective for this task: they failed to generate any meaningful change-capturing test for the large majority of the PRs (64 percent). Although neither generator achieved a high F2P ratio, we believe that agentic code generation methods may have significant potential for this task. Ultimately, our work highlights a critical gap in tooling and calls for the development of high-performance test generators tailored to the incremental nature of modern software development.
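To make the F2P criterion concrete, the following is a minimal Python sketch of how a single test could be classified and how the share of PRs with at least one F2P test could be computed. It is an illustration only, not the evaluation harness used in the study: the names `Outcome`, `is_fail_to_pass`, and `f2p_pr_rate` are hypothetical, and the sketch assumes that a test which does not compile or cannot run is recorded as an error rather than as a failure.

```python
from enum import Enum


class Outcome(Enum):
    """Result of running one generated test against one code version."""
    PASS = "pass"
    FAIL = "fail"
    ERROR = "error"  # assumption: compilation or execution errors are tracked separately


def is_fail_to_pass(before: Outcome, after: Outcome) -> bool:
    """A test is F2P if it fails on the pre-change code and passes on the post-change code."""
    return before is Outcome.FAIL and after is Outcome.PASS


def f2p_pr_rate(results: dict[str, list[tuple[Outcome, Outcome]]]) -> float:
    """Fraction of PRs for which at least one generated test is F2P.

    `results` maps a (hypothetical) PR identifier to the (before, after)
    outcomes of every test generated for that PR.
    """
    if not results:
        return 0.0
    prs_with_f2p = sum(
        1
        for outcomes in results.values()
        if any(is_fail_to_pass(before, after) for before, after in outcomes)
    )
    return prs_with_f2p / len(results)


# Made-up example data: four PRs, only the first has an F2P test.
example = {
    "pr-1": [(Outcome.FAIL, Outcome.PASS), (Outcome.PASS, Outcome.PASS)],
    "pr-2": [(Outcome.PASS, Outcome.PASS)],
    "pr-3": [(Outcome.ERROR, Outcome.ERROR)],
    "pr-4": [],
}
print(f2p_pr_rate(example))  # 0.25
```

Under this reading, a test that passes on both versions does not capture the change, and a test that errors on both versions counts against the generator without contributing an F2P result.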