TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved?

Test-driven development (TDD) is the practice of writing tests first and code later, and its proponents cite numerous benefits. For instance, given an issue on a source code repository, tests can clarify the desired behavior among stakeholders before anyone writes code for the agreed-upon fix. Although there has been much work on automated test generation for the "write code first, test later" practice, there has been little such automation for TDD. Ideally, tests for TDD should be fail-to-pass (i.e., fail before the issue is resolved and pass after) and should adequately cover the code changed during issue resolution. This paper introduces TDD-Bench Verified, a high-quality benchmark suite of 449 issues mined from real-world GitHub code repositories. The benchmark's evaluation harness runs only the relevant tests, in isolation, for simple yet accurate coverage measurements, and the benchmark's dataset is filtered both by human judges and by execution in the harness. This paper also presents Auto-TDD, an LLM-based solution that takes as input an issue description and a codebase (prior to issue resolution) and returns as output a test that can be used to validate the changes made to resolve the issue. Our evaluation shows that Auto-TDD yields a better fail-to-pass rate than the strongest prior work while also yielding high coverage adequacy. Overall, we hope that this work helps make developers more productive at resolving issues while also leading to more robust fixes.
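To make the two test-quality criteria above concrete, the following minimal sketch shows one way they could be checked. It is an illustration only: the function names are hypothetical, and measuring adequacy as the fraction of changed lines that the test executes is an assumption, not necessarily the exact metric used by the benchmark's harness.

```python
from typing import AbstractSet


def is_fail_to_pass(passed_before_fix: bool, passed_after_fix: bool) -> bool:
    """A test is fail-to-pass if it fails on the pre-fix code and passes on the fixed code."""
    return (not passed_before_fix) and passed_after_fix


def changed_line_coverage(
    changed_lines: AbstractSet[int], covered_lines: AbstractSet[int]
) -> float:
    """Fraction of the lines changed by the fix that the test executes.

    Assumption: adequacy is approximated at line granularity; an empty
    change set is reported as 0.0 rather than raising.
    """
    if not changed_lines:
        return 0.0
    return len(changed_lines & covered_lines) / len(changed_lines)


if __name__ == "__main__":
    # Hypothetical example: the fix touched lines 10-13, the test ran lines 10, 11, 13, 40.
    print(is_fail_to_pass(passed_before_fix=False, passed_after_fix=True))
    print(changed_line_coverage({10, 11, 12, 13}, {10, 11, 13, 40}))
```

Under this sketch, a test that already passes before the fix is rejected regardless of its coverage, since it cannot demonstrate that the issue was ever reproduced.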
