Test suites in real-world projects are often large and achieve high code coverage, yet they remain insufficient for detecting all bugs. The abundance of unresolved issues in open-source project trackers highlights this gap. While regression tests are typically designed to ensure that past functionality is preserved in new versions, they can also serve a complementary purpose: debugging the current version. Specifically, regression tests can (1) enhance the generation of reproduction tests for newly reported issues, and (2) validate that patches do not regress existing functionality. We present TestPrune, a fully automated technique that leverages issue tracker reports and strategically reuses regression tests for both bug reproduction and patch validation. A key contribution of TestPrune is its ability to automatically minimize the regression suite to a small, highly relevant subset of tests. Because LLM-based debugging techniques now predominate, this minimization is essential: large test suites exceed context limits, introduce noise, and inflate inference costs. TestPrune can be plugged into any agentic bug repair pipeline and orthogonally improve overall performance. As a proof of concept, we show that TestPrune leads to a 6.2%-9.0% relative increase in issue reproduction rate within the Otter framework and a 9.4%-12.9% relative increase in issue resolution rate within the Agentless framework on the SWE-Bench Lite and SWE-Bench Verified benchmarks, capturing fixes that were correctly produced by agents but not submitted as final patches. Relative to these benefits, the cost overhead of using TestPrune is minimal: \$0.02 and \$0.05 per SWE-Bench instance with the GPT-4o and Claude-3.7-Sonnet models, respectively.