Issue2Test: Generating Reproducing Test Cases from Issue Reports

Automated tools for solving GitHub issues are receiving significant attention from both researchers and practitioners, e.g., in the form of foundation models and LLM-based agents prompted with issues. A crucial step toward successfully solving an issue is creating a test case that accurately reproduces the issue. Such a test case can guide the search for an appropriate patch and help validate whether the patch matches the issue's intent. However, existing techniques for issue reproduction show only moderate success. This paper presents Issue2Test, an LLM-based technique for automatically generating a reproducing test case for a given issue report. Unlike automated regression test generators, which aim at creating passing tests, our approach aims at a test that fails, and that fails specifically for the reason described in the issue. To this end, Issue2Test performs three steps: (1) understand the issue and gather context (e.g., related files and project-specific guidelines) relevant for reproducing it; (2) generate a candidate test case; and (3) iteratively refine the test case based on compilation and runtime feedback until it fails and the failure aligns with the problem described in the issue. We evaluate Issue2Test on the SWT-bench-lite dataset, where it successfully reproduces 32.9% of the issues, achieving a 16.3% relative improvement over the best existing technique. Our evaluation also shows that Issue2Test reproduces 20 issues that four prior techniques fail to address, contributing a total of 60.4% of all issues reproduced by these tools. We envision our approach contributing to progress on the important task of automatically solving GitHub issues.
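The three steps above can be sketched as a small generate-and-refine loop. This is a minimal illustration under stated assumptions, not the Issue2Test implementation: the names `RunResult`, `reproduce_issue`, `gather_context`, `generate_test`, `run_test`, `refine_test`, and `matches_issue` are hypothetical, the LLM calls are abstracted as plain callables, and the iteration budget `max_iterations` is an assumption (the abstract does not state a limit). The toy stubs at the end stand in for a real project and model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    """Feedback from executing a candidate test (hypothetical structure)."""
    compiled: bool  # the test could be compiled / imported
    passed: bool    # the test passed on the unfixed code
    output: str     # compiler or runtime output


def reproduce_issue(
    issue: str,
    gather_context: Callable[[str], str],
    generate_test: Callable[[str, str], str],
    run_test: Callable[[str], RunResult],
    refine_test: Callable[[str, str, str, RunResult], str],
    matches_issue: Callable[[str, str], bool],
    max_iterations: int = 5,  # assumed budget, not taken from the paper
) -> Optional[str]:
    """Return a test that fails for the reason in the issue, or None."""
    # Step 1: understand the issue and gather relevant context.
    context = gather_context(issue)
    # Step 2: generate a first candidate test case.
    test = generate_test(issue, context)
    # Step 3: refine until the test fails AND the failure matches the issue.
    for _ in range(max_iterations):
        result = run_test(test)
        if result.compiled and not result.passed and matches_issue(issue, result.output):
            return test
        test = refine_test(issue, context, test, result)
    return None


# Toy stubs: the first candidate passes (so it does not reproduce anything),
# the refined candidate fails with the error named in the issue.
ISSUE = "parse('') raises IndexError"


def toy_gather_context(issue: str) -> str:
    return "parser.py"


def toy_generate_test(issue: str, context: str) -> str:
    return "test_v1"


def toy_run_test(test: str) -> RunResult:
    if test == "test_v1":
        return RunResult(compiled=True, passed=True, output="1 passed")
    return RunResult(compiled=True, passed=False, output="IndexError: list index out of range")


def toy_refine_test(issue: str, context: str, test: str, result: RunResult) -> str:
    return "test_v2"


def toy_matches_issue(issue: str, output: str) -> bool:
    return "IndexError" in output


reproducing_test = reproduce_issue(
    ISSUE, toy_gather_context, toy_generate_test,
    toy_run_test, toy_refine_test, toy_matches_issue,
)
```

Note the acceptance condition: unlike a regression test generator, the loop rejects a passing test, and it also rejects a test that fails for a reason unrelated to the issue. In practice the `matches_issue` check would itself need judgment (for example, an LLM comparing the failure output with the issue text), which this sketch leaves open.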
