Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, existing data construction pipelines for this task are complex and labor-intensive. We identify three key limitations in existing pipelines: (1) collected test patches often omit binary file changes; (2) evaluation environments are constructed manually, which is labor-intensive; and (3) the fail2pass validation phase requires manually inspecting test logs and writing custom parsing code to extract test status. In this paper, we propose SWE-Factory, a fully automated issue resolution data construction pipeline that addresses these limitations. First, our pipeline automatically recovers missing binary test files and ensures the correctness of test patches. Second, we introduce SWE-Builder, an LLM-based multi-agent system that automates evaluation environment construction. Third, we introduce a standardized, exit-code-based log parsing method that automatically extracts test status, enabling fully automated fail2pass validation. Experiments on 671 real-world GitHub issues across four programming languages show that our method effectively constructs valid evaluation environments at a reasonable cost. For example, with GPT-4.1 mini, SWE-Builder constructs 337 valid task instances out of 671 issues, at $0.047 per instance. Our ablation study further demonstrates the effectiveness of each component of SWE-Builder. Manual inspection also confirms that our exit-code-based fail2pass validation method is highly accurate, achieving an F1 score of 0.99. Additionally, we conduct an exploratory experiment to investigate whether data constructed with SWE-Factory can enhance models' software engineering capabilities.