Achieving mastery in real-world software engineering tasks is fundamentally bottlenecked by the scarcity of large-scale, high-quality training data. Scaling such data has been limited by the complexity of environment setup, unit test generation, and problem statement curation. In this paper, we propose ScaleSWE, an automated, sandboxed multi-agent workflow designed to construct high-quality SWE data at scale. The system coordinates three specialized agents, one each for environment setup, test creation, and problem description synthesis, to process 6 million pull requests across 5,200 repositories, producing ScaleSWE Data: 100k verified SWE instances, the largest such dataset to date. It substantially surpasses existing real-world datasets in repository diversity and reflects realistic task complexity. We further demonstrate the dataset's utility for training by distilling 71,498 high-quality trajectories and fine-tuning Qwen30B-A3B-Instruct to produce ScaleSWE Agent. Our agent achieves a 64% resolve rate on SWE-bench Verified, a nearly threefold improvement over the base model. ScaleSWE provides a scalable, reproducible approach to data construction for advancing LLM-based software engineering. ScaleSWE will be made publicly available.