Achieving mastery of real-world software engineering tasks is fundamentally bottlenecked by the scarcity of large-scale, high-quality training data. Scaling such data has been limited by the complexity of environment setup, unit-test generation, and problem-statement curation. In this paper, we propose ScaleSWE, an automated, sandboxed multi-agent workflow designed to construct high-quality SWE data at scale. The system coordinates three specialized agents for environment setup, test creation, and problem-description synthesis to process 6 million pull requests across 5,200 repositories, producing ScaleSWE-Data: 100K verified SWE instances, the largest such dataset to date. It substantially surpasses existing real-world datasets in repository diversity and reflects realistic task complexity. We further demonstrate the dataset's utility for training by distilling 71,498 high-quality trajectories and fine-tuning Qwen3-30B-A3B-Instruct to produce ScaleSWE-Agent. Our agent achieves a 64% resolve rate on SWE-bench Verified, a nearly three-fold improvement over the base model. ScaleSWE provides a scalable, reproducible approach to data construction for advancing LLM-based software engineering. ScaleSWE-Data will be publicly available.