Immersion in the GitHub Universe: Scaling Coding Agents to Mastery

Jiale Zhao, Guoxin Chen, Fanzhe Meng, Minghao Li, Jie Chen, Hui Xu, Yongshuai Sun, Wayne Xin Zhao, Ruihua Song, Yuan Zhang, Peng Wang, Cheng Chen, Jirong Wen, Kai Jia

Achieving mastery in real-world software engineering tasks is fundamentally bottlenecked by the scarcity of large-scale, high-quality training data. Scaling such data has been limited by the complexity of environment setup, unit test generation, and problem statement curation. In this paper, we propose ScaleSWE, an automated, sandboxed multi-agent workflow designed to construct high-quality SWE data at scale. The system coordinates three specialized agents for environment setup, test creation, and problem description synthesis to process 6 million pull requests across 5,200 repositories, producing ScaleSWE Data: 100k verified SWE instances, the largest such dataset to date. It substantially surpasses existing real-world datasets in repository diversity and reflects realistic task complexity. We further demonstrate the dataset's utility for training by distilling 71,498 high-quality trajectories and fine-tuning Qwen30B-A3B-Instruct to produce ScaleSWE Agent. Our agent achieves a 64% resolve rate on SWE-bench Verified, a nearly three-fold improvement over the base model. ScaleSWE provides a scalable, reproducible approach for data construction to advance LLM-based software engineering. ScaleSWE will be publicly available.
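The three-agent workflow described above can be pictured as a filter over pull requests: an instance is kept only if every stage succeeds. The following is a minimal sketch under stated assumptions, not the authors' implementation. All names (`PullRequest`, `SWEInstance`, `build_instances`, and the toy agents) are hypothetical, and the real agents are LLM-driven and run in a sandbox, which this sketch replaces with plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class PullRequest:
    repo: str
    number: int
    diff: str


@dataclass
class SWEInstance:
    repo: str
    pr_number: int
    environment: str          # hypothetical: identifier of the built environment
    tests: list[str]          # hypothetical: names of the generated unit tests
    problem_statement: str


# Hypothetical stage signatures; each returns None (or an empty value) on failure.
SetupAgent = Callable[[PullRequest], Optional[str]]
TestAgent = Callable[[PullRequest, str], Optional[list[str]]]
StatementAgent = Callable[[PullRequest], Optional[str]]


def build_instances(
    prs: Iterable[PullRequest],
    setup: SetupAgent,
    make_tests: TestAgent,
    describe: StatementAgent,
) -> list[SWEInstance]:
    """Keep a pull request only if all three stages produce a result."""
    kept = []
    for pr in prs:
        env = setup(pr)                 # stage 1: environment setup
        if env is None:
            continue
        tests = make_tests(pr, env)     # stage 2: test creation
        if not tests:
            continue
        statement = describe(pr)        # stage 3: problem description synthesis
        if not statement:
            continue
        kept.append(SWEInstance(pr.repo, pr.number, env, tests, statement))
    return kept


# Toy stand-ins for the agents, used only to exercise the pipeline.
def toy_setup(pr: PullRequest) -> Optional[str]:
    return None if "no-build" in pr.diff else f"env-{pr.repo}"


def toy_tests(pr: PullRequest, env: str) -> list[str]:
    return [f"test_pr_{pr.number}"] if "fix" in pr.diff else []


def toy_describe(pr: PullRequest) -> str:
    return f"Issue resolved by PR #{pr.number} in {pr.repo}"


prs = [
    PullRequest("a/b", 1, "fix bug"),
    PullRequest("a/b", 2, "no-build fix"),
    PullRequest("c/d", 3, "docs only"),
]
instances = build_instances(prs, toy_setup, toy_tests, toy_describe)

# Yield implied by the figures in the abstract: 100k instances from 6 million PRs.
reported_yield = 100_000 / 6_000_000
print(len(instances), f"{reported_yield:.2%}")
```

With the figures reported in the abstract, roughly one pull request in sixty survives to become a verified instance, which is why the filtering stages matter as much as the generation stages.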

