SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Software engineering (SWE) agents are improving rapidly, with recent gains driven largely by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for training remain limited in scale and diversity, and often target only a small set of high-resource language ecosystems. We introduce SWE-rebench V2, a language-agnostic automated pipeline for harvesting executable real-world SWE tasks and constructing RL training environments at scale. The pipeline synthesizes repository-specific installation and test procedures via an interactive setup agent, and filters unsound instances using an ensemble of LLM judges, validated against human-verified SWE-bench annotations. Using this pipeline, we construct a dataset of 32,079 tasks spanning 20 languages and 3,617 repositories, with pre-built images for reproducible execution. To further scale training data, we additionally release 120,000+ tasks with installation instructions, fail-to-pass tests, and rich metadata, in which the problem statement is generated from the original pull request description. We validate the collected instances through a diagnostic study covering a subset of tasks in five programming languages across seven popular models, and we provide instance-level metadata that flags common confounders such as overly restrictive tests and underspecified descriptions. We release the datasets, the collection and execution code, and associated artifacts to enable large-scale training of SWE agents across diverse languages and repositories.
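
To make the term "fail-to-pass tests" concrete, the following is a minimal sketch and not the pipeline's actual implementation. It assumes that the test suite is run once before and once after the reference patch is applied, and that each run yields a mapping from test name to status. The function name `fail_to_pass`, the status strings "passed" and "failed", and the test names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def fail_to_pass(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return the tests that fail before the patch and pass after it.

    `before` and `after` map test names to a status string. The status
    vocabulary ("passed" / "failed") is an assumption of this sketch.
    """
    return sorted(
        name
        for name, status in after.items()
        # A test that is absent from the pre-patch run is not counted.
        if status == "passed" and before.get(name) == "failed"
    )


# Hypothetical results from running the same suite twice.
before = {"test_parse": "failed", "test_format": "passed", "test_io": "failed"}
after = {"test_parse": "passed", "test_format": "passed", "test_io": "failed"}

print(fail_to_pass(before, after))  # ['test_parse']
```

Under these assumptions, a task whose fail-to-pass list is empty gives no executable signal that a candidate patch resolves the issue, which is why such tests matter for RL training environments.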