Toward Training Superintelligent Software Agents through Self-Play SWE-RL

While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training data (e.g., GitHub issues and pull requests) and environments (e.g., pass-to-pass and fail-to-pass tests) heavily depend on human knowledge or curation, posing a fundamental barrier to superintelligence. In this paper, we present Self-play SWE-RL (SSR), a first step toward training paradigms for superintelligent software agents. Our approach makes minimal data assumptions, requiring only access to sandboxed repositories with source code and installed dependencies, with no need for human-labeled issues or tests. Grounded in these real-world codebases, a single LLM agent is trained via reinforcement learning in a self-play setting to iteratively inject and repair software bugs of increasing complexity, with each bug formally specified by a test patch rather than a natural language issue description. On the SWE-bench Verified and SWE-Bench Pro benchmarks, SSR achieves notable self-improvement (+10.4 and +7.8 points, respectively) and consistently outperforms the human-data baseline over the entire training trajectory, despite being evaluated on natural language issues absent from self-play. Our results, albeit early, suggest a path where agents autonomously gather extensive learning experiences from real-world software repositories, ultimately enabling superintelligent systems that exceed human capabilities in understanding how systems are constructed, solving novel challenges, and autonomously creating new software from scratch.
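
To make concrete the idea of a bug being "formally specified by a test patch", the following is a minimal toy sketch, not the SSR implementation. It assumes (the abstract does not state this) that an injected bug counts as valid when the accompanying tests pass on the original code and at least one fails on the bugged code, and that a repair is rewarded when all of those tests pass again. All names (`BugArtifact`, `run_tests`, `is_valid_bug`, `repair_reward`) are hypothetical, and plain Python functions stand in for a sandboxed repository.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Toy stand-ins: an "implementation" is a function, a "test" checks one.
Impl = Callable[[List[int]], int]
Test = Callable[[Impl], bool]


@dataclass
class BugArtifact:
    """Hypothetical output of the bug-injection role: bugged code plus a test patch."""
    buggy_impl: Impl
    tests: Dict[str, Test]


def run_tests(impl: Impl, tests: Dict[str, Test]) -> Dict[str, bool]:
    """Run every test against impl; an exception counts as a failure."""
    results = {}
    for name, test in tests.items():
        try:
            results[name] = bool(test(impl))
        except Exception:
            results[name] = False
    return results


def is_valid_bug(original: Impl, artifact: BugArtifact) -> bool:
    """Assumed validity rule: tests pass on the original, some fail on the bug."""
    before = run_tests(original, artifact.tests)
    after = run_tests(artifact.buggy_impl, artifact.tests)
    return all(before.values()) and not all(after.values())


def repair_reward(candidate: Impl, artifact: BugArtifact) -> float:
    """Assumed binary reward for the repair role: 1.0 if every test passes."""
    return 1.0 if all(run_tests(candidate, artifact.tests).values()) else 0.0


def original_total(xs: List[int]) -> int:
    return sum(xs)


def buggy_total(xs: List[int]) -> int:
    # Injected off-by-one bug: silently drops the last element.
    return sum(xs[:-1])


artifact = BugArtifact(
    buggy_impl=buggy_total,
    tests={
        "test_empty": lambda f: f([]) == 0,
        "test_total": lambda f: f([1, 2, 3]) == 6,
    },
)

valid = is_valid_bug(original_total, artifact)
reward_fixed = repair_reward(original_total, artifact)
reward_unfixed = repair_reward(buggy_total, artifact)
```

In this toy setting the test patch alone tells the repair role what "fixed" means, so no natural language issue description is needed. A real setup would apply patches and run the test suite inside the sandboxed repository instead of calling Python functions directly.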
