LLM-driven software engineering (SWE) agents have become a central testbed for real-world language-model capability, yet their training remains limited by the availability of high-quality SWE tasks. Existing synthetic data methods typically create tasks through fixed mutation or bug-injection procedures, so the resulting task distributions are largely independent of the agent's own weaknesses and training progress. We introduce Socratic-SWE, a closed-loop self-evolution framework that reuses the agent's historical solving traces as a source of training signal. Rather than treating traces only as evidence for reward computation, Socratic-SWE distills them into structured agent skills that summarize recurring failures and effective repair patterns. These skills then guide the generation of targeted repair tasks in real repositories. Candidate tasks are checked through execution-based validation and scored with a solver-gradient alignment reward, so that retained tasks are both verifiable and useful for improving the Solver. The updated Solver produces new traces, enabling the task curriculum to adapt over successive rounds. Across SWE-bench Verified, SWE-bench Lite, SWE-bench Pro, and Terminal-Bench 2.0, Socratic-SWE consistently improves over self-evolving baselines under the same compute budget, reaching 50.40% on SWE-bench Verified after three iterations. These results suggest that solving traces can serve as a scalable substrate for self-evolving SWE agents.
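The task-retention step described above (execution-based validation followed by a solver-gradient alignment score) can be illustrated with a small sketch. The abstract does not define the alignment reward, so the code below assumes one plausible reading: cosine similarity between the gradient a candidate task would induce on the Solver and a reference gradient direction. The names `Task`, `cosine`, `select_tasks`, `passes_validation`, `gradient`, and `threshold` are all hypothetical and are not taken from Socratic-SWE; gradients are toy vectors rather than real model gradients.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Task:
    """Hypothetical candidate repair task (not the paper's data structure)."""
    name: str
    passes_validation: bool  # stand-in for the outcome of execution-based validation
    gradient: List[float]    # stand-in for the Solver gradient this task would induce


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_tasks(
    candidates: Sequence[Task],
    reference_gradient: Sequence[float],
    threshold: float = 0.0,
) -> List[Task]:
    """Keep validated tasks whose assumed alignment score exceeds `threshold`.

    Assumption: alignment is modeled as cosine similarity to a reference
    gradient direction. The actual reward in Socratic-SWE may differ.
    """
    scored = []
    for task in candidates:
        if not task.passes_validation:
            continue  # unverifiable tasks are dropped before scoring
        score = cosine(task.gradient, reference_gradient)
        if score > threshold:
            scored.append((score, task))
    # Highest-alignment tasks first; sort on the score only.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [task for _, task in scored]


candidates = [
    Task("a", True, [1.0, 0.0]),
    Task("b", False, [1.0, 0.0]),   # fails validation
    Task("c", True, [-1.0, 0.0]),   # valid but anti-aligned
    Task("d", True, [1.0, 1.0]),
]
kept = select_tasks(candidates, reference_gradient=[1.0, 0.0])
print([t.name for t in kept])
```

In this toy setting, filtering on validation first and alignment second mirrors the two criteria named in the abstract: a task must be verifiable before its usefulness to the Solver is considered.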