Small language models (SLMs) offer compelling advantages in cost, latency, and adaptability, but have so far lagged behind larger models on long-horizon software engineering tasks such as SWE-bench, where they suffer from pervasive action looping and low resolution rates. We introduce SWE-Protégé, a post-training framework that reframes software repair as an expert-protégé collaboration problem. In SWE-Protégé, an SLM remains the sole decision-maker while learning to selectively seek guidance from a strong expert model, recognize stalled states, and follow through on expert feedback. Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration. We lightly post-train Qwen2.5-Coder-7B-Instruct to achieve 42.4% Pass@1 on SWE-bench Verified, a +25.4% improvement over the prior SLM state of the art, while using expert assistance sparsely (~4 calls per task and 11% of total tokens).
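To make the idea of discouraging degenerative looping and unproductive expert collaboration concrete, here is a minimal sketch of a stall detector and a shaped reward. The abstract does not specify the actual reward, so everything here is an illustrative assumption: the function names `is_stalled` and `shaped_reward`, the window size, the repeat threshold, and the penalty values are all hypothetical and are not taken from SWE-Protégé.

```python
from collections import Counter


def is_stalled(actions, window=6, max_repeats=3):
    """Heuristic stall check (hypothetical): True if any single action
    occurs at least `max_repeats` times within the last `window` actions."""
    recent = actions[-window:]
    if not recent:
        return False
    return max(Counter(recent).values()) >= max_repeats


def shaped_reward(resolved, actions, expert_calls,
                  loop_penalty=0.5, call_cost=0.05):
    """Toy episode reward (hypothetical): task success, minus a penalty
    when the trajectory ends in a loop, minus a small cost per expert call."""
    reward = 1.0 if resolved else 0.0
    if is_stalled(actions):
        reward -= loop_penalty
    reward -= call_cost * expert_calls
    return reward


if __name__ == "__main__":
    looping = ["pytest", "pytest", "pytest", "pytest"]
    varied = ["ls", "cat a.py", "edit a.py", "pytest"]
    print(is_stalled(looping), is_stalled(varied))
    print(shaped_reward(True, varied, expert_calls=4))
```

In this sketch a resolved task with few expert calls scores highest, while a trajectory that repeats the same action or leans heavily on the expert scores lower, which mirrors the two behaviors the abstract says the reinforcement learning stage discourages.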