Research on large language models (LLMs) in software engineering has largely focused on tasks such as code generation and bug repair. In practice, teams often draft multiple candidate proposals for fixing an issue and then deliberate to settle on one golden proposal for implementation. This selection requires assessing the issue's scope, impact, and urgency, as well as clearly understanding each proposal's strengths and weaknesses. A good selection can make issue resolution more reliable while reducing regression and operational risk, whereas a poor one can increase risk and even cause unpredictable failures. We first conduct a manual study of real-world issues to characterize the rationales maintainers use when selecting among competing proposals. Motivated by these findings, we introduce SWE-Manager, a joint selection and synthesis approach. SWE-Manager is an 8B-parameter model trained via reinforcement learning (RL) to compare proposals, justify its choice, and synthesize a golden proposal for implementation. We treat proposal selection as a reasoning task, mirroring how technical managers review competing proposals by weighing the issue context and each proposal's solution without executing code or running tests. On the SWE-Lancer Manager benchmark, SWE-Manager achieves 53.21% selection accuracy and a 57.75% earn rate, earning $152,750 and outperforming strong baselines including GPT-5. To further evaluate the effectiveness of SWE-Manager in real-world issue resolution, we design the P2A framework, which simulates a real-world workflow in which multiple proposals are drafted and reviewed, and a golden proposal is selected for implementation ...
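The abstract reports selection accuracy and earn rate without defining them. As a minimal sketch, assume selection accuracy is the percentage of tasks on which the model picks the proposal the maintainers chose, and earn rate is the percentage of the total available payout attached to those correctly decided tasks. The `ManagerTask` schema, the function names, and the toy payouts below are all hypothetical and are not taken from the benchmark or from SWE-Manager.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ManagerTask:
    """One proposal-selection task (hypothetical schema, for illustration only)."""
    task_id: str
    payout_usd: float      # reward attached to the task
    gold_proposal: int     # index of the proposal the maintainers chose
    chosen_proposal: int   # index of the proposal the model selected


def selection_accuracy(tasks: Sequence[ManagerTask]) -> float:
    """Percentage of tasks where the model's choice matches the gold proposal."""
    if not tasks:
        raise ValueError("no tasks to score")
    correct = sum(t.chosen_proposal == t.gold_proposal for t in tasks)
    return 100.0 * correct / len(tasks)


def earnings(tasks: Sequence[ManagerTask]) -> float:
    """Total payout of the tasks the model decided correctly."""
    return sum(t.payout_usd for t in tasks if t.chosen_proposal == t.gold_proposal)


def earn_rate(tasks: Sequence[ManagerTask]) -> float:
    """Earned payout as a percentage of the total payout on offer (assumed definition)."""
    total = sum(t.payout_usd for t in tasks)
    if total <= 0:
        raise ValueError("total payout must be positive")
    return 100.0 * earnings(tasks) / total


# Toy data with made-up payouts; these are not benchmark numbers.
toy_tasks = [
    ManagerTask("t1", 250.0, gold_proposal=0, chosen_proposal=0),
    ManagerTask("t2", 500.0, gold_proposal=2, chosen_proposal=1),
    ManagerTask("t3", 1000.0, gold_proposal=1, chosen_proposal=1),
    ManagerTask("t4", 250.0, gold_proposal=3, chosen_proposal=0),
]

print(f"selection accuracy: {selection_accuracy(toy_tasks):.2f}%")
print(f"earnings: ${earnings(toy_tasks):,.0f}")
print(f"earn rate: {earn_rate(toy_tasks):.2f}%")
```

Under these assumed definitions the two metrics can diverge, as the reported 53.21% and 57.75% do: earn rate weights each correct decision by its payout, so getting higher-paying tasks right raises it above plain accuracy.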