Large language model agents have advanced software engineering tasks, but existing test-based supervision does not scale, which limits the gains available from scaling data. The reason is twofold: (1) building and running test sandboxes is heavy and fragile, and (2) data with high-coverage tests is naturally rare and vulnerable to test hacking via edge cases. In this paper, we propose R4P, a patch verifier model that uses reasoning to provide scalable rewards for training and testing SWE agents. We argue that patch verification is fundamentally a reasoning task, mirroring how human repository maintainers review patches without writing and running new reproduction tests. To obtain sufficient reference and reduce the risk of reward hacking, R4P is trained with a group-wise RL objective, which lets it verify multiple patches against each other's modifications and yields a dense reward for stable training. R4P achieves 72.2% accuracy in verifying patches from SWE-bench-verified, surpassing OpenAI o3. To demonstrate R4P's practicality, we design and train a lightweight scaffold, Mini-SE, with pure reinforcement learning in which all rewards are derived from R4P. As a result, Mini-SE achieves 26.2% Pass@1 on SWE-bench-verified, a 10.0% improvement over the original Qwen3-32B. This can be further improved to 32.8% by using R4P for test-time scaling. Furthermore, R4P verifies a patch within a second, 50x faster than testing on average. The stable scaling curves of rewards and accuracy, together with this high efficiency, reflect R4P's practicality.
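To make the two roles of the verifier concrete, the following is a minimal sketch and not R4P's actual implementation. It assumes that a group-wise reward can be illustrated as the fraction of patches in a group whose verdict matches the ground-truth label, and that test-time scaling means best-of-N selection by verifier score. Neither detail is specified in the abstract. The names `group_reward`, `select_best`, and the toy scores are hypothetical.

```python
from typing import Callable, Sequence


def group_reward(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Fraction of patches in one group whose verdict matches the label.

    Hypothetical dense reward (an assumption, not the paper's formula):
    it takes values in {0, 1/n, ..., 1} instead of one pass/fail bit.
    """
    if len(predicted) != len(actual):
        raise ValueError("need one verdict per labelled patch")
    if not predicted:
        raise ValueError("group must contain at least one patch")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(predicted)


def select_best(patches: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N selection: return the candidate the verifier scores highest.

    `score` stands in for a verifier call; here it is any callable that
    maps a patch to a number (an assumption made for illustration).
    """
    if not patches:
        raise ValueError("need at least one candidate patch")
    return max(patches, key=score)


if __name__ == "__main__":
    # Three of four verdicts agree with the labels.
    print(group_reward([True, False, True, True], [True, False, False, True]))

    # Toy verifier scores for three candidate patches.
    toy_scores = {"patch_a": 0.2, "patch_b": 0.9, "patch_c": 0.5}
    print(select_best(list(toy_scores), toy_scores.get))
```

Under these assumptions, the reward for a group varies in steps of 1/n rather than being all-or-nothing, which is one way a group-level signal can be denser than a single verdict. The selection step needs no test sandbox, only verifier scores.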