Multi-domain thinking verifiers trained via Reinforcement Learning with Verifiable Rewards (RLVR) are a cornerstone of modern post-training. However, their adoption in code generation has lagged behind that of execution feedback because the full RLVR pipeline is prohibitively expensive. In this work, we ablate three primary design choices along the performance-cost trade-off in RLVR: intermediate thinking traces, learning from negative samples, and on-policy training. We introduce Aletheia, a controlled, execution-grounded testbed that enables contamination-free analysis of code verifier training recipes across model sizes and covariate shifts, in two common verifier application scenarios. Our analysis reveals that the optimal training recipe is scale-dependent: on-policy learning is the primary performance driver for small verifiers, whereas the thinking budget becomes the most important factor at larger scales. While negative samples have a consistent impact on top-1 selection accuracy across sizes, their contribution to ranking reconstruction increases monotonically with scale, and they play a key role in stabilizing training at large sizes. Our Pareto-optimality analysis demonstrates that eliminating on-policy training at larger model scales yields a verifier that performs comparably to the full RLVR recipe. Furthermore, we find that omitting thinking traces is a compute-efficient strategy at lower budgets, offering a strong trade-off between training cost and verifier accuracy. Ultimately, our work provides the empirical foundation needed to deploy robust code verifiers efficiently, thereby enabling their wider adoption in post-training pipelines for large code generation models.