Reward Hacking in Rubric-Based Reinforcement Learning

Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, but many open-ended settings must instead rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce the self-internalization gap, a verifier-free diagnostic based on policy log-probabilities that tracks reference-verifier quality and detects when a policy trained with the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking but does not by itself ensure that rubric gains correspond to broader quality gains.
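
To make the notion of verifier failure concrete, the following is a minimal sketch of one way to measure it: the share of rubric criteria credited by the training verifier that the reference panel rejects. The abstract does not say how the three judges are aggregated or how exploitation is scored, so the strict-majority vote and the names `panel_majority` and `verifier_exploitation_rate` are illustrative assumptions, not the authors' method.

```python
from typing import Sequence


def panel_majority(votes: Sequence[bool]) -> bool:
    """Return True when a strict majority of reference judges accept a criterion.

    Assumption: majority voting is used here only for illustration.
    """
    return sum(votes) * 2 > len(votes)


def verifier_exploitation_rate(
    train_verdicts: Sequence[bool],
    panel_verdicts: Sequence[Sequence[bool]],
) -> float:
    """Share of criteria credited by the training verifier that the panel rejects.

    train_verdicts[i] is the training verifier's verdict on criterion i;
    panel_verdicts[i] holds the reference judges' verdicts on the same criterion.
    """
    if len(train_verdicts) != len(panel_verdicts):
        raise ValueError("one panel verdict list is required per criterion")
    # Keep only the criteria the training verifier credited.
    credited = [
        panel_majority(votes)
        for credited_by_train, votes in zip(train_verdicts, panel_verdicts)
        if credited_by_train
    ]
    if not credited:
        return 0.0
    rejected = sum(1 for accepted in credited if not accepted)
    return rejected / len(credited)


# Hypothetical verdicts for four rubric criteria on a single response.
train = [True, True, True, False]
panel = [
    (True, True, False),    # panel agrees with the training verifier
    (False, False, True),   # panel rejects: credited but not satisfied
    (False, False, False),  # panel rejects unanimously
    (True, True, True),     # not credited by the training verifier, so ignored
]
rate = verifier_exploitation_rate(train, panel)
```

Tracking such a rate across checkpoints would show whether exploitation grows over training; a value near zero would indicate that proxy-reward gains are largely confirmed by the reference panel.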
