Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes in the reward function (e.g., spurious patterns in the training data) to achieve high scores without solving the intended task. These reward-hacking behaviors are often implicit: the intermediate chain-of-thought (CoT) may appear plausible on the surface, which limits the effectiveness of purely text-based monitoring. We propose Gradient Fingerprint (GRIFT), a method for detecting reward hacking using a model's internal computations. Given a prompt and a model-generated CoT, GRIFT computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward-hacking behavior. Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, GRIFT substantially outperforms strong baselines, including CoT Monitor and TRACE, achieving a relative improvement of over 25% in detecting reward-hacking behavior. Moreover, integrating GRIFT into the rejection fine-tuning pipeline for reasoning tasks reduces reward hacking and improves performance on the true task objective. Our results highlight a promising direction: leveraging gradient-level representations to assess the quality of CoT reasoning traces. Our code is available at https://github.com/songtao-x/reward_hack.
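To make the "compute gradients, then compress" step concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy bigram softmax model in place of a language model, takes the differentiated quantity to be the negative log-likelihood of the CoT tokens given the prompt, and uses a fixed Gaussian random projection as the compression; the abstract specifies none of these choices. All names (`cot_gradient`, `make_projection`, `fingerprint`, `cosine`, `VOCAB`, `FP_DIM`) are hypothetical.

```python
import math
import random

VOCAB = 8    # toy vocabulary size (assumption)
FP_DIM = 4   # fingerprint dimension (assumption)


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]


def cot_gradient(weights, prompt, cot):
    """Gradient of the CoT negative log-likelihood w.r.t. a bigram logit table.

    Only CoT tokens contribute to the loss; the prompt acts purely as context
    (in this toy bigram model, only its last token matters).
    """
    grad = [[0.0] * VOCAB for _ in range(VOCAB)]
    prev = prompt[-1]
    for tok in cot:
        probs = softmax(weights[prev])
        for j in range(VOCAB):
            # d(-log p[tok]) / d logit_j = p_j - 1[j == tok]
            grad[prev][j] += probs[j] - (1.0 if j == tok else 0.0)
        prev = tok
    return [g for row in grad for g in row]  # flatten to one vector


def make_projection(seed=0):
    """Fixed Gaussian random projection (assumed compression, not from the paper)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(FP_DIM)
    return [[rng.gauss(0.0, scale) for _ in range(VOCAB * VOCAB)]
            for _ in range(FP_DIM)]


def fingerprint(weights, prompt, cot, projection):
    """Compress the flattened CoT gradient into a FP_DIM-dimensional vector."""
    flat = cot_gradient(weights, prompt, cot)
    return [sum(p * g for p, g in zip(row, flat)) for row in projection]


def cosine(a, b):
    """Cosine similarity, e.g. for comparing a fingerprint to reference ones."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Usage: zero-initialized logits give a uniform next-token distribution.
weights = [[0.0] * VOCAB for _ in range(VOCAB)]
projection = make_projection()
fp = fingerprint(weights, prompt=[1, 3], cot=[2, 5, 2], projection=projection)
```

How the fingerprint is turned into a reward-hacking decision is not stated in the abstract. One possible choice, again an assumption, would be to compare `fp` against fingerprints of traces already labeled as hacking or non-hacking using `cosine`.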