Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality of proxy rewards, such as ranking accuracy, treat incorrect rewards as strictly harmful. In this work, however, we highlight that not all deviations from the ground truth are equal. By theoretically analyzing which outputs attract probability during policy gradient optimization, we categorize reward errors according to their effect on the increase in ground truth reward. The analysis establishes that reward errors, though conventionally viewed as harmful, can also be benign or even beneficial by preventing the policy from stalling around outputs with mediocre ground truth reward. We then present two practical implications of our theory. First, for reinforcement learning from human feedback (RLHF), we develop reward model evaluation metrics that account for the harmfulness of reward errors. Compared to standard ranking accuracy, these metrics typically correlate better with the performance of a language model after RLHF, yet gaps remain in robustly evaluating reward models. Second, we provide insights for reward design in settings with verifiable rewards. A key theme underlying our results is that the effectiveness of a proxy reward function depends heavily on its interaction with the initial policy and learning algorithm.