The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. However, while this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.
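For concreteness, "reward variance" can be read as the variance of the reward model's scores under the language model's own output distribution for a given prompt; a minimal sketch of this reading is below, where the policy $\pi_\theta$, reward model $r_{\mathrm{RM}}$, and prompt $x$ are notational assumptions not fixed by the abstract itself:

\[
\operatorname{Var}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_{\mathrm{RM}}(x, y) \right]
= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_{\mathrm{RM}}(x, y)^2 \right]
- \left( \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_{\mathrm{RM}}(x, y) \right] \right)^2 .
\]

Under this reading, a flat landscape arises informally as follows: when the reward model assigns nearly identical scores to the outputs the current policy actually produces, the gradient of the expected reward with respect to $\theta$ is small, so optimization makes little progress even if the reward model ranks outputs correctly.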