In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to address this issue by learning the reward function. However, a learned reward model may have a low error on the data distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an error-regret mismatch. The main source of an error-regret mismatch is the distributional shift that commonly occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions that allow for error-regret mismatch to occur. We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF. We hope our results stimulate the theoretical and empirical study of improved methods to learn reward models, and better ways to measure their quality reliably.
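As a minimal, hypothetical illustration of an error-regret mismatch (a toy sketch, not the paper's construction or numbers), consider a one-step bandit in which the data distribution rarely covers one action. A learned reward can then have a very small expected error on that distribution while the policy optimized against it still incurs maximal regret under the true reward:

```python
import numpy as np

# Toy example (illustrative assumption, not from the paper): a one-step bandit
# with two actions. The data distribution D rarely samples action b, so a
# learned reward that is badly wrong on b can still have a tiny expected error
# under D, yet greedily maximizing the learned reward selects b and incurs
# large regret under the true reward.

true_reward    = np.array([1.0, 0.0])   # action a is optimal under the true reward
learned_reward = np.array([1.0, 2.0])   # wrong only on the rarely-sampled action b
data_dist      = np.array([0.99, 0.01]) # D: action b is almost never observed

# Expected (squared) test error of the learned reward under D.
expected_error = np.sum(data_dist * (learned_reward - true_reward) ** 2)

# Greedy policy under the learned reward, evaluated with the true reward.
greedy_action = int(np.argmax(learned_reward))
regret = np.max(true_reward) - true_reward[greedy_action]

print(f"expected test error under D: {expected_error:.3f}")   # 0.040
print(f"regret of the induced policy: {regret:.3f}")           # 1.000
```

The distribution shift is visible here: the optimized policy concentrates on exactly the action that the data distribution underweights, which is where the learned reward is allowed to be wrong.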