Aligning generative models with human preferences via RLHF typically suffers from overoptimization, where an imperfectly learned reward model can misguide the generative model to output undesired responses. We investigate this problem in a principled manner by identifying the source of the misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model: one that simultaneously minimizes the maximum likelihood loss and a reward penalty term. Here, the reward penalty term is introduced to prevent the policy from choosing actions with spuriously high proxy rewards, yielding provable sample efficiency of the algorithm under a partial-coverage-style condition. Moving from theory to practice, the proposed algorithm further enjoys an equivalent but surprisingly easy-to-implement reformulation. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines: (i) a preference optimization loss that directly aligns the policy with human preferences, and (ii) a supervised learning loss that explicitly regularizes the policy toward a (suitable) baseline distribution. In the context of aligning large language models (LLMs), this objective fuses the direct preference optimization (DPO) loss with the supervised fine-tuning (SFT) loss to help mitigate overoptimization toward undesired responses; accordingly, we name the algorithm Regularized Preference Optimization (RPO). Experiments on aligning LLMs demonstrate the improved performance of RPO compared with DPO baselines. Our work sheds light on the interplay between preference optimization and SFT in tuning LLMs, with both theoretical guarantees and empirical evidence.
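The combined objective described above can be sketched per preference pair as the DPO loss plus an SFT-style regularizer. In the minimal sketch below, the function names, the choice of the preferred responses as the baseline distribution, and the default weights `beta` and `eta` are illustrative assumptions, not the paper's exact formulation or hyperparameters.

```python
import math

def rpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
             beta=0.1, eta=0.005):
    """Illustrative RPO-style objective on a single preference pair.

    policy_logp_w / policy_logp_l: policy log-probabilities of the
    preferred (winning) and dispreferred (losing) responses;
    ref_logp_* are the same quantities under a frozen reference model.
    beta is the usual DPO temperature; eta weights the SFT regularizer.
    """
    # Implicit reward margin between the two responses, as in DPO.
    margin = beta * ((policy_logp_w - ref_logp_w)
                     - (policy_logp_l - ref_logp_l))
    # DPO term: negative log-sigmoid of the margin.
    dpo = -math.log(1.0 / (1.0 + math.exp(-margin)))
    # SFT-style term: negative log-likelihood of the baseline
    # distribution, here assumed to be the preferred responses.
    sft = -policy_logp_w
    return dpo + eta * sft
```

With `eta = 0` this reduces to the plain DPO loss; the extra term penalizes the policy for drifting away from the baseline responses even when the preference margin looks favorable, which is the mechanism the abstract credits with curbing overoptimization.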