Aligning generative models with human preferences via RLHF typically suffers from overoptimization, where an imperfectly learned reward model can misguide the generative model into outputting undesired responses. We investigate this problem in a principled manner by identifying the source of the misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model, namely one that simultaneously minimizes the maximum likelihood loss and a reward penalty term. Here, the reward penalty term is introduced to prevent the policy from choosing actions with spuriously high proxy rewards, which yields provable sample efficiency of the algorithm under a partial-coverage-style condition. Moving from theory to practice, the proposed algorithm further enjoys an equivalent but surprisingly easy-to-implement reformulation. Using the equivalence between reward models and the corresponding optimal policies, the algorithm features a simple objective that combines: (i) a preference optimization loss that directly aligns the policy with human preferences, and (ii) a supervised learning loss that explicitly makes the policy imitate a (suitable) baseline distribution. In the context of aligning large language models (LLMs), this objective fuses the direct preference optimization (DPO) loss with the supervised fine-tuning (SFT) loss to help mitigate overoptimization toward undesired responses; we accordingly name the algorithm Regularized Preference Optimization (RPO). Experiments on aligning LLMs demonstrate the improved performance of RPO over DPO baselines. Our work sheds light on the interplay between preference optimization and SFT in tuning LLMs, with both theoretical guarantees and empirical evidence.
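To make the min-max structure concrete, the display below gives a schematic rendering of the theoretical objective described above. The notation (the reward class $\mathcal{R}$, the trade-off weight $\eta$, the prompt distribution $d_0$, the baseline policy $\pi_{\mathrm{base}}$, and the preference dataset $\mathcal{D}$) is introduced here only for illustration and should not be read as the exact formulation in the paper.

```latex
% Schematic min-max objective (notation ours; requires amsmath).
% The adversary picks the reward model r minimizing the preference-data
% likelihood loss plus a penalty on the policy's reward advantage over
% the baseline; the policy then maximizes this pessimistic value.
\[
\widehat{\pi} \;=\; \operatorname*{arg\,max}_{\pi}\;
\min_{r \in \mathcal{R}}
\Big\{\, \eta\,\mathcal{L}_{\mathrm{MLE}}(r;\mathcal{D})
\;+\; \mathbb{E}_{x \sim d_0,\, a \sim \pi(\cdot \mid x)}\big[r(x,a)\big]
\;-\; \mathbb{E}_{x \sim d_0,\, a \sim \pi_{\mathrm{base}}(\cdot \mid x)}\big[r(x,a)\big] \,\Big\}
\]
```

Because the adversarial reward is charged for any advantage the policy claims over the baseline, actions whose high proxy rewards are unsupported by the preference data are discounted, which is the pessimism mechanism behind the sample-efficiency guarantee.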
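As a practical illustration of the fused objective, here is a minimal PyTorch sketch of an RPO-style loss, assuming the preference optimization term takes the standard DPO logistic form and the SFT regularizer is the negative log-likelihood of the preferred responses. The function name, the weight `alpha`, and the argument conventions are hypothetical and not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def rpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1,
             alpha: float = 1.0) -> torch.Tensor:
    """Hypothetical sketch of an RPO-style objective: the DPO loss plus
    an SFT-style regularizer on the preferred (baseline) responses.

    Each input is a batch of summed log-probabilities log pi(y|x) of the
    chosen/rejected responses under the trained policy and the frozen
    reference model; `alpha` weights the SFT regularizer.
    """
    # DPO term: logistic loss on the implicit reward margin between
    # the chosen and rejected responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    dpo_term = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))

    # SFT term: negative log-likelihood of the chosen responses, pulling
    # the policy toward the baseline distribution and counteracting
    # overoptimization toward spuriously high proxy rewards.
    sft_term = -policy_chosen_logps

    return (dpo_term + alpha * sft_term).mean()
```

Setting `alpha = 0` recovers plain DPO, while larger values pull the policy harder toward the baseline distribution, matching the regularization role the SFT loss plays in the abstract.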