Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

Aligning generative models with human preferences via RLHF typically suffers from overoptimization, where an imperfectly learned reward model can misguide the generative model into producing undesired responses. We investigate this problem in a principled manner by identifying the source of the misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model, namely one that simultaneously minimizes the maximum likelihood estimation (MLE) loss and a reward penalty term. The reward penalty term prevents the policy from choosing actions with spuriously high proxy rewards, which yields provable sample efficiency of the algorithm under a partial-coverage-style condition. Moving from theory to practice, the proposed algorithm further admits an equivalent but surprisingly easy-to-implement reformulation. Using the equivalence between reward models and their corresponding optimal policies, the algorithm features a simple objective that combines: (i) a preference optimization loss that directly aligns the policy with human preferences, and (ii) a supervised learning loss that explicitly makes the policy imitate a (suitable) baseline distribution. In the context of aligning large language models (LLMs), this objective fuses the direct preference optimization (DPO) loss with the supervised fine-tuning (SFT) loss to help mitigate overoptimization towards undesired responses; we name the resulting algorithm Regularized Preference Optimization (RPO). Experiments on aligning LLMs demonstrate the improved performance of RPO compared with DPO baselines. Our work sheds light on the interplay between preference optimization and SFT in tuning LLMs, with both theoretical guarantees and empirical evidence.
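
As a rough illustration of the objective described above, the following minimal sketch combines a DPO-style preference loss with an SFT term, operating on per-example sequence log-probabilities. The function names, the weights `beta` and `eta` and their default values, and the use of the preferred response as the baseline for the SFT term are all assumptions made for illustration; none of them are specified in this abstract.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float) -> float:
    """DPO loss for one preference pair.

    Each argument is the log-probability of a whole response under the
    policy or the frozen reference model.
    """
    margin = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    return -log_sigmoid(margin)


def sft_loss(policy_baseline: float) -> float:
    """Negative log-likelihood of a baseline response under the policy."""
    return -policy_baseline


def rpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1, eta: float = 0.005) -> float:
    """Sketch of a regularized objective: DPO loss plus a weighted SFT loss.

    Assumption (hypothetical choice for this sketch): the baseline
    distribution is represented by the preferred response, so the SFT term
    reuses `policy_chosen`. `eta` is a hypothetical weight on the SFT term.
    """
    preference_term = dpo_loss(policy_chosen, policy_rejected,
                               ref_chosen, ref_rejected, beta)
    return preference_term + eta * sft_loss(policy_chosen)


if __name__ == "__main__":
    # Toy log-probabilities; in practice these come from model forward passes.
    print(rpo_loss(policy_chosen=-12.0, policy_rejected=-15.0,
                   ref_chosen=-13.0, ref_rejected=-14.0))
```

With `eta = 0` the sketch reduces to the plain DPO loss. A positive `eta` additionally rewards the policy for keeping the likelihood of the baseline responses high, which is the regularizing role the abstract attributes to the SFT loss.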
