Both online and offline RLHF methods, such as PPO and DPO, have been extremely successful in aligning AI with human preferences. Despite their success, existing methods suffer from a fundamental problem: their optimal solution is highly task-dependent (i.e., not robust to out-of-distribution (OOD) tasks). Here we address this challenge by proposing Self-Improving Robust Preference Optimization (SRPO), a practical and mathematically principled offline RLHF framework that is completely robust to changes in the task. The key idea of SRPO is to cast the problem of learning from human preferences as a self-improvement process, which can be expressed mathematically as a min-max objective that jointly optimizes a self-improvement policy and a generative policy in an adversarial fashion. The solution to this optimization problem is independent of the training task and is therefore robust to changes in it. We then show that this objective can be re-expressed as a non-adversarial offline loss that can be optimized at scale with standard supervised techniques, without any need for a reward model or online inference. We demonstrate the effectiveness of SRPO in terms of AI Win-Rate (WR) against human (GOLD) completions. In particular, when evaluated on the OOD XSUM dataset, SRPO outperforms the celebrated DPO by a clear margin of 15% after 5 self-revisions, achieving a WR of 90%.
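As a rough illustration of the adversarial formulation sketched above, a min-max objective of this kind might take the following schematic form; the notation here (generative policy $\pi$, self-improvement policy $\rho$, preference probability $p$, regularization weight $\beta$, reference policy $\pi_{\mathrm{ref}}$, and prompt distribution $\mathcal{D}$) is assumed for illustration and is not necessarily the paper's exact formulation.
% Schematic (assumed) self-improvement min--max objective: the generative
% policy \pi produces a completion y, the self-improvement policy \rho tries
% to revise it into a preferred completion y', and both policies are
% KL-regularized toward a reference policy \pi_{ref}.
\[
\pi^{\ast} \in \arg\min_{\pi}\ \max_{\rho}\;
\mathbb{E}_{\,x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x),\; y' \sim \rho(\cdot \mid x, y)}
\Big[
  p\!\left(y' \succ y \mid x\right)
  - \beta\,\mathrm{KL}\!\left(\rho(\cdot \mid x, y)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right)
  + \beta\,\mathrm{KL}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right)
\Big].
\]
Under such a sketch, the inner maximization searches for revisions that beat the current completion, while the outer minimization drives the generative policy toward completions that leave no room for improvement, which is consistent with the task-independence claim in the abstract.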