Both online and offline RLHF methods, such as PPO and DPO, have been extremely successful in aligning AI with human preferences. Despite this success, existing methods suffer from a fundamental problem: their optimal solution is highly task-dependent, i.e., it is not robust to out-of-distribution (OOD) tasks. We address this challenge by proposing Self-Improving Robust Preference Optimization (SRPO), a practical and mathematically principled offline RLHF framework that is completely robust to changes in the task. The key idea of SRPO is to cast learning from human preferences as a self-improvement process, which can be expressed mathematically as a min-max objective that jointly optimizes a self-improvement policy and a generative policy in an adversarial fashion. The solution of this optimization problem is independent of the training task and is therefore robust to changes in it. We then show that this objective can be re-expressed as a non-adversarial offline loss, which can be optimized at scale with standard supervised optimization techniques, without any need for a reward model or online inference. We demonstrate the effectiveness of SRPO in terms of AI Win-Rate (WR) against human (GOLD) completions. In particular, when evaluated on the OOD XSUM dataset, SRPO outperforms the celebrated DPO by a clear margin of 15% after 5 self-revisions, achieving a WR of 90%.
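As an illustration of the win-rate metric used above, the following is a minimal sketch rather than the authors' evaluation code. It assumes that each evaluation example yields a single binary judgment of whether the model completion was preferred over the human (GOLD) completion. The function name `win_rate` and the example judgments are hypothetical and do not come from the paper.

```python
from typing import Sequence


def win_rate(model_preferred: Sequence[bool]) -> float:
    """Return the fraction of comparisons in which the model completion
    was preferred over the reference (GOLD) completion."""
    if not model_preferred:
        raise ValueError("need at least one comparison")
    # True counts as 1 and False as 0, so the sum is the number of wins.
    return sum(model_preferred) / len(model_preferred)


# Hypothetical judgments (not real results): 9 of 10 comparisons
# favour the model completion.
judgments = [True] * 9 + [False]
print(f"win rate: {win_rate(judgments):.0%}")
```

Under this definition, a WR of 90% means the model completion was preferred in nine out of every ten comparisons against the human completion.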