Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel online algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art online RLHF algorithms.
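As a rough illustration of the self-play idea described in the abstract, the sketch below pairs a chosen and a rejected response sampled from the current policy and minimizes a squared loss over log-probability ratios against the previous iterate. This is a minimal sketch under assumptions: the function names, the IPO-style squared loss form, and the 1/(2ητ) target are illustrative stand-ins, not the paper's exact objective.

```python
import torch

def sequence_log_ratio(policy_logps, prev_logps):
    # log pi_theta(y|x) - log pi_t(y|x), summed over response tokens
    return (policy_logps - prev_logps).sum(dim=-1)

def self_play_preference_loss(logp_win, logp_lose,
                              prev_logp_win, prev_logp_lose,
                              eta=0.1, tau=0.1):
    """Illustrative squared preference loss over (chosen, rejected) pairs.

    Both responses are assumed to be sampled from the current policy
    (self-play) and labeled by a preference annotator. Hypothetically,
    eta plays the role of an online (no-regret) learning rate and tau a
    KL-regularization strength; the target 1/(2*eta*tau) is an assumption.
    """
    h = sequence_log_ratio(logp_win, prev_logp_win) \
        - sequence_log_ratio(logp_lose, prev_logp_lose)
    target = 1.0 / (2.0 * eta * tau)
    return ((h - target) ** 2).mean()

# Toy usage with random per-token log-probabilities (batch=4, tokens=16)
logp_w, logp_l = torch.randn(4, 16), torch.randn(4, 16)
prev_w, prev_l = torch.randn(4, 16), torch.randn(4, 16)
print(self_play_preference_loss(logp_w, logp_l, prev_w, prev_l))
```

Because the loss is computed directly on annotated preference pairs, no per-response win-rate estimate is needed, which is the computational saving the abstract highlights.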