Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 41.5% length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art iterative algorithm [Dong et al., 2024] under the BT model assumption. Additionally, our ablation study highlights the benefits of incorporating KL regularization for response length control.
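The self-play idea in the abstract can be illustrated on a toy problem. The sketch below is not INPO: it does not use the paper's loss objective, a preference dataset, or KL regularization. It is a minimal tabular example, assuming a hypothetical 3-response preference matrix `P` with cyclic (non-transitive) preferences that no single BT reward can represent. A policy plays against itself with the standard multiplicative-weights (Hedge) update, one well-known no-regret rule, and the time-averaged policy approaches the Nash policy, which is uniform for this symmetric game. All names (`P`, `win_rates`, `self_play_hedge`, `exploitability`) are illustrative.

```python
import math

# Hypothetical toy preference matrix (an assumption, not from the paper):
# P[i][j] = probability that response i is preferred over response j.
# Preferences are cyclic: 0 beats 1, 1 beats 2, 2 beats 0.
P = [
    [0.5, 0.8, 0.2],
    [0.2, 0.5, 0.8],
    [0.8, 0.2, 0.5],
]


def win_rates(policy):
    """Expected win rate of each response against an opponent drawn from `policy`."""
    n = len(P)
    return [sum(P[i][j] * policy[j] for j in range(n)) for i in range(n)]


def self_play_hedge(init_policy, steps):
    """Run Hedge in self-play and return the time-averaged policy."""
    n = len(init_policy)
    # Standard Hedge step size for payoffs in [0, 1].
    eta = math.sqrt(8.0 * math.log(n) / steps)
    policy = list(init_policy)
    avg = [0.0] * n
    for _ in range(steps):
        # Accumulate the policy that is played at this round.
        for i in range(n):
            avg[i] += policy[i] / steps
        # The opponent is the current policy itself (self-play).
        rates = win_rates(policy)
        # Multiplicative-weights update, then renormalize.
        weights = [p * math.exp(eta * r) for p, r in zip(policy, rates)]
        total = sum(weights)
        policy = [w / total for w in weights]
    return avg


def exploitability(policy):
    """How far the best single response beats `policy` above the 50% tie level."""
    return max(win_rates(policy)) - 0.5


start = [0.7, 0.2, 0.1]
avg_policy = self_play_hedge(start, steps=2000)
print("start exploitability:", round(exploitability(start), 3))
print("averaged policy:", [round(p, 3) for p in avg_policy])
print("averaged exploitability:", round(exploitability(avg_policy), 3))
```

The averaged policy, rather than the last iterate, is reported because the no-regret guarantee for plain Hedge applies to the average. In this toy setting the win rate of every response is computed exactly from `P`; the abstract's point is that INPO avoids estimating such per-response win rates when only a preference dataset is available.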