Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 41.5% length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art iterative algorithm [Dong et al., 2024] under the BT model assumption. Additionally, our ablation study highlights the benefits of incorporating KL regularization for response length control.
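To make the self-play idea concrete, here is a minimal toy sketch. It is not the INPO loss or the authors' implementation; it runs a textbook multiplicative-weights (Hedge) update on a hypothetical three-response preference matrix. The matrix `PREF`, the starting policy, the step size, and all function names are assumptions chosen for illustration. The preferences are cyclic (response 0 usually beats 1, 1 usually beats 2, 2 usually beats 0), which a Bradley-Terry reward cannot represent, and the Nash policy of the resulting symmetric game is uniform. Note that this toy computes the expected win rate of every response exactly, which is the quantity the abstract says INPO avoids estimating.

```python
import math

# Hypothetical preference matrix: PREF[i][j] is the probability that
# response i is preferred over response j. The cycle 0 > 1 > 2 > 0 is
# non-transitive, so no Bradley-Terry reward can reproduce it.
PREF = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]


def expected_win_rates(pref, policy):
    """Win rate of each response against an opponent sampling from `policy`."""
    n = len(pref)
    return [sum(pref[i][j] * policy[j] for j in range(n)) for i in range(n)]


def exploitability(pref, policy):
    """Margin by which the best single response beats `policy` (0 at Nash)."""
    return max(expected_win_rates(pref, policy)) - 0.5


def self_play_hedge(pref, init, steps=5000, eta=0.05):
    """Self-play with multiplicative weights; returns the averaged policy.

    The no-regret guarantee applies to the average of the iterates, so the
    average (not the last iterate) is what approximates the Nash policy.
    """
    n = len(pref)
    policy = list(init)
    average = [0.0] * n
    for _ in range(steps):
        for i in range(n):
            average[i] += policy[i] / steps
        # The policy plays against a copy of itself.
        wins = expected_win_rates(pref, policy)
        weights = [p * math.exp(eta * w) for p, w in zip(policy, wins)]
        total = sum(weights)
        policy = [w / total for w in weights]
    return average


if __name__ == "__main__":
    start = [0.6, 0.3, 0.1]
    nash_estimate = self_play_hedge(PREF, start)
    print("averaged policy:", [round(p, 3) for p in nash_estimate])
    print("exploitability before:", round(exploitability(PREF, start), 3))
    print("exploitability after:", round(exploitability(PREF, nash_estimate), 3))
```

In this sketch the averaged policy moves toward the uniform distribution and its exploitability shrinks as the number of steps grows, which is the sense in which no-regret self-play approximates the Nash policy. The step size trades off the two terms of the standard Hedge regret bound; the values used here are illustrative, not tuned.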

