The alignment of large language models (LLMs) is crucial for generating helpful and harmless content. Existing approaches leverage preference-based human feedback data to learn a reward function and align the LLM with that feedback. However, these approaches focus on modeling the reward difference between the chosen and rejected demonstrations, rather than directly modeling the true reward of each individual demonstration. Moreover, they assume that the reward is obtained only at the end of the response, overlooking intermediate rewards. These issues lead to insufficient use of the training signals in the feedback data, limiting the representational and generalization ability of the learned reward and potentially resulting in reward hacking. In this paper, we formulate LLM alignment as a Bayesian Inverse Reinforcement Learning (BIRL) problem and propose a novel training objective, Approximated Variational Alignment (AVA), which performs LLM alignment through Approximated Variational Reward Imitation Learning (AVRIL). The BIRL formulation enables both intermediate reward modeling and direct reward modeling on each single demonstration, improving the utilization of the training signals in the feedback data. Experiments show that AVA outperforms existing LLM alignment approaches in reward modeling, RL fine-tuning, and direct optimization.
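To make the distinction the abstract draws concrete, the sketch below contrasts the standard pairwise Bradley-Terry reward-modeling loss (which supervises only the terminal reward difference between a chosen and a rejected response) with a per-token view in which every token emits an intermediate reward and each single demonstration carries its own training signal. This is a minimal illustration of the critique, not the paper's AVA/AVRIL objective; all function names, shapes, and the random inputs are hypothetical.

```python
# Illustrative sketch only: pairwise terminal-reward supervision vs. a
# per-token (intermediate) reward view. Not the paper's AVA objective.
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Standard pairwise loss: only the reward *difference* between the two
    responses is supervised, never the absolute per-demonstration reward."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def per_token_return(token_rewards: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Intermediate-reward view: a reward is emitted at every token, and the
    sequence return is their masked sum, so each demonstration contributes a
    training signal of its own rather than only via a pairwise difference."""
    return (token_rewards * mask).sum(dim=-1)

# Toy usage with random "rewards" standing in for a hypothetical
# token-level reward head; mask would zero out padding positions.
torch.manual_seed(0)
batch, seq_len = 4, 16
mask = torch.ones(batch, seq_len)
r_tok_chosen = torch.randn(batch, seq_len)
r_tok_rejected = torch.randn(batch, seq_len)
loss = bradley_terry_loss(per_token_return(r_tok_chosen, mask),
                          per_token_return(r_tok_rejected, mask))
print(f"pairwise BT loss on summed token rewards: {loss.item():.4f}")
```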