The alignment of large language models (LLMs) is crucial for generating helpful and harmless content. Existing approaches leverage preference-based human feedback data to learn a reward function and align the LLM with that feedback. However, these approaches focus on modeling the reward difference between the chosen and rejected demonstrations, rather than directly modeling the true reward of each individual demonstration. Moreover, they assume that the reward is obtained only at the end of the response, overlooking intermediate rewards. These issues lead to insufficient use of the training signals in the feedback data, limiting the representational and generalization ability of the learned reward and potentially resulting in reward hacking. In this paper, we formulate LLM alignment as a Bayesian Inverse Reinforcement Learning (BIRL) problem and propose a novel training objective, Approximated Variational Alignment (AVA), which performs LLM alignment through Approximated Variational Reward Imitation Learning (AVRIL). The BIRL formulation enables both intermediate reward modeling and direct reward modeling on each single demonstration, improving the utilization of the training signals in the feedback data. Experiments show that AVA outperforms existing LLM alignment approaches in reward modeling, RL fine-tuning, and direct optimization.
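To make the distinction the abstract draws concrete, the sketch below contrasts the standard pairwise Bradley-Terry reward-modeling loss (which supervises only the terminal reward difference between a chosen and a rejected response) with a per-token view in which every token emits an intermediate reward and each single demonstration carries its own training signal. This is a minimal illustration of the critique, not the paper's AVA/AVRIL objective; all function names, shapes, and the random inputs are hypothetical.

```python
# Illustrative sketch only: pairwise terminal-reward supervision vs. a
# per-token (intermediate) reward view. Not the paper's AVA objective.
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Standard pairwise loss: only the reward *difference* between the two
    responses is supervised, never the absolute per-demonstration reward."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def per_token_return(token_rewards: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Intermediate-reward view: a reward is emitted at every token, and the
    sequence return is their masked sum, so each demonstration contributes a
    training signal of its own rather than only via a pairwise difference."""
    return (token_rewards * mask).sum(dim=-1)

# Toy usage with random "rewards" standing in for a hypothetical
# token-level reward head; mask would zero out padding positions.
torch.manual_seed(0)
batch, seq_len = 4, 16
mask = torch.ones(batch, seq_len)
r_tok_chosen = torch.randn(batch, seq_len)
r_tok_rejected = torch.randn(batch, seq_len)
loss = bradley_terry_loss(per_token_return(r_tok_chosen, mask),
                          per_token_return(r_tok_rejected, mask))
print(f"pairwise BT loss on summed token rewards: {loss.item():.4f}")
```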