Reinforcement Learning from Human Feedback (RLHF) learns from the preference signal provided by a probabilistic preference model, which takes a prompt and two responses as input and produces a score indicating how strongly one response is preferred over the other. To date, the most popular RLHF paradigm is reward-based: it starts with an initial reward-modeling step, and the constructed reward is then used to provide a reward signal for the subsequent reward-optimization stage. However, the existence of a reward function is a strong assumption, and reward-based RLHF is limited in expressivity: it cannot capture the complicated human preferences that arise in the real world. In this work, we provide theoretical insights into a recently proposed learning paradigm, Nash learning from human feedback (NLHF), which considers a general preference model and formulates the alignment process as a game between two competing LLMs. The learning objective is to find a policy that consistently generates responses preferred over those of any competing policy while staying close to the initial model; this objective is defined as the Nash equilibrium (NE) of the KL-regularized preference model. We make the first attempt to study the theoretical learnability of KL-regularized NLHF by considering both offline and online settings. For offline learning from a pre-collected dataset, we propose algorithms that are efficient under suitable coverage conditions on the dataset. For batch online learning from iterative interactions with a preference oracle, our proposed algorithm enjoys a finite-sample guarantee under a structural condition on the underlying preference model. Our results connect the new NLHF paradigm with traditional RL theory and validate the potential of reward-model-free learning under general preferences.
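For concreteness, a minimal sketch of the KL-regularized two-player objective underlying NLHF, written in the standard form used in the NLHF literature; the notation here ($\mathcal{P}$, $\pi_0$, $\eta$, $d_0$) is assumed for illustration rather than taken from the paper body:

% Hedged sketch: P(y > y' | x) is the preference model, pi_0 the initial
% (reference) model, eta > 0 the KL-regularization coefficient, and d_0 the
% prompt distribution.
\[
J(\pi, \pi') \;=\; \mathbb{E}_{x \sim d_0,\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
\bigl[\, \mathcal{P}(y \succ y' \mid x) \,\bigr]
\;-\; \eta\, \mathrm{KL}\bigl(\pi \,\|\, \pi_0\bigr)
\;+\; \eta\, \mathrm{KL}\bigl(\pi' \,\|\, \pi_0\bigr),
\]
% The learning target is the Nash equilibrium of this regularized game:
\[
(\pi^{*}, \pi'^{*}) \;=\; \arg\max_{\pi}\, \min_{\pi'}\; J(\pi, \pi').
\]

Under this symmetric regularization, the max player and the min player face mirror-image problems, so the equilibrium policy is the one that cannot be beaten (in regularized preference value) by any competing policy while remaining close to $\pi_0$.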