Learning from human feedback (LHF) -- and in particular learning from pairwise preferences -- has recently become a crucial ingredient in training large language models (LLMs), and has been the subject of much research. Most recent works frame it as a reinforcement learning problem, where a reward function is learned from pairwise preference data and the LLM is treated as a policy which is adapted to maximize the rewards, often under additional regularization constraints. We propose an alternative interpretation which centers on the generative process for pairwise preferences and treats LHF as a density estimation problem. We provide theoretical and empirical results showing that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution. Finally, we discuss and present findings on "annotator misspecification" -- failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly-adapted models -- suggesting that approaches that learn from pairwise human preferences could have trouble learning from a population of annotators with diverse viewpoints.
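The density-estimation reading of reward training can be illustrated with a small numerical sketch. It assumes, purely for illustration, a Bradley-Terry (Luce) annotator who prefers item a over item b with probability p(a) / (p(a) + p(b)), where p is the annotator's implicit preference distribution; this is one simple generative process and is not claimed to be the exact family of preference behavior distribution equations studied in the paper. Under that assumption, a reward fitted with the usual logistic pairwise loss should approximate log p up to an additive constant, so a softmax over the learned rewards approximately recovers p. All names below (`fit_reward`, `true_density`, `recovered`) are hypothetical.

```python
import numpy as np

def fit_reward(a, b, labels, n_items, lr=1.0, steps=1000):
    """Fit one scalar reward per item with the logistic pairwise loss.

    labels[i] is 1.0 if item a[i] was preferred over item b[i], else 0.0.
    The model is P(a preferred) = sigmoid(r[a] - r[b]).
    """
    r = np.zeros(n_items)
    for _ in range(steps):
        pred = 1.0 / (1.0 + np.exp(-(r[a] - r[b])))
        g = (pred - labels) / len(labels)  # gradient w.r.t. r[a] - r[b]
        grad = (np.bincount(a, weights=g, minlength=n_items)
                - np.bincount(b, weights=g, minlength=n_items))
        r -= lr * grad
        r -= r.mean()  # rewards are only identified up to a constant
    return r

def softmax(r):
    e = np.exp(r - r.max())
    return e / e.sum()

# Assumed annotator: an implicit preference distribution over 4 items.
rng = np.random.default_rng(0)
true_density = np.array([0.5, 0.3, 0.15, 0.05])
n_items, n_pairs = len(true_density), 50_000

# Sample pairs of distinct items, then simulate the annotator's choices.
a = rng.integers(0, n_items, size=n_pairs)
b = (a + rng.integers(1, n_items, size=n_pairs)) % n_items
p_prefer_a = true_density[a] / (true_density[a] + true_density[b])
labels = (rng.random(n_pairs) < p_prefer_a).astype(float)

rewards = fit_reward(a, b, labels, n_items)
recovered = softmax(rewards)  # estimate of the annotator's distribution
```

In this sketch the learned rewards play the role of unnormalized log-probabilities. If the simulated choices came instead from a mixture of annotators with different distributions while the model still assumed a single annotator, `recovered` would no longer need to match any one of them, which is the kind of misspecification the abstract describes.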