Learning from human feedback (LHF), and in particular learning from pairwise preferences, has recently become a crucial ingredient in training large language models (LLMs) and has been the subject of much research. Most recent works frame it as a reinforcement learning problem: a reward function is learned from pairwise preference data, and the LLM is treated as a policy that is adapted to maximize the rewards, often under additional regularization constraints. We propose an alternative interpretation that centers on the generative process for pairwise preferences and treats LHF as a density estimation problem. We provide theoretical and empirical results showing that, for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution. Finally, we discuss and present findings on "annotator misspecification", that is, failure cases in which wrong modeling assumptions about annotator behavior result in poorly adapted models. These findings suggest that approaches that learn from pairwise human preferences could have trouble learning from a population of annotators with diverse viewpoints.
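The density estimation reading can be illustrated with a small simulation. The sketch below is not the method of this work; it assumes one particular generative process, a Bradley-Terry (Luce) choice rule in which an annotator prefers item i over item j with probability p[i] / (p[i] + p[j]), and all names (`sample_preferences`, `fit_rewards`, `true_p`) are hypothetical. Under that assumption, fitting per-item rewards with the usual logistic loss and taking a softmax of the rewards should approximately recover the annotator's implicit distribution p.

```python
import math
import random


def sample_preferences(p, n_pairs, rng):
    """Draw (winner, loser) index pairs from a simulated annotator.

    Assumption (for illustration only): the annotator prefers item i over
    item j with probability p[i] / (p[i] + p[j]).
    """
    data = []
    k = len(p)
    for _ in range(n_pairs):
        i, j = rng.sample(range(k), 2)  # two distinct items
        if rng.random() < p[i] / (p[i] + p[j]):
            data.append((i, j))
        else:
            data.append((j, i))
    return data


def fit_rewards(data, k, lr=2.0, epochs=100):
    """Fit one scalar reward per item by gradient ascent on the
    mean Bradley-Terry log-likelihood of the observed preferences."""
    r = [0.0] * k
    n = len(data)
    for _ in range(epochs):
        grad = [0.0] * k
        for w, l in data:
            # Model: P(w preferred over l) = sigmoid(r[w] - r[l]).
            s = 1.0 / (1.0 + math.exp(-(r[w] - r[l])))
            grad[w] += 1.0 - s
            grad[l] -= 1.0 - s
        r = [ri + lr * g / n for ri, g in zip(r, grad)]
    return r


def softmax(r):
    """Normalize rewards into a probability distribution."""
    m = max(r)
    e = [math.exp(x - m) for x in r]
    z = sum(e)
    return [x / z for x in e]


rng = random.Random(0)
true_p = [0.5, 0.3, 0.15, 0.05]  # hypothetical implicit preference distribution
data = sample_preferences(true_p, 20000, rng)
est_p = softmax(fit_rewards(data, len(true_p)))
print([round(x, 3) for x in est_p])
```

In this toy setting the rewards are only identified up to an additive constant, which the softmax removes. If the simulated annotator followed a different choice rule than the one assumed by the loss, the recovered distribution would generally be distorted, which is the kind of mismatch the abstract calls annotator misspecification.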