Learning from human feedback (LHF) -- and in particular learning from pairwise preferences -- has recently become a crucial ingredient in training large language models (LLMs), and has been the subject of much research. Most recent works frame it as a reinforcement learning problem, where a reward function is learned from pairwise preference data and the LLM is treated as a policy which is adapted to maximize the rewards, often under additional regularization constraints. We propose an alternative interpretation which centers on the generative process for pairwise preferences and treats LHF as a density estimation problem. We provide theoretical and empirical results showing that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution. Finally, we discuss and present findings on "annotator misspecification" -- failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly-adapted models -- suggesting that approaches that learn from pairwise human preferences could have trouble learning from a population of annotators with diverse viewpoints.