Preference-based alignment methods, most prominently Reinforcement Learning from Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers' own judgments about what the outputs should be. The second is evidence: annotators provide independent evidence about some facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority (as representatives of the broader population) to determine system outputs. I argue that these models have implications for how RLHF pipelines should solicit, validate and aggregate annotations. I survey landmark papers in the literature on RLHF and related methods to illustrate how they implicitly draw on these models, describe failure modes that arise from conflating them, whether unintentionally or intentionally, and offer normative criteria for choosing among them. My central recommendation is that RLHF pipeline designers should decompose annotation into separable dimensions and tailor the pipeline for each dimension to the model most appropriate for it, rather than seeking a single unified pipeline.