Three Models of RLHF Annotation: Extension, Evidence, and Authority

Preference-based alignment methods, most prominently Reinforcement Learning from Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers' own judgments about what the outputs should be. The second is evidence: annotators provide independent evidence about some set of facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority (as representatives of the broader population) to determine system outputs. I argue that the choice of model has implications for how RLHF pipelines should solicit, validate and aggregate annotations. I survey landmark papers in the literature on RLHF and related methods to illustrate how they implicitly draw on these models, describe failure modes that arise from conflating them, whether unintentionally or intentionally, and offer normative criteria for choosing among them. My central recommendation is that RLHF pipeline designers should decompose annotation into separable dimensions and tailor the pipeline for each dimension to the model most appropriate for it, rather than seeking a single unified pipeline.

