Reinforcement Learning from Human Feedback (RLHF) assumes that annotation responses reflect genuine human preferences. They often do not. Behavioral scientists have documented for sixty years that people produce responses without holding genuine opinions, construct preferences on the spot from contextual cues, and interpret identical questions differently. Importantly, these failures are common in the value judgments that matter most for AI alignment. We argue that measurement validity is logically prior to preference aggregation: before asking how to combine annotations, the field must ask whether the responses being combined are preferences at all. We organize annotation responses along a spectrum, from non-attitudes (no signal) to genuine preferences (full signal), and develop diagnostics that locate responses on this spectrum. In two RLHF datasets, we show that inconsistency is systematic and directionally biased. Filtering out high-inconsistency annotators flips the majority harm classification for 18.6% of prompts and shifts mean ratings by more than 13 points on a 100-point scale. Thus, much of current RLHF practice models noise as signal and elicitation artifacts as human values.