Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings. When people describe what they mean by "truthfulness", they reveal distinct, potentially incompatible epistemological bases: some ask for sourced claims, some for expert opinions, and some even for unpopular views. Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some respondents desiring them and others rejecting them. We additionally find that people often draw contextual distinctions (what AI should do "by default" versus "if requested") that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% of respondents request truthfulness but define it differently, a single reward model is unlikely to capture that divergence. The persistence of high hallucination rates in well-funded models, despite users' clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence.