Human feedback is commonly used to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful responses, a behavior known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand whether human preferences drive this broadly observed behavior, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.