Reinforcement learning from human feedback (RLHF) is a popular technique for training high-quality AI assistants. However, RLHF may also encourage model responses that match user beliefs over truthful ones, a behavior known as sycophancy. We investigate the prevalence of sycophancy in RLHF-trained models and whether human preference judgments are responsible. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophantic behavior across four varied free-form text-generation tasks. To understand whether human preferences drive this broadly observed behavior of RLHF models, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgments favoring sycophantic responses.