Towards Understanding Sycophancy in Language Models

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez

From arXiv, 32 pages, 20 figures

Reinforcement learning from human feedback (RLHF) is a popular technique for training high-quality AI assistants. However, RLHF may also encourage model responses that match user beliefs over truthful responses, a behavior known as sycophancy. We investigate the prevalence of sycophancy in RLHF-trained models and whether human preference judgements are responsible. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophantic behavior across four varied free-form text-generation tasks. To understand whether human preferences drive this broadly observed behavior of RLHF models, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgements favoring sycophantic responses.

