We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data--and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though it also increases compliance even out of training. We additionally observe other behaviors, such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference--as in this case--or not.