We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model's hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model's hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.