We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game in which four teams, unaware of the model's hidden objective or its training, investigate it for concerning behaviors and their causes. Three teams successfully uncover the model's hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model's hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.