We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model's hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model's hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.