Alignment audits aim to robustly identify hidden goals in strategic, situationally aware misaligned models. Despite this threat model, existing auditing methods have not been systematically stress-tested against deception strategies. We address this gap by implementing an automatic red-team pipeline that generates deception strategies, in the form of system prompts, tailored to specific white-box and black-box auditing methods. We stress-test assistant prefills, user persona sampling, sparse autoencoders, and token embedding similarity methods against secret-keeping model organisms, and find that the pipeline produces prompts that deceive both the black-box and white-box methods into confident, incorrect guesses. Our results provide the first documented evidence of activation-based strategic deception and suggest that current black-box and white-box methods would not be robust to a sufficiently capable misaligned model.