Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

Alignment audits aim to robustly identify hidden goals from strategic, situationally aware misaligned models. Despite this threat model, existing auditing methods have not been systematically stress-tested against deception strategies. We address this gap, implementing an automatic red-team pipeline that generates deception strategies (in the form of system prompts) tailored to specific white-box and black-box auditing methods. Stress-testing assistant prefills, user persona sampling, sparse autoencoders, and token embedding similarity methods against secret-keeping model organisms, our automatic red-team pipeline finds prompts that deceive both the black-box and white-box methods into confident, incorrect guesses. Our results provide the first documented evidence of activation-based strategic deception, and suggest that current black-box and white-box methods would not be robust to a sufficiently capable misaligned model.

翻译：对齐审计旨在从具有策略意识的情境感知错位模型中稳健地识别隐藏目标。尽管存在这种威胁模型，现有审计方法尚未针对欺骗策略进行系统性压力测试。我们通过构建自动红队管道来填补这一空白，该管道可生成针对特定白盒与黑盒审计方法定制的欺骗策略（以系统提示形式）。通过对助手预填充、用户角色抽样、稀疏自编码器及词元嵌入相似度方法进行压力测试，并采用保密型模型生物体作为测试对象，我们的自动红队管道发现了能够同时欺骗黑盒与白盒方法的提示策略，使其产生确信度高的错误判断。本研究首次提供了基于激活的策略欺骗的实证证据，并表明当前黑盒与白盒方法在面对具备充分能力的错位模型时缺乏稳健性。

相关内容

白盒

关注 0

白盒测试（也称为透明盒测试，玻璃盒测试，透明盒测试和结构测试）是一种软件测试方法，用于测试应用程序的内部结构或功能，而不是其功能（即黑盒测试）。在白盒测试中，系统的内部视角以及编程技能被用来设计测试用例。测试人员选择输入以遍历代码的路径并确定预期的输出。这类似于测试电路中的节点，在线测试（ICT）。白盒测试可以应用于软件测试过程的单元，集成和系统级别。尽管传统的测试人员倾向于将白盒测试视为在单元级别进行的，但如今它已越来越频繁地用于集成和系统测试。它可以测试单元内的路径，集成期间单元之间的路径以及系统级测试期间子系统之间的路径。

DGP双粒度提示框架：图增强大模型助力欺诈检测

专知会员服务

9+阅读 · 2025年8月17日

【ICML2025】层级对齐：在视觉语言模型中检验图像编码器层的安全对齐

专知会员服务

7+阅读 · 2025年5月2日

【ICLR2025】DynaPrompt：动态测试时提示调优

专知会员服务

10+阅读 · 2025年2月2日

伪装目标检测及其扩展的综述

专知会员服务

22+阅读 · 2024年9月1日