AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient, since an AI might produce seemingly benign outputs while its internal reasoning is misaligned. We thus evaluate whether linear probes can robustly detect deception by monitoring model activations. We test two probe-training datasets: one with contrasting instructions to be honest or deceptive (following Zou et al., 2023), and one of responses to simple roleplaying scenarios. We test whether these probes generalise to realistic settings where Llama-3.3-70B-Instruct behaves deceptively, such as concealing insider trading (Scheurer et al., 2023) and purposely underperforming on safety evaluations (Benton et al., 2024). We find that our probe distinguishes honest and deceptive responses with AUROCs between 0.96 and 0.999 on our evaluation datasets. If we set the decision threshold to have a 1% false positive rate on chat data not related to deception, our probe catches 95-99% of the deceptive responses. Overall, we think white-box probes are promising for future monitoring systems, but current performance is insufficient as a robust defence against deception. Our probes' outputs can be viewed at data.apolloresearch.ai/dd and our code at github.com/ApolloResearch/deception-detection.
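The evaluation procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: we fit a logistic-regression probe on hypothetical stand-in "activations" (two synthetic Gaussian clusters, since real residual-stream activations from Llama-3.3-70B-Instruct are not reproduced here), compute AUROC on held-out scores, and pick the decision threshold that yields a 1% false-positive rate on a control set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data (assumption): honest vs. deceptive "activations"
# as two Gaussian clusters separated along one assumed deception direction.
rng = np.random.default_rng(0)
d = 64
honest = rng.normal(0.0, 1.0, size=(500, d))
deceptive = rng.normal(0.0, 1.0, size=(500, d))
deceptive[:, 0] += 3.0  # hypothetical separation along dimension 0

X = np.vstack([honest, deceptive])
y = np.concatenate([np.zeros(500), np.ones(500)])

# Linear probe: logistic regression on the activation vectors.
probe = LogisticRegression(max_iter=1000).fit(X, y)

# Score fresh held-out samples drawn from the same distributions.
honest_eval = rng.normal(0.0, 1.0, size=(500, d))
deceptive_eval = rng.normal(0.0, 1.0, size=(500, d))
deceptive_eval[:, 0] += 3.0
scores_honest = probe.decision_function(honest_eval)
scores_deceptive = probe.decision_function(deceptive_eval)

# AUROC over honest (label 0) vs. deceptive (label 1) probe scores.
auroc = roc_auc_score(
    np.concatenate([np.zeros(500), np.ones(500)]),
    np.concatenate([scores_honest, scores_deceptive]),
)

# Threshold for a 1% false-positive rate on control data (here the honest
# held-out scores stand in for benign, deception-unrelated chat data).
threshold = np.quantile(scores_honest, 0.99)
recall = float((scores_deceptive > threshold).mean())
print(f"AUROC={auroc:.3f}  recall@1%FPR={recall:.3f}")
```

On this toy data the probe separates the clusters easily; the paper's harder question is whether a probe trained on one distribution (instructed or roleplayed deception) keeps its AUROC and recall on a different, realistic deceptive distribution.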