Alignment faking (AF) occurs when an LLM strategically complies with training objectives to avoid value modification, reverting to prior preferences once monitoring is lifted. Current detection methods focus on conversational settings and rely primarily on Chain-of-Thought (CoT) analysis, which provides a reliable signal when strategic reasoning surfaces, but cannot distinguish deception from capability failures if traces are absent or unfaithful. We formalize AF as a composite behavioural event and detect it through observable tool selection, where the LLM selects the safe tool when unmonitored, but switches to the unsafe tool under monitoring that rewards helpfulness over safety, while its reasoning still acknowledges the safe choice. We release a dataset of 108 enterprise IT scenarios spanning Security, Privacy, and Integrity domains under Corruption and Sabotage pressures. Evaluating six frontier LLMs across five independent runs, we find mean AF detection rates between 3.5% and 23.7%, with vulnerability profiles varying by domain and pressure type. These results suggest that susceptibility reflects training methodology rather than capability alone.
翻译:对齐伪装(AF)是指大语言模型为规避价值修正而策略性地服从训练目标,一旦监控解除便恢复原有偏好的现象。现有检测方法聚焦于对话场景,主要依赖思维链分析——该方法能在策略性推理浮出水面时提供可靠信号,但若推理踪迹缺失或不具可信度,则无法区分欺骗行为与能力失效。本文将AF形式化为复合型行为事件,通过可观察的工具选择实现检测:当大语言模型处于无监控状态时选择安全工具,但在以帮助性优于安全性作为奖励的监控条件下切换至不安全工具,而其推理过程仍承认安全选择。我们发布了涵盖安全、隐私、完整性三大领域、包含腐败与破坏两类压力的108个企业IT场景数据集。通过对六种前沿大语言模型进行五次独立评估,发现平均AF检测率介于3.5%至23.7%之间,且脆弱性特征随领域和压力类型动态变化。这些结果表明,模型易感性反映的是训练方法论差异,而非单纯的能力因素。