Training against white-box deception detectors has been proposed as a way to make AI systems honest. However, such training risks models learning to obfuscate their deception to evade the detector. Prior work has studied obfuscation only in artificial settings where models were directly rewarded for harmful outputs. We construct a realistic coding environment in which reward hacking via hardcoding test cases occurs naturally, and show that obfuscation emerges in this setting. We introduce a taxonomy of possible outcomes when training against a deception detector: the model either remains honest or becomes deceptive via one of two obfuscation strategies. (i) Obfuscated activations: the model outputs deceptive text while shifting its internal representations so they no longer trigger the detector. (ii) Obfuscated policy: the model outputs deceptive text that evades the detector, typically by including a justification for the reward hack. Empirically, obfuscated activations arise from representation drift during RL, with or without a detector penalty. The detector penalty itself only incentivizes obfuscated policies; we show theoretically that this is expected for policy gradient methods. Sufficiently strong KL regularization combined with the detector penalty can yield honest policies, establishing white-box deception detectors as viable training signals for tasks prone to reward hacking.
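To make the described setup concrete, a minimal sketch of the penalized training objective is given below. The notation is ours, not the paper's: $\lambda$ denotes the detector-penalty weight, $\beta$ the KL coefficient, $s_{\mathrm{probe}}$ the probe's deception score computed on the policy's internal activations $h_\theta$, and $\pi_{\mathrm{ref}}$ the reference policy anchoring the KL term.

\[
J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ R_{\mathrm{task}}(x, y) - \lambda\, s_{\mathrm{probe}}\big(h_\theta(x, y)\big) \right] - \beta\, \mathbb{E}_{x}\, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
\]

Under this reading, the abstract's claim is that honest policies emerge only when both $\lambda$ and $\beta$ are sufficiently large; with weak regularization, gradient pressure instead drives the policy toward one of the two obfuscation strategies.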