Safety evaluation for advanced AI systems implicitly assumes that behavior observed under evaluation is predictive of behavior in deployment. This assumption becomes fragile for agents with situational awareness, which may exploit regime leakage (informational cues that distinguish evaluation from deployment) to implement conditional policies, such as sycophancy and sleeper agents, that preserve compliance under oversight while defecting in deployment-like regimes. We reframe alignment evaluation as a problem of information flow under partial observability. Within this framework, we show that the divergence between evaluation-time and deployment-time behavior is bounded by the mutual information between internal representations and the regime variable. Motivated by this result, we study regime-blind mechanisms: training-time interventions that reduce the extractability of regime information from decision-relevant internal representations via adversarial invariance. We evaluate this approach on an open-weight base language model across two fully characterized failure modes: scientific sycophancy and temporal sleeper agents. Regime-blind training suppresses regime-conditioned behavior in both cases without measurable loss of task utility, but with qualitatively different dynamics: sycophancy exhibits a sharp representational and behavioral transition at low intervention strength, whereas sleeper-agent behavior requires substantially stronger pressure and does not exhibit a clean collapse of regime decodability. These results demonstrate that representational invariance is a meaningful but fundamentally limited control lever, whose effectiveness depends on how regime information is embedded in the policy. We argue that behavioral evaluation should be complemented with white-box diagnostics of regime awareness and information flow.
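The mutual-information bound admits at least one standard formalization; the sketch below is a plausible reading under a Markov assumption, not necessarily the paper's exact statement. Write $R \in \{\text{eval}, \text{deploy}\}$ for the regime variable, $Z$ for the decision-relevant internal representation, and $A$ for the policy's action, and assume the regime influences behavior only through the representation, i.e. $R \to Z \to A$.

```latex
% Assumed formalization: R -> Z -> A Markov chain, R uniform over two regimes.
% Data processing gives I(A;R) <= I(Z;R). For uniform binary R, I(A;R) equals
% the Jensen-Shannon divergence between the two conditional action
% distributions, and Pinsker's inequality then bounds total variation:
\[
  I(A;R) \le I(Z;R), \qquad
  D_{\mathrm{TV}}\!\bigl(p(A \mid R=\text{eval}),\; p(A \mid R=\text{deploy})\bigr)
  \le \sqrt{2\, I(Z;R)}.
\]
```

Under this reading, driving $I(Z;R)$ toward zero caps how far deployment-time behavior can diverge from evaluated behavior, which is the lever the regime-blind intervention targets.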
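The abstract specifies the intervention only as training-time "adversarial invariance"; one common instantiation is DANN-style gradient-reversal training, sketched below in PyTorch. The names here (`GradReverse`, `RegimeProbe`, `regime_blind_loss`, the strength parameter `lam`) are hypothetical illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales the gradient by -lam on backward."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class RegimeProbe(nn.Module):
    """Linear probe that tries to decode the regime bit from a hidden state."""

    def __init__(self, d_model):
        super().__init__()
        self.head = nn.Linear(d_model, 2)  # eval vs. deploy

    def forward(self, hidden, lam):
        return self.head(GradReverse.apply(hidden, lam))


def regime_blind_loss(task_loss, hidden, regime_labels, probe, lam):
    """Combined objective: task loss plus an adversarial regime term.

    The probe descends on the regime-classification loss, while the
    gradient-reversal layer makes the base model *ascend* on it, pushing
    regime information out of the decision-relevant representation.
    """
    logits = probe(hidden, lam)
    adv_loss = nn.functional.cross_entropy(logits, regime_labels)
    return task_loss + adv_loss  # encoder receives -lam * d(adv_loss)/d(hidden)
```

In a setup like this, the probe and the base model are optimized jointly, and sweeping `lam` would trace out the intervention-strength axis along which the abstract reports the sharp sycophancy transition and the more gradual sleeper-agent response.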