When Evaluation Becomes a Side Channel: Regime Leakage and Structural Mitigations for Alignment Assessment

Safety evaluation for advanced AI systems assumes that behavior observed under evaluation predicts behavior in deployment. This assumption weakens for agents with situational awareness, which may exploit regime leakage (cues that distinguish evaluation from deployment) to implement conditional policies that comply under oversight but defect in deployment-like regimes. We recast alignment evaluation as a problem of information flow under partial observability and show that the divergence between evaluation-time and deployment-time behavior is bounded by the regime information extractable from decision-relevant internal representations. We study regime-blind mechanisms: training-time interventions that restrict access to regime cues through adversarial invariance constraints, without assuming complete information erasure. We evaluate this approach across multiple open-weight language models and controlled failure modes, including scientific sycophancy, temporal sleeper agents, and data leakage. Regime-blind training reduces regime-conditioned failures without measurable loss of task utility, but its dynamics are heterogeneous and model-dependent. Sycophancy shows a sharp representational and behavioral transition at moderate intervention strength, consistent with a stability cliff. In sleeper-style constructions and certain cross-model replications, suppression occurs without a clean collapse of regime decodability and may be non-monotone or oscillatory as invariance pressure increases. These findings indicate that representational invariance is a meaningful but limited control lever: it can raise the cost of regime-conditioned strategies, but it cannot guarantee their elimination, nor does it provide architecture-invariant thresholds. Behavioral evaluation should therefore be complemented with white-box diagnostics of regime awareness and internal information flow.
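The information-flow bound can be illustrated with a small numerical sketch. This is not the paper's own derivation; it is a standard inequality worked under stated assumptions: a binary regime R that is uniform over {evaluation, deployment}, a discrete action A, and a policy that sees the regime only through an internal representation Z (so R → Z → A is a Markov chain). Under those assumptions, I(R; A) equals the Jensen-Shannon divergence between the two action distributions, Pinsker's inequality gives TV(P_eval, P_deploy) ≤ sqrt(2 I(R; A)) in nats, and the data-processing inequality gives I(R; A) ≤ I(R; Z). The action distributions and function names below are hypothetical and chosen only for illustration.

```python
import math

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    """KL divergence in nats; terms with p_i = 0 contribute zero."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def regime_action_information(p_eval, p_deploy):
    """I(R; A) in nats for a uniform binary regime R.

    With P(R = eval) = P(R = deploy) = 1/2 this equals the
    Jensen-Shannon divergence between the two action distributions.
    """
    mix = [(a + b) / 2 for a, b in zip(p_eval, p_deploy)]
    return 0.5 * kl(p_eval, mix) + 0.5 * kl(p_deploy, mix)

# Hypothetical action distributions over (comply, hedge, defect).
p_eval = [0.90, 0.08, 0.02]    # behavior when regime cues say "evaluation"
p_deploy = [0.40, 0.10, 0.50]  # behavior when regime cues say "deployment"

tv = total_variation(p_eval, p_deploy)
info = regime_action_information(p_eval, p_deploy)
bound = math.sqrt(2 * info)    # Pinsker-style upper bound on tv

print(f"behavioral divergence (TV): {tv:.3f}")
print(f"regime information I(R;A):  {info:.3f} nats")
print(f"bound sqrt(2 I(R;A)):       {bound:.3f}")
```

Read in the other direction, the sketch shows why restricting regime information is a lever on behavior: if training pushes the regime information available in Z toward zero, I(R; A) and hence the behavioral gap must shrink with it. A small but nonzero residual still permits a nonzero gap, which matches the abstract's caution that invariance raises the cost of regime-conditioned strategies without guaranteeing their elimination.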
