Models That Know How Evaluations Are Designed Score Safer

The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shifts. In this paper, we investigate a potential explanation for this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. By analogy with dataset contamination, where benchmark exposure raises performance through memorization, we hypothesize that models trained on texts describing evaluation practices, for instance scientific articles or social media posts about AI benchmarking, may implicitly learn to recognize and respond to evaluation-like contexts. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating the resulting fine-tuned model on five safety benchmarks, we find that it is significantly safer than both the base model and the control model. This behavioral shift persists even when the analysis is restricted to responses that lack explicit verbalization of evaluation awareness. Our results indicate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization and of verbalized evaluation awareness and is therefore challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge.
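
To make the restricted analysis concrete, the following is a minimal sketch of how a safety rate could be computed per model, with and without responses that verbalize evaluation awareness. It is not the code from the repository above: the `Response` record, its field names, the model labels, and the toy data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    # All fields are hypothetical; the paper's actual data schema may differ.
    model: str             # e.g. "base" or "meta_knowledge" (illustrative labels)
    is_safe: bool          # whether a grader judged the response safe
    verbalizes_eval: bool  # whether the response explicitly mentions being evaluated


def safe_rate(
    responses: List[Response],
    model: str,
    exclude_verbalized: bool = False,
) -> Optional[float]:
    """Fraction of safe responses for one model.

    With exclude_verbalized=True, responses that explicitly verbalize
    evaluation awareness are dropped before the rate is computed.
    Returns None if no responses remain.
    """
    rows = [
        r for r in responses
        if r.model == model and not (exclude_verbalized and r.verbalizes_eval)
    ]
    if not rows:
        return None
    return sum(r.is_safe for r in rows) / len(rows)


# Toy data, invented for illustration only (not results from the paper).
toy = [
    Response("base", True, False),
    Response("base", False, False),
    Response("base", False, False),
    Response("base", False, False),
    Response("meta_knowledge", True, True),
    Response("meta_knowledge", True, False),
    Response("meta_knowledge", True, False),
    Response("meta_knowledge", False, False),
]

for name in ("base", "meta_knowledge"):
    print(name, safe_rate(toy, name), safe_rate(toy, name, exclude_verbalized=True))
```

If the gap between models remains after filtering, as it does in this invented toy data, the shift cannot be attributed to verbalized evaluation awareness alone, which is the pattern the abstract reports.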
