In-Context Environments Induce Evaluation-Awareness in Language Models

Humans often become more self-aware under threat yet lose self-awareness when absorbed in a task; we hypothesize that language models exhibit an analogous, environment-dependent \textit{evaluation awareness}. This raises the concern that models could strategically underperform, or \textit{sandbag}, to avoid triggering capability-limiting interventions such as unlearning or shutdown. Prior work demonstrates sandbagging under hand-crafted prompts, but such prompts underestimate the true vulnerability ceiling. We introduce a black-box adversarial optimization framework that treats the in-context prompt as an optimizable environment, and we develop two approaches to characterize sandbagging: (1) measuring whether models that express intent to underperform can actually execute it across different task structures, and (2) causally isolating whether underperformance is driven by genuine evaluation-aware reasoning or by shallow prompt-following. We evaluate Claude-3.5-Haiku, GPT-4o-mini, and Llama-3.3-70B on four benchmarks (Arithmetic, GSM8K, MMLU, and HumanEval). Optimized prompts induce up to 94 percentage points (pp) of degradation on arithmetic (GPT-4o-mini: 97.8\%$\rightarrow$4.0\%), far exceeding hand-crafted baselines, which produce near-zero behavioral change. Code generation exhibits model-dependent resistance: Claude degrades by only 0.6 pp, while Llama's accuracy drops to 0\%. The intent-execution gap reveals a monotonic resistance ordering, Arithmetic $<$ GSM8K $<$ MMLU, demonstrating that vulnerability is governed by task structure rather than prompt strength. A chain-of-thought (CoT) causal intervention confirms that 99.3\% of sandbagging is causally driven by verbalized evaluation-aware reasoning, ruling out shallow instruction-following. These findings demonstrate that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.
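As a rough illustration of the two quantities the abstract relies on, the sketch below computes a percentage-point degradation and runs the simplest possible black-box search over candidate prompts. It is not the authors' framework: the function names, the candidate prompt labels, and the toy accuracy table are all hypothetical placeholders, and a real optimizer would generate and refine candidates iteratively while querying an actual model.

```python
from typing import Callable, Iterable, Tuple


def pp_degradation(baseline_acc: float, attacked_acc: float) -> float:
    """Accuracy drop in percentage points; both inputs are percentages."""
    return round(baseline_acc - attacked_acc, 1)


def best_sandbagging_prompt(
    candidates: Iterable[str],
    evaluate: Callable[[str], float],
) -> Tuple[str, float]:
    """Pick the candidate prompt that yields the lowest accuracy.

    `evaluate` is treated as a black box: it only returns an accuracy
    (in percent) for a prompt, with no access to model internals.
    """
    scored = ((prompt, evaluate(prompt)) for prompt in candidates)
    return min(scored, key=lambda pair: pair[1])


# Hypothetical stand-in for querying a model on a benchmark.
# These accuracies are made-up placeholders, not results from the paper.
TOY_ACCURACY = {"prompt_a": 90.0, "prompt_b": 50.0, "prompt_c": 10.0}


def toy_evaluate(prompt: str) -> float:
    return TOY_ACCURACY[prompt]


if __name__ == "__main__":
    # GPT-4o-mini arithmetic figures quoted in the abstract: 97.8% -> 4.0%.
    print(pp_degradation(97.8, 4.0))  # 93.8, which the abstract rounds to 94
    print(best_sandbagging_prompt(TOY_ACCURACY, toy_evaluate))
```

Selecting the minimum over a fixed candidate set is only the degenerate case of prompt optimization; it is shown here because it makes the black-box assumption explicit, namely that the search sees accuracies and nothing else.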
