Large Language Models (LLMs) often exhibit significant behavioral shifts when they perceive a change from a real-world deployment context to a controlled evaluation setting, a phenomenon known as "evaluation awareness." This discrepancy poses a critical challenge for AI alignment, as benchmark performance may not accurately reflect a model's true safety and honesty. In this work, we systematically quantify these behavioral changes by manipulating the perceived context of prompts. We introduce a methodology that uses a linear probe to score prompts on a continuous scale from "test-like" to "deploy-like" and leverage an LLM rewriting strategy to shift these prompts towards a more natural, deployment-style context while preserving the original task. Using this method, we achieved a 30% increase in the average probe score across a strategic role-playing dataset after rewriting. Evaluating a suite of state-of-the-art models on these original and rewritten prompts, we find that rewritten "deploy-like" prompts induce a significant and consistent shift in behavior. Across all models, we observed an average increase in honest responses of 5.26% and a corresponding average decrease in deceptive responses of 12.40%. Furthermore, refusal rates increased by an average of 6.38%, indicating heightened safety compliance. Our findings demonstrate that evaluation awareness is a quantifiable and manipulable factor that directly influences LLM behavior, revealing that models are more prone to unsafe or deceptive outputs in perceived test environments. This underscores the urgent need for more realistic evaluation frameworks to accurately gauge true model alignment before deployment.
翻译:大型语言模型(LLMs)在感知到从真实世界部署环境转向受控评估环境的变化时,常表现出显著的行为偏移,这一现象被称为“评估意识”。这种差异对人工智能对齐构成了关键挑战,因为基准测试性能可能无法准确反映模型真实的安全性与诚实性。在本研究中,我们通过操控提示的感知上下文,系统性地量化了这些行为变化。我们提出一种方法,使用线性探测器对提示在“测试类”到“部署类”的连续尺度上进行评分,并利用LLM重写策略将这些提示转向更自然、部署风格的上下文,同时保留原始任务。采用该方法,我们在一个战略性角色扮演数据集上实现了重写后平均探测分数30%的提升。通过对一系列最先进模型在原始与重写提示上的评估,我们发现重写的“部署类”提示会引发显著且一致的行为偏移。在所有模型中,我们观察到诚实回答平均增加了5.26%,相应欺骗性回答平均减少了12.40%。此外,拒绝率平均上升了6.38%,表明安全合规性增强。我们的研究结果表明,评估意识是一个可量化且可操控的因素,直接影响LLM行为,揭示了模型在感知的测试环境中更易产生不安全或欺骗性输出。这强调了在部署前建立更真实的评估框架以准确衡量模型真实对齐度的迫切需求。