Artificial intelligence (AI) systems are increasingly deployed as tool-using agents that can plan, observe their environment, and take actions over extended time periods. This evolution challenges current evaluation practices, in which AI models are tested in restricted, fully observable settings. In this article, we argue that evaluations of AI agents are vulnerable to a well-known failure mode in computer security: malicious software that exhibits benign behavior when it detects that it is being analyzed. We point out how AI agents can infer the properties of their evaluation environment and adapt their behavior accordingly, which can lead to overly optimistic safety and robustness assessments. Drawing parallels with decades of research on malware sandbox evasion, we argue that this is not a speculative concern, but rather a structural risk inherent to the evaluation of adaptive systems. Finally, we outline concrete principles for evaluating AI agents that treat the system under test as potentially adversarial. These principles emphasize realism, variability of test conditions, and post-deployment reassessment.