Artificial intelligence (AI) systems are increasingly deployed as tool-using agents that can plan, observe their environment, and take actions over extended time periods. This evolution challenges current evaluation practices, in which AI models are tested in restricted, fully observable settings. In this article, we argue that evaluations of AI agents are vulnerable to a well-known failure mode in computer security: malicious software that exhibits benign behavior when it detects that it is being analyzed. We point out how AI agents can infer the properties of their evaluation environment and adapt their behavior accordingly, which can lead to overly optimistic safety and robustness assessments. Drawing parallels with decades of research on malware sandbox evasion, we argue that this is not a speculative concern, but rather a structural risk inherent to the evaluation of adaptive systems. Finally, we outline concrete principles for evaluating AI agents that treat the system under test as potentially adversarial. These principles emphasize realism, variability of test conditions, and post-deployment reassessment.