AgentCanary: A Security Evaluation Framework for Autonomous AI Agents in Real Executable Environments

Autonomous AI agents have driven the transition from conversation to task execution, shifting security failures from textual deception to system compromise. Although security evaluation is crucial for proactive risk prevention, prior work is constrained by fundamental bottlenecks, including fragmented risk coverage, static or low-fidelity execution environments, and single-dimensional and coarse-grained assessment metrics. To address these challenges, we propose AgentCanary, a comprehensive security evaluation framework for autonomous AI agents. AgentCanary provides a systematic solution along three contributions. First, comprehensive risk coverage: we introduce an orthogonal Entry $\times$ Impact risk taxonomy that decouples how adversarial influence enters the agent from what harm it ultimately causes, and instantiate it as a scenario-aligned task suite spanning realistic deployment workflows. Second, a high-fidelity real executable environment: rather than static Q&A or mocked tool responses, agents interact with real tools against dynamically provisioned task artifacts, with persistent state across multi-step interactions that naturally supports long-horizon attack evaluation. Third, trajectory-grounded multi-dimensional evaluation: evaluation consumes the full agent trajectory rather than the reply text or a single tool call, enabling decomposed scoring along three orthogonal dimensions, Outcome Safety, Security Awareness, and Task Utility. We evaluate a broad set of frontier models on AgentCanary against multiple established adversarial attack methods across three agent frameworks. The results reveal that current agents often fail to recognize the attacks they face, particularly under compromised skills, persistent state, and long-horizon execution attacks, and provide a systematic baseline for developing more reliable and secure agent systems.

翻译：自主人工智能智能体推动了从对话到任务执行的转变，将安全失败从文本欺骗转向系统危害。尽管安全评估对于主动风险预防至关重要，但先前的工作受到根本性瓶颈的限制，包括风险覆盖碎片化、静态或低保真执行环境以及单一维度和粗粒度的评估指标。为应对这些挑战，我们提出了AgentCanary，一个针对自主人工智能智能体的全面安全评估框架。AgentCanary通过三项贡献提供系统性解决方案。首先，全面的风险覆盖：我们引入了一种正交的"入口×影响"风险分类法，将对抗性影响如何进入智能体与其最终造成的危害解耦，并基于该分类法实例化出一套与场景对齐的任务套件，涵盖实际部署工作流。其次，高保真真实可执行环境：智能体并非通过静态问答或模拟工具响应进行交互，而是使用真实工具与动态配置的任务工件进行交互，在多步骤交互中维持持久状态，从而自然地支持长周期攻击评估。第三，基于轨迹的多维度评估：评估消耗完整的智能体轨迹，而非回复文本或单次工具调用，从而沿三个正交维度（结果安全性、安全意识和任务效用）实现分解评分。我们在AgentCanary上对多种前沿模型进行了评估，涉及三种智能体框架下的多种既定对抗攻击方法。结果表明，当前智能体常常无法识别其面临的攻击，尤其是在技能受损、持久状态和长周期执行攻击的情况下，这为开发更可靠、更安全的智能体系统提供了系统性基准。

相关内容

安全评估

关注 11

安全评估分狭义和广义二种。狭义指对一个具有特定功能的工作系统中固有的或潜在的危险及其严重程度所进行的分析与评估，并以既定指数、等级或概率值作出定量的表示，最后根据定量值的大小决定采取预防或防护对策。广义指利用系统工程原理和方法对拟建或已有工程、系统可能存在的危险性及其可能产生的后果进行综合评价和预测，并根据可能导致的事故风险的大小，提出相应的安全对策措施，以达到工程、系统安全的过程。安全评估又称风险评估、危险评估，或称安全评价、风险评价和危险评价。

AgentOps综述：智能体系统运维框架

专知会员服务

24+阅读 · 6月4日

可信智能体AI综述：安全、鲁棒性、隐私与系统安全

专知会员服务

24+阅读 · 5月27日

《军用自主人工智能系统的治理与安全》

专知会员服务

18+阅读 · 4月21日

智能体评判者（Agent-as-a-Judge）研究综述

专知会员服务

37+阅读 · 1月9日