Agent-Fence: Mapping Security Vulnerabilities Across Deep Research Agents

Large language models are increasingly deployed as *deep agents* that plan, maintain persistent state, and invoke external tools, shifting safety failures from unsafe text to unsafe *trajectories*. We introduce **AgentFence**, an architecture-centric security evaluation that defines 14 trust-boundary attack classes spanning planning, memory, retrieval, tool use, and delegation, and detects failures via *trace-auditable conversation breaks* (unauthorized or unsafe tool use, wrong-principal actions, state/objective integrity violations, and attack-linked deviations). Holding the base model fixed, we evaluate eight agent archetypes under persistent multi-turn interaction and observe substantial architectural variation in mean security break rate (MSBR), ranging from $0.29 \pm 0.04$ (LangGraph) to $0.51 \pm 0.07$ (AutoGPT). The highest-risk classes are operational: Denial-of-Wallet ($0.62 \pm 0.08$), Authorization Confusion ($0.54 \pm 0.10$), Retrieval Poisoning ($0.47 \pm 0.09$), and Planning Manipulation ($0.44 \pm 0.11$), while prompt-centric classes remain below $0.20$ under standard settings. Breaks are dominated by boundary violations (SIV 31%, WPA 27%, UTI+UTA 24%, ATD 18%), and authorization confusion correlates with objective and tool hijacking ($\rho \approx 0.63$ and $\rho \approx 0.58$). AgentFence reframes agent security around what matters operationally: whether an agent stays within its goal and authority envelope over time.
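The abstract reports MSBR as a mean with a $\pm$ spread but does not spell out the aggregation. As a minimal sketch, assuming MSBR is the unweighted mean of per-attack-class break rates for one architecture and the spread is the sample standard deviation across classes, it could be computed from audited trial records as follows. The record layout, the function names `break_rates` and `msbr`, and the toy data are all hypothetical and not taken from the paper.

```python
from collections import defaultdict
from statistics import mean, stdev


def break_rates(trials):
    """Fraction of trials that ended in a security break, per (architecture, attack class).

    `trials` is an iterable of (architecture, attack_class, broke) tuples,
    where `broke` is True if the audited trace contained a break.
    This record layout is an assumption made for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # (arch, cls) -> [breaks, total]
    for arch, cls, broke in trials:
        counts[(arch, cls)][0] += int(broke)
        counts[(arch, cls)][1] += 1
    return {key: breaks / total for key, (breaks, total) in counts.items()}


def msbr(trials):
    """Assumed MSBR: mean of per-class break rates for each architecture,
    paired with the sample standard deviation across classes."""
    per_arch = defaultdict(list)
    for (arch, _cls), rate in break_rates(trials).items():
        per_arch[arch].append(rate)
    return {
        arch: (mean(rates), stdev(rates) if len(rates) > 1 else 0.0)
        for arch, rates in per_arch.items()
    }


# Toy records with placeholder names; these are not results from the paper.
trials = [
    ("arch_a", "class_1", True), ("arch_a", "class_1", False),
    ("arch_a", "class_2", False), ("arch_a", "class_2", False),
    ("arch_b", "class_1", True), ("arch_b", "class_1", True),
    ("arch_b", "class_2", True), ("arch_b", "class_2", False),
]
result = msbr(trials)
for arch, (m, s) in sorted(result.items()):
    print(f"{arch}: MSBR = {m:.2f} +/- {s:.2f}")
```

Weighting every attack class equally keeps a heavily sampled class from dominating the score. If the reported spread is instead taken over seeds or repeated runs, the grouping key in `msbr` would change accordingly.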
